Nekonomicon - MPlayer 1.0pre3 tardist

lewis
From london
Who joined Nov. 27, 2003, 12:30 p.m.

Wrote the following at April 1, 2004, 6 a.m...

The actual glDrawPixels() call is not exactly fast. Without the color matrix stuff it's not so bad, but with it, it gets many times slower. Wish I had VPro graphics with their asynchronous pixel writes :(

But yes, it's those loops which are the big problem. I've tried to look at how mplayer's software scaler does the packing conversion but it's really evil code with macros all over the place, and I can't understand it even after preprocessing.

Thanks for the help Matthias, I'll try those changes and get back to you.

Squeen, I am using system scope pthreads, or at least both CPUs are being used according to gr_osview. I don't even have capabilities turned on... I thought pthreads did that by default? I did look at IRIX's sproc() and shared arena stuff, but I don't see that it would be any faster - pthreads share the same memory too. And I must emphasise that mplayer is really not designed for this, it likes its output plugins to be all serialized so it can order and time stuff properly. I would imagine that as long as the drawing can keep up with the codec it would work acceptably well, but I do see some tearing which doesn't happen with X11.

BTW using glPixelZoom() for zooming is /much/ faster than either the software or the SDL scaler.

While I'm here, why can't I get pixie to run on anything linked with IRIX's libc? It complains about, umm, something arcane to do with code blocks. Works fine on the o32 libc, but that ain't much use :) Profiling would really be a help here...

Brombear
From Frankfurt (Rhein-Main Area) / Germany
Who joined Oct. 5, 2003, 8:42 a.m.

Wrote the following at April 1, 2004, 6:19 a.m...

I'll try to enhance it from my head a bit more, so don't be surprised if it doesn't work out of the box. For sure there are better ways to optimize it with macros and stuff, but I think this should make some difference.

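Code:

   int i, j;
   int ixPixel = 0;    /* index into the Y plane */
   int ixPixel2 = 0;   /* start of the current chroma row */
   int ixDest = 0;     /* index into the packed output */

   for(i = 0; i < 480; ++i)
   {
      for(j = 0; j < 720; ++j)
      {
         int ixPixel3 = ixPixel2 + j/2;
         packed[ixDest++] = Y[ixPixel];
         packed[ixDest++] = U[ixPixel3];
         packed[ixDest++] = V[ixPixel3];
         ++ixPixel;
      }
      /* assuming 4:2:0 data: chroma rows are 360 wide and shared by two luma rows */
      if(i & 1)
         ixPixel2 += 360;
   }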

Life is what happens while we are making other plans

squeen
Moderator
From Maryland, USA
Who joined May 9, 2003, 6:10 a.m.

Wrote the following at April 1, 2004, 8:07 a.m...

OK I wasn't sure about the capabilities for SYSTEM scope threads, I was also using REACT when I enabled them. It must have been the CPU isolation that needed special user privileges.

In order to get a performance boost with pthreads, I would think you'd need to run two "worker" threads that wait for frame info to process. For example, your loops could be divided in half, with each thread being parceled out one half. The threads would run in an infinite loop and wait on a mutex or some other synchronization flag that tells them new data has arrived. The "master" thread would then cue both "workers" when data is in the buffer. This is similar to what the loop #pragmas do in OpenMP.

Alternatively, you could buffer the frames and have one thread work on the next frame and another work on the frame after that.
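A minimal sketch of the first scheme, for illustration only. It uses POSIX semaphores as the "synchronization flag", which is one choice among several, and `process_rows()` is a hypothetical stand-in for the real packing loop. None of the names below come from MPlayer.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdint.h>

enum { NWORKERS = 2 };

struct worker {
    pthread_t tid;
    sem_t start;             /* posted by the master when a frame is ready */
    sem_t done;              /* posted by the worker when its rows are finished */
    int row_begin, row_end;  /* half-open range of rows to process */
    int quit;
};

static struct worker workers[NWORKERS];
static const uint8_t *src;   /* current input frame */
static uint8_t *dst;         /* current output frame */
static int frame_width;

/* Hypothetical stand-in for the packing loop: just copies and adds one. */
static void process_rows(int begin, int end)
{
    for (int i = begin; i < end; i++)
        for (int j = 0; j < frame_width; j++)
            dst[i * frame_width + j] = (uint8_t)(src[i * frame_width + j] + 1);
}

/* Each worker sleeps until the master posts its start semaphore. */
static void *worker_main(void *arg)
{
    struct worker *w = arg;
    for (;;) {
        sem_wait(&w->start);
        if (w->quit)
            break;
        process_rows(w->row_begin, w->row_end);
        sem_post(&w->done);
    }
    return NULL;
}

int workers_start(void)
{
    for (int k = 0; k < NWORKERS; k++) {
        struct worker *w = &workers[k];
        w->quit = 0;
        if (sem_init(&w->start, 0, 0) != 0 || sem_init(&w->done, 0, 0) != 0)
            return -1;
        if (pthread_create(&w->tid, NULL, worker_main, w) != 0)
            return -1;
    }
    return 0;
}

/* Master side: hand each worker its share of the rows, then wait for all. */
void process_frame(const uint8_t *in, uint8_t *out, int width, int height)
{
    src = in;
    dst = out;
    frame_width = width;
    for (int k = 0; k < NWORKERS; k++) {
        workers[k].row_begin = height * k / NWORKERS;
        workers[k].row_end = height * (k + 1) / NWORKERS;
        sem_post(&workers[k].start);
    }
    for (int k = 0; k < NWORKERS; k++)
        sem_wait(&workers[k].done);
}

void workers_stop(void)
{
    for (int k = 0; k < NWORKERS; k++) {
        workers[k].quit = 1;
        sem_post(&workers[k].start);
        pthread_join(workers[k].tid, NULL);
        sem_destroy(&workers[k].start);
        sem_destroy(&workers[k].done);
    }
}
```

The workers only run in parallel if the threads can actually be scheduled on separate CPUs, which depends on the thread scope discussed elsewhere in this thread.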

Just some thoughts. I wish you the best of luck with this.

Last comment: You know you've parallelized all you can when BOTH CPUs are pegged to the limit, otherwise it's just IRIX handing the job back and forth between the CPUs at each system interrupt.

I use the following .grosview setting and it helps make the difference clear:

Code:

   opt arbsize
   opt interval(1)
   cpu strip samples(30)
   rmem max tracksum
   wait

lewis
From london
Who joined Nov. 27, 2003, 12:30 p.m.

Wrote the following at April 1, 2004, 10:12 a.m...

Brombear, I tried your code (with a couple of changes added so it actually worked :)), and it made no difference whatsoever. What you did makes sense but I reckon gcc does that kind of optimization pretty well, expressions evaluating to constants inside loops and whatnot. I dare say it reorganizes loops to run downwards when it can because that made no difference either.

From crudely commenting bits out it seems the packing takes about 40% of a single 300MHz R12k, drawing and swapping with no colourspace conversion about 15%, and drawing with colourspace conversion about 60%. So on bits where the codec is chucking out frames at full speed it can't keep up.
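A rough tally of those figures, assuming the shares simply add up on the CPU that does the output work (that additivity is an assumption made here for illustration, not something measured in the thread). The function and constant names are hypothetical.

```c
/* Shares are in percent of one CPU, taken from the estimates above. */
enum {
    PACKING_PCT       = 40,  /* planar-to-packed conversion */
    DRAW_PLAIN_PCT    = 15,  /* draw + swap, no colourspace conversion */
    DRAW_COLOURSP_PCT = 60   /* draw + swap with colourspace conversion */
};

/* Total load of the output thread, assuming the stages run back to back. */
int output_load_pct(int packing_pct, int drawing_pct)
{
    return packing_pct + drawing_pct;
}

/* Headroom left on that CPU; zero or less means there is no slack at all. */
int headroom_pct(int load_pct)
{
    return 100 - load_pct;
}
```

With colourspace conversion the tally comes to 40 + 60 = 100, so there is no slack left on that CPU, while without it the same sum is 55.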

Squeen, both CPUs are most assuredly absolutely pegged most of the time :) Why would having more than one drawing thread make any difference on a dual machine? The first CPU is occupied entirely with the MPEG2 codec... if I had an Origin, then yeah, but I think that can wait :)

Brombear
From Frankfurt (Rhein-Main Area) / Germany
Who joined Oct. 5, 2003, 8:42 a.m.

Wrote the following at April 1, 2004, 10:01 p.m...

Sorry to hear that it didn't work out.

I never trust compilers :twisted:

Matthias


hamei
From over the rainbow
Who joined Feb. 24, 2004, 4:10 p.m.

Wrote the following at April 1, 2004, 11:47 p.m...

lewis wrote: Why would having more than one drawing thread make any difference on a dual machine? The first CPU is occupied entirely with the MPEG2 codec... if I had an Origin, then yeah, but I think that can wait

I think you need to know more about how the scheduler works to really say. I'm not up on SGI multi-processing but in OS/2 (very thread-oriented system) the scheduler makes all those decisions in a round-robin fashion. As any task gets completed its cpu becomes available. The scheduler hands the next thread to the first available processor. Your computer DOES have more going on than this one process. You can't really say "one cpu is occupied entirely with" by any stretch of the imagination, unless that task was particularly written to do that AND the operating system would let you HAVE that much priority.

I'd be real surprised if it did. OS/2 certainly won't let you. You're in applications, not systems level ....

chervarium
From Sofia, BG, EU
Who joined Jan. 9, 2004, 4:02 a.m.

Wrote the following at April 2, 2004, 3:13 p.m...

hamei wrote:

lewis wrote: Why would having more than one drawing thread make any difference on a dual machine? The first CPU is occupied entirely with the MPEG2 codec... if I had an Origin, then yeah, but I think that can wait :)

I think you need to know more about how the scheduler works to really say. I'm not up on SGI multi-processing but in OS/2 (very thread-oriented system) the scheduler makes all those decisions in a round-robin fashion. As any task gets completed its cpu becomes available. The scheduler hands the next thread to the first available processor. Your computer DOES have more going on than this one process. You can't really say "one cpu is occupied entirely with" by any stretch of the imagination, unless that task was particularly written to do that AND the operating system would let you HAVE that much priority.

I'd be real surprised if it did. OS/2 certainly won't let you. You're in applications, not systems level ....

To hamei:
IRIX has a syscall for exactly these things, namely sysmp(2), which gives fine-grained control. On a multiprocessor system you can do a lot with processes and CPUs, for example restricting a processor from running any processes except those assigned to it, isolating a CPU, disabling the scheduler interrupt for a given processor (thus granting realtime priority), etc.

To lewis:
I am with you captain

LAMMEN GORTHAUR

chervarium
From Sofia, BG, EU
Who joined Jan. 9, 2004, 4:02 a.m.

Wrote the following at April 3, 2004, 5:38 a.m...

lewis, would you try this approach, I believe some of the bottlenecks are cleared
(notice1: code is edited since its first version! Reread)
(notice2: R10k/R12k pipeline)

Code:

   /* EDITED! */
   #include <sys/types.h>
   #include <semaphore.h>
   #include <GL/gl.h>
   #include <GL/glx.h>

   ...

   static int indices1[480], indices2[480];

   ...

   void indices_init() {
      int i;

      for(i = 0; i < 480; i++) {
         indices1[i] = i*720;        /* start of luma row i */
         indices2[i] = (i>>1)*360;   /* start of the chroma row shared by rows i and i^1 */
      }
      return;
   }

   ...

   register int idx_i1, idx_i2, idx_i22, i, j;
   static u_int32_t packed[480][720];

   ...

   for(;;) {
      sem_wait(&sem);
      for(i = 0; i < 480; i++) {
         idx_i1 = indices1[i];
         idx_i2 = indices2[i];
         /* parentheses needed: + binds tighter than >> in C */
         for(j = 0, idx_i22 = idx_i2; j < 720; idx_i22 = idx_i2 + ((++j)>>1))
            packed[i][j] =
               (Y[idx_i1 + j]) |
               (U[idx_i22] << 8) |
               (V[idx_i22] << 16);
         /* the R, G and B components may be wrong, I am not into GL so much */
      }
      glDrawPixels(720, 480, GL_RGBA, GL_UNSIGNED_INT_8_8_8_8, packed);
      glXSwapBuffers(dpy, glxWin);
   }


lewis
From london
Who joined Nov. 27, 2003, 12:30 p.m.

Wrote the following at April 7, 2004, 1:46 p.m...

I tried your code before you edited it, and... nothing changed. I'm losing sight of what exactly is going on here. What you did makes a lot of sense, I really should have thought of using only one 32bit memory store like that :) Although you do have the bytes back to front. GL_ABGR_EXT is your friend. What do you mean by "R10k pipeline"? Are you relying on speculative memory fetches or something?
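To make the byte-order remark concrete, here is a small sketch that needs no GL at all. It relies on one fact about packed pixel types: with GL_UNSIGNED_INT_8_8_8_8 the first component of the format sits in the most significant byte of the word. The helper names are hypothetical and the alpha value of 0xFF is an arbitrary choice.

```c
#include <stdint.h>

/* Layout used in the posted loop: Y in the lowest byte, then U, then V. */
uint32_t pack_low_first(uint8_t y, uint8_t u, uint8_t v)
{
    return (uint32_t)y | ((uint32_t)u << 8) | ((uint32_t)v << 16);
}

/* Layout that GL_RGBA + GL_UNSIGNED_INT_8_8_8_8 reads as R=Y, G=U, B=V. */
uint32_t pack_rgba_8888(uint8_t y, uint8_t u, uint8_t v)
{
    return ((uint32_t)y << 24) | ((uint32_t)u << 16) | ((uint32_t)v << 8) | 0xFFu;
}

/* Split a word the way a four-component 8_8_8_8 packed type is read:
   first component from bits 31..24, last from bits 7..0. */
void unpack_8888(uint32_t word, uint8_t comp[4])
{
    comp[0] = (uint8_t)(word >> 24);
    comp[1] = (uint8_t)(word >> 16);
    comp[2] = (uint8_t)(word >> 8);
    comp[3] = (uint8_t)word;
}
```

Read as GL_RGBA, the first layout puts zero in R and Y in A. Read as GL_ABGR_EXT, the same word gives A=0, B=V, G=U, R=Y, so either changing the format or changing the shifts should line the channels up.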

I think something odd must be happening because nothing I do changes how much CPU time is used. Plus, it uses a lot more CPU when playing VOBs than when playing raw stuff, which seems back to front. I dunno. I'd post my modified source but it's huge and I'm on dialup here - maybe I could do diffs to the specific CVS version I have.

I think it's fair to say that IRIX's SMP is rather more advanced than OS/2's, y'know :) There doesn't seem to be any job swapping going on apart from the usual every-couple-of-seconds-measurement-artifact thing.

nvukovlj
From London, UK
Who joined June 9, 2003, 8:27 a.m.

Wrote the following at April 7, 2004, 2:01 p.m...

Ok, this may have very little to do with anything, but a couple of questions...

I see you're using semaphores - are they being used just as binary semaphores, or can the value go above 1?
Also, is the semaphore meant to be used across processes, or only within a process? Currently it is being initialised as a semaphore that can be shared amongst processes.

BTW, if the semaphore is binary, usually a mutex is quicker than a semaphore.
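For reference, a sketch of a process-private semaphore used in a binary fashion. The second argument of POSIX sem_init() is the "pshared" flag: zero keeps the semaphore private to the threads of one process. The wrapper names are hypothetical, and the "post only if not already pending" trick assumes a single producer thread.

```c
#include <semaphore.h>

/* Initialise a "frame ready" semaphore that is NOT shared across processes. */
int frame_sem_init(sem_t *sem)
{
    /* pshared = 0: visible only to threads of this process; initial value 0 */
    return sem_init(sem, 0, 0);
}

/* Signal a new frame without letting the count grow beyond 1.
   Only safe with a single producer; consumers can only lower the value. */
int frame_sem_signal(sem_t *sem)
{
    int value;

    if (sem_getvalue(sem, &value) != 0)
        return -1;
    if (value > 0)
        return 0;            /* a frame is already pending */
    return sem_post(sem);
}
```

Whether a mutex (with a condition variable for the wake-up) would actually be quicker than this on IRIX is not something this sketch can show.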

Regards,

Nik.

chervarium
From Sofia, BG, EU
Who joined Jan. 9, 2004, 4:02 a.m.

Wrote the following at April 7, 2004, 4:12 p.m...

lewis wrote: I tried your code before you edited it, and... nothing changed. I'm losing sight of what exactly is going on here. What you did makes a lot of sense, I really should have thought of using only one 32bit memory store like that :) Although you do have the bytes back to front. GL_ABGR_EXT is your friend. What do you mean by "R10k pipeline"? Are you relying on speculative memory fetches or something?
...

I wrote this with several principles in mind. First, lb/sb are slow instructions and stall instruction dispatch badly, so it is better to store one word after three fetches than to store individual bytes. Second, use shifts and masks instead of ordinary arithmetic for this kind of indexing. It makes spaghetti code but is slightly faster. Third, watch the assembly output of the compiler, realign or change the C code, and give the compiler some hints (I like to keep it portable). After some tweaking I got the compiler to emit the prefetches and the instruction pipelining I wanted (hence the "R10k/R12k pipeline" note). The final version (the one here) worked well: it ran ~5 times faster than the first two versions (10000 frames, ~5m at 30fps, in a stable 1m55s against a stable 9m).
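A self-contained sketch of the "one word store per pixel" idea, for readers who want to try it outside MPlayer. It assumes planar 4:2:0 input with an even width and uses the same Y | U<<8 | V<<16 layout as the posted loop. The function name and parameters are made up for this example, and whether it beats byte stores depends on the CPU and compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack planar 4:2:0 YUV into one 32-bit word per pixel.
   Chroma planes are (width/2) wide and (height/2) tall. */
void pack_words(const uint8_t *Y, const uint8_t *U, const uint8_t *V,
                uint32_t *out, int width, int height)
{
    for (int i = 0; i < height; i++) {
        const uint8_t *y = Y + (size_t)i * width;
        const uint8_t *u = U + (size_t)(i >> 1) * (width >> 1);
        const uint8_t *v = V + (size_t)(i >> 1) * (width >> 1);
        uint32_t *dst = out + (size_t)i * width;

        for (int j = 0; j < width; j++) {
            int c = j >> 1;  /* two luma samples share one chroma sample */
            /* three byte loads, one word store */
            dst[j] = (uint32_t)y[j]
                   | ((uint32_t)u[c] << 8)
                   | ((uint32_t)v[c] << 16);
        }
    }
}
```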

Looks like the bottleneck might then be in the GL libraries themselves... I don't have your source tree, but if you don't mind sharing it I would do some profiling. I've got interested in this.

Btw, I am building with MIPSpro, which has a better optimiser.

nvukovlj wrote: I see you're using semaphores - are they being used just as binary semaphores, or can the value go above 1 ?
Also, is the semaphore being created meant to be used across processes, or within a process, as currently it is being initialised as a semaphore that can be shared amongst processes.

The semaphore only switches at 30Hz, so it wouldn't be much of a penalty.


hamei
From over the rainbow
Who joined Feb. 24, 2004, 4:10 p.m.

Wrote the following at April 8, 2004, 4:31 a.m...

lewis wrote: I think it's fair to say that IRIX's SMP is rather more advanced than OS/2's, y'know .

In fact ... well, it probably is, since it doesn't have all that x86 crap to deal with. But then, Unix is still stuck in the process world even tho threading is often many times faster.

Remember who IBM is ? And what they made their fortune building ? IBM is seldom best at anything but neither are they very far behind. They were designing mainframes while Jim Clark was still a $400/month math teacher .....

You should know a little more about OS/2 before you criticize. I would suspect you've never used it? Many many nice features and good ideas. Too bad the Redmond crooks suffocated what could have been an entirely different peecee experience.

squeen
Moderator
From Maryland, USA
Who joined May 9, 2003, 6:10 a.m.

Wrote the following at April 14, 2004, 7:56 a.m...

lewis wrote: Squeen, I am using system scope pthreads, or at least both CPUs are being used according to gr_osview. I don't even have capabilities turned on... I thought pthreads did that by default?

Bumped into this again: here's the pthread_attr_setscope man page entry:

IRIX wrote: The pthread_attr_setscope() function sets the thread scheduling scope
attribute in the object attr to the value scope. Possible values for
scope are PTHREAD_SCOPE_SYSTEM, PTHREAD_SCOPE_BOUND_NP and
PTHREAD_SCOPE_PROCESS. The scheduling scope for the attribute object
attr, is returned via the oscope parameter of pthread_attr_getscope().
The default scheduling scope is PTHREAD_SCOPE_PROCESS.

Threads created with system scope have a direct effect on scheduling by
the kernel [see pthread_attr_setschedpolicy() and
pthread_attr_setschedparam()]. System scope threads are therefore
suitable for real-time applications [see realtime]. For example a system
scope thread may run at a higher priority than interrupt threads and
system daemons. Creation of system scope threads requires the
CAP_SCHED_MGT capability [see capability].

Only PTHREAD_SCOPE_SYSTEM and PTHREAD_SCOPE_BOUND_NP (i.e. bound to a CPU and not portable to non-SGI systems) threads will be scheduled by the kernel as separate entities (i.e. can be assigned to an independent CPU); otherwise they just live in the space of the current process.

My code looks something like this:

Code:

   #include <stdio.h>
   #include <pthread.h>

   void launchThread(void)
   {
      int                  rtn;
      pthread_attr_t       pthread_attributes;
      pthread_t            pthread_id;

      if ( (rtn = pthread_attr_init(&pthread_attributes)) != 0 ) {
         fprintf(stderr,"launchThread: pthread_attr_init failed (%d)\n", rtn);
         return;
      }
      /* system scope needs the CAP_SCHED_MGT capability on IRIX */
      if ( (rtn = pthread_attr_setscope(&pthread_attributes, PTHREAD_SCOPE_SYSTEM)) != 0 ) {
         fprintf(stderr,"launchThread: pthread_attr_setscope failed (%d)\n", rtn);
         return;
      }
      /* system_scope_thread is the thread's entry point, defined elsewhere */
      if ( (rtn = pthread_create( &pthread_id, &pthread_attributes,
                                  (void *(*)(void *)) system_scope_thread, NULL)) != 0 ) {
         fprintf(stderr,"launchThread: pthread_create failed (%d)\n", rtn);
         return;
      }
      printf("thread launched\n");
      return;
   }

Without running as a privileged user, it fails at the setscope checkpoint.

Hope that helps.

lewis
From london
Who joined Nov. 27, 2003, 12:30 p.m.

Wrote the following at April 20, 2004, 6:53 a.m...

Squeen, thanks, that does make sense. One wouldn't want just anybody to be making system scope threads :) I think process scope is still fine for this purpose, though.

BTW my Octane is not around so I can't do more on this for a couple of months. But I will at some point... anyone else is welcome to take what's above and carry on. Just stick the first code I posted in the libvo folder as vo_whatever.c, and add links to it in video_out.c or wherever it is. And mess with the configure script.

schleusel
From NRW, Germany
Who joined Oct. 20, 2003, 6:49 a.m.

Wrote the following at April 28, 2004, 1:10 p.m...

Hi!

i've updated the tardist with the new 1.0pre4 release: http://www.kanera.de/irix/MPlayer-1.0pre4.tardist

testers welcome ;-)

Again built with MIPSpro 7.4.1 and aggressive optimizations. I hope I got most problems resulting from my MIPSpro related changes sorted out by now, that is: the codec support should be on par with gcc builds from the unmodified source - at least my growing collection of weird testfiles seems to imply this so far ;-)

But please let me know if you find any problems that might be related to my changes!

Some package specific information:

- again includes mips3 and mips4 selections

- again all external libs are linked statically, hence there are no prereqs.

- now installs to /opt/mplayer instead of /usr/local (as /usr/local is just wrong for packaged stuff)

- enabled codecs/plugins:
Input: ftp network edl tv live.com matroska(internal)
Codecs: xvid libavcodec real faad2(internal) libmpeg2 liba52 mp3lib libvorbis libmad gif
Audio output: sgi sdl mpegpes(file)
Video output: sdl gif89a jpeg mpegpes(file) opengl x11 xover tga

- added an opt subsystem containing Release Notes, the original source and the lengthy patch used for this build

I guess that sums it up, check the Release Notes and the included default mplayer.conf for further information :-)

so long,
Timo

squeen
Moderator
From Maryland, USA
Who joined May 9, 2003, 6:10 a.m.

Wrote the following at April 29, 2004, 4:02 a.m...

lewis wrote: Squeen, thanks, that does make sense. One wouldn't want just anybody to be making system scope threads I think process scope is still fine for this purpose, though.

BTW my Octane is not around so I can't do more on this for a couple of months. But I will at some point... anyone else is welcome to take what's above and carry on. Just stick the first code I posted in the libvo folder as vo_whatever.c, and add links to it in video_out.c or wherever it is. And mess with the configure script.

One more comment on pthreads.

The PROCESS scope threads run in the same process as the one that launched them. But there is just one kernel entity for IRIX to schedule. Therefore you get concurrency (if one thread blocks the other keeps going) but not multi-CPU parallelism. This is known as Nx1 parallelism.

The SYSTEM scope attaches a kernel entity to each thread so IRIX can schedule them for CPU time independently. This will give you parallelism across multiple CPUs. This is NxN parallelism, but it requires special user privileges since the kernel entity runs in the real-time priority band, which is above that of most other processes except (usually!) the kernel itself.

The BOUND_NP scope refers to the thread being "bound" to a kernel entity. It gives you parallelism across CPUs (NxN) but doesn't run at real-time priority and therefore doesn't require special user privileges. The "NP" stands for "not portable", since this is IRIX-only and not a POSIX standard.

If you want to speed things up by using both of the CPUs on your Octane, I'd recommend the BOUND_NP scope.

Also, I've directly noticed a difference between PROCESS and BOUND_NP when I run "top". In my app, the process shows about "100%" CPU usage using PROCESS scope but shows around "150%" CPU usage when I go with BOUND_NP.
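A hedged sketch of how one might request the bound scope while still compiling elsewhere. It assumes PTHREAD_SCOPE_BOUND_NP is visible to the preprocessor as a macro in IRIX's pthread.h (an assumption, not checked here) and falls back to PTHREAD_SCOPE_SYSTEM when it is not defined. `launch_bound_thread` and `example_thread` are hypothetical names.

```c
#include <pthread.h>
#include <stdio.h>

/* Trivial thread body for demonstration: sets the int it is handed to 1. */
void *example_thread(void *arg)
{
    *(int *)arg = 1;
    return NULL;
}

/* Create a thread with the widest scheduling scope that is available.
   Returns 0 on success or the pthread error code. */
int launch_bound_thread(pthread_t *tid, void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    int rtn;

    if ((rtn = pthread_attr_init(&attr)) != 0)
        return rtn;

#ifdef PTHREAD_SCOPE_BOUND_NP
    /* IRIX-only scope: own kernel entity, no real-time priority */
    rtn = pthread_attr_setscope(&attr, PTHREAD_SCOPE_BOUND_NP);
#else
    /* portable fallback; may need privileges on some systems */
    rtn = pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
#endif
    if (rtn != 0)
        fprintf(stderr, "setscope failed (%d), keeping the default scope\n", rtn);

    rtn = pthread_create(tid, &attr, fn, arg);
    pthread_attr_destroy(&attr);
    return rtn;
}
```

A caller would pass its own worker function in place of `example_thread` and join the thread when done.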

schleusel
From NRW, Germany
Who joined Oct. 20, 2003, 6:49 a.m.

Wrote the following at May 2, 2004, 2:41 p.m...

schleusel wrote: i've updated the tardist with the new 1.0pre4 release: http://www.kanera.de/irix/MPlayer-1.0pre4.tardist

uploaded a new revision (128) of the package. Only difference is a small change to vo_gl2 (hopefully removing any remaining fullscreen flickering problems with large files on slower machines)

so long,
Timo

lewis
From london
Who joined Nov. 27, 2003, 12:30 p.m.

Wrote the following at May 3, 2004, 12:32 p.m...

Squeen, thanks. Thinking about it, that might explain why none of the above optimisations made any difference. Shall experiment, at some point.

brent
Who joined May 4, 2004, 9:04 p.m.

Wrote the following at May 4, 2004, 9:20 p.m...

rld: Error: unresolvable symbol in /opt/mplayer/bin/mplayer: _nanf_val
rld: Fatal Error: this executable has unresolvable symbols

this is on a 200 MHz r5k O2, irix 6.5.10. i'm just getting started with IRIX, so forgive my ignorance. might this be an issue I'm seeing due to my fairly dated copy of 6.5? in case that's the issue, i'm looking to get a more recent release.

I tried compiling MPlayer by hand recently (though only with gcc) and got fairly disgusted with the Makefile. After crashing and burning many times (due I think to the order of arguments fed to the linker) I gave up and went in search of packages... only to get the above error.

chervarium
From Sofia, BG, EU
Who joined Jan. 9, 2004, 4:02 a.m.

Wrote the following at May 5, 2004, 12:27 a.m...

brent wrote: rld: Error: unresolvable symbol in /opt/mplayer/bin/mplayer: _nanf_val
rld: Fatal Error: this executable has unresolvable symbols

You'll have to upgrade to a recent IRIX version. This MPlayer package is built against recent libraries and you won't be able to run it on 6.5.10.

