SGI: Video

MPlayer 1.0pre3 tardist - Page 3

schleusel wrote: I've updated the tardist with the new 1.0pre4 release: http://www.kanera.de/irix/MPlayer-1.0pre4.tardist


uploaded a new revision (128) of the package. Only difference is a small change to vo_gl2 (hopefully removing any remaining fullscreen flickering problems with large files on slower machines)


schleusel, very spiffy. Just tried it again on an Octane MXI at 400 MHz and it works really well. Only made the one change in the config file to gl2 and away we went. Played every avi and mpg file I have at about 60-70% CPU utilization. Maybe it's time to up that CPU to a dually ....

Now I can watch porn-o-graphee on my SGI, muchas gracias :-)
With Schleusel's patches to mplayer 1.0pre4 I have succeeded in running a SpeedShop trace. My machine is an I2 HI+TRAM, 195 MHz, 384 MB, 6.5.22m+patches, MIPSPro 7.4.2m+patches. Flags were '-O3 -r10000 -mips4 -n32'. The sample file was a fansub of Psychic Academy episode 4.
BTW, when compiling mplayer, do not strip the resulting executable! Then:

Code: Select all

ssrun -v -exp usertime ./mplayer -vo gl2 -vf format=RGB24 -nosound -benchmark your.avi

which results in a file called mplayer.usertime.somenumbers
Then run prof:

Code: Select all

prof mplayer.usertime.somenumbers > out.txt

And this is what comes out:

Code: Select all

-------------------------------------------------------------------------
SpeedShop profile listing generated Sat Jul  3 20:07:29 2004

prof mplayer.usertime.m17232

mplayer (n32): Target program
usertime: Experiment name
ut:cu: Marching orders
R10000 / R10010: CPU / FPU
1: Number of CPUs
195: Clock frequency (MHz.)
Experiment notes--
From file mplayer.usertime.m17232:
Caliper point 0 at target begin, PID 17232
/usr2/local/src/MPlayer-1.0pre4/mplayer -nosound -benchmark psychic_academy_ep04.avi
Caliper point 1 at exit(0)
-------------------------------------------------------------------------
Summary of statistical callstack sampling data (usertime)--
494: Total Samples
0: Samples with incomplete traceback
14.820: Accumulated Time (secs.)
30.0: Sample interval (msecs.)
-------------------------------------------------------------------------
Function list, in descending order by exclusive time
-------------------------------------------------------------------------
[index]  excl.secs excl.%   cum.%  incl.secs incl.%    samples  procedure  (dso: file, line)

[14]      3.810  25.7%   25.7%      3.810  25.7%        127  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 313)
[20]      2.490  16.8%   42.5%      2.490  16.8%         83  simple_idct_add (mplayer: simple_idct.c, 399)
[21]      1.530  10.3%   52.8%      1.530  10.3%         51  __ioctl (libc.so.1: stat.c, 32; compiled in ioctl.s)
[23]      1.410   9.5%   62.3%      1.410   9.5%         47  __glMgrWaitForDMAWrite (libGLcore.so: mgras_pxdma.c, 368)
[29]      0.930   6.3%   68.6%      0.930   6.3%         31  put_pixels16_l2 (mplayer: dsputil.c, 67)
[30]      0.810   5.5%   74.1%      0.810   5.5%         27  __glMgrim_Finish (libGLcore.so: mgras_modes.c, 60)
[33]      0.420   2.8%   76.9%      0.420   2.8%         14  simple_idct_put (mplayer: simple_idct.c, 389)
[34]      0.420   2.8%   79.8%      0.420   2.8%         14  yuv2rgb_c_24_bgr (mplayer: yuv2rgb.c, 332)
[32]      0.390   2.6%   82.4%      0.450   3.0%         15  msmpeg4_decode_block (mplayer: msmpeg4.c, 1676)
[35]      0.300   2.0%   84.4%      0.300   2.0%         10  memset (libc.so.1: stat.c, 32; compiled in bzero.s)
[7]      0.240   1.6%   86.0%      4.530  30.6%        151  MPV_decode_mb (mplayer: mpegvideo.c, 3093)
[39]      0.180   1.2%   87.2%      0.180   1.2%          6  simple_idct (mplayer: simple_idct.c, 409)
[42]      0.150   1.0%   88.3%      0.150   1.0%          5  __write (libc.so.1: flush.c, 58; compiled in write.s)
[49]      0.120   0.8%   89.1%      0.120   0.8%          4  __read (libc.so.1: malloc.c, 907; compiled in read.s)
[50]      0.120   0.8%   89.9%      0.120   0.8%          4  ff_h263_update_motion_val (mplayer: h263.c, 614)
[31]      0.090   0.6%   90.5%      0.690   4.7%         23  msmpeg4v34_decode_mb (mplayer: msmpeg4.c, 1582)
[63]      0.090   0.6%   91.1%      0.090   0.6%          3  h263_pred_motion (mplayer: h263.c, 1573)
[64]      0.090   0.6%   91.7%      0.090   0.6%          3  put_no_rnd_pixels16_xy2_c (mplayer: dsputil.c, 897)
[65]      0.090   0.6%   92.3%      0.090   0.6%          3  _BSD_getime (libc.so.1: flush.c, 58; compiled in BSD_getime.s)
[28]      0.060   0.4%   92.7%      1.110   7.5%         37  mpeg_motion (mplayer: mpegvideo.c, 2464)
[snip]


Amazing! :shock: More than 25% is spent in yuv2rgb, the colorspace conversion. The inverse discrete cosine transform is #2. Looks like we can kick some butt by:
1) Writing a faster colorspace conversion routine. MIPS asm, SGI_color_matrix, your momma on a calculator, anything seems better than this one.
2) Speeding up the IDCT routine. If I can get the SCSL libraries installed, there's a good chance they have some sped-up fast Fourier transform routines. complib is also an option for somewhat older machines.

project! :D
Wow...
I suspected yuv2rgb would be a decent hit, but 25% of the time?

Maybe I'll spend some time today looking over that...see what might be able to be optimized quickly for a small boost...

Of course, I haven't a clue what the patches you mentioned do or where they can be found. I'll most likely take a stock download of the latest mplayer and attempt it on that...

Sadly I don't think I have SpeedShop installed so who knows how much I'll be able to help...
The SCSL FFT functions are very fast, but not necessarily much use... what might be useful is the BLAS stuff; a lot of time in the IDCT is probably spent cross-multiplying matrices etc.

It's a real pity that mplayer as a whole isn't more thread-friendly - that time waiting for GL would be better spent crunching.

I'm not sure their yuv-rgb routine sucks that much... IIRC someone on the dev lists said 40% of a P3 was typical, although of course most PC cards can do it in hardware.

I sorted out my pthreads properly as per squeen's advice above, and get a reliable 20 FPS playing a DVD on a dual 300, no TRAM, both CPUs completely maxed. Not ideal. But it plays MPEG4 better than a 350 MHz iMac, which is something.
lewis wrote: ...get a reliable 20 FPS playing a DVD ...


If you feel this is an improvement over the version on which you started, could you make a patch against the source and post it under the contrib section of the Nekochan downloads?
I know this thread is long dead, but some of the information in this post is incorrect,
and I hope to shed some light on it.

First some background: on IRIX a process consists of one or more kernel execution vehicles, which we call "uthreads" (i.e. "user threads"). A normal non-threaded process has one and only one uthread. A Pthreaded application is composed of one or more Pthreads, which are assigned to and execute atop one or more uthreads. In a way you can think of a uthread as a virtual processor (in fact that's what they're called inside the library), and you can think of a Pthread as a virtual process -- the process executes on a processor, and something handles scheduling, context switching, and the like.

So, given that background...

squeen wrote: The PROCESS scope threads run in the same process as the one that launched them. But there is just one kernel entity for IRIX to schedule. Therefore you get concurrency (if one thread blocks the other keeps going) but not multi-CPU parallelism. This is known as Nx1 parallelism.


This isn't quite true. The IRIX Pthreads library dynamically creates as many uthreads as it can productively use and schedules individual Pthreads atop them. For PROCESS scope threads this means that as various Pthreads block, or as Pthreads are created and there are additional processors available in the system, the library will spawn additional uthreads to handle any Pthreads which are runnable. The library takes care of scheduling which Pthread is executing atop which uthread at any given moment, and the relationship between a Pthread and a uthread is not fixed -- they may switch associations at any time. In other words, it's just like how the kernel switches normal processes to run on different CPUs -- the relationship between a process and the CPU it is running on may change.

In other words, IRIX PROCESS scope threads follow an MxN threading model. "M" Pthreads running atop "N" uthreads.

squeen wrote: The SYSTEM scope attaches a kernel entity to each thread so IRIX can schedule them for CPU time independently. This will give you multiprocess parallelism across multiple CPUs. This is NxN parallelism, but it requires special user privileges since the kernel entity runs in the real-time priority band, which is above that of most other processes except (usually!) the kernel itself.


Almost. When a SYSTEM scope thread is created on IRIX, the Pthreads library creates a uthread specifically for that Pthread, assigns the Pthread to that uthread, and never changes the relationship between them. That is, the uthread never executes any Pthread other than the one for which it was created, and the Pthread never runs on any uthread other than the one which was created for it. The exact same thing is true for BOUND_NP threads as well.
This is a 1:1 threading model.

The priority of kernel execution has nothing to do with why the CAP_SCHED_MGT capability is required in order to create SYSTEM scope threads. The only reason that CAP_SCHED_MGT is required is because the user may choose to alter the priority of that thread, and may just choose a priority which boosts it above other system threads, which can cause system lockups if great care isn't taken -- but it's all something that realtime programmers are familiar with and know how to deal with. But due to the delicate nature of such decisions, the additional capability is required.

SYSTEM scope threads are also scheduled onto CPUs only by the kernel, which has all the appropriate knowledge to handle realtime events. The Pthreads library does not handle scheduling of SYSTEM scope threads because it is not realtime aware in and of itself, and thus would be unsuitable for such tasks. Only by tying a uthread to a Pthread and letting the kernel take care of the details, essentially removing the library from all scheduling decisions for SYSTEM scope threads, can realtime scheduling work.

Which leads to...

squeen wrote: The BOUND_NP refers to the thread being "bound" to a kernel entity. It gives you parallelism across CPUs (NxN) but doesn't run at a real-time priority and therefore doesn't require special user privileges. The "NP" refers to "not portable" since this is IRIX only and not a POSIX standard.


This is mostly correct, though I don't really agree with calling it NxN. The only difference between a BOUND_NP thread and a SYSTEM thread is the ability to alter the thread's priority and other special abilities that CAP_SCHED_MGT grants to a thread. Otherwise they are completely identical. And thank you very much for pointing out what "NP" means -- that seems to be lost on most people until it's explained.

So why BOUND_NP scope threads if they're just "crippled" SYSTEM scope threads? There is a class of applications, mostly in HPC areas, that would like to tie a given thread to a given CPU (see dplace(1)) for performance reasons (e.g. cache warmth, memory locality), but which do not need scheduling management capabilities (e.g. setting thread priorities). The BOUND_NP scope thread fills this niche. The 1:1 binding of uthread to Pthread allows the application to statically set a CPU on which to run, and set up memory locality and other characteristics for that thread, something which could not be accomplished if the Pthreads library was constantly rearranging the assignment between Pthreads and uthreads, as it does for PROCESS scope threads.

For what it's worth, and if you care, Linux to my knowledge only has the equivalent of IRIX's BOUND_NP scope. Both the PROCESS and SYSTEM scope threads on Linux behave the same as IRIX's BOUND_NP scope.

squeen wrote: If you want to speed things up by using both of the CPUs on your Octane, I'd recommend the BOUND_NP priority level.

Also, I've directly noticed a difference between PROCESS and BOUND_NP when I run "top". In my app, the process shows about "100%" CPU usage using PROCESS scope but shows around "150%" CPU usage when I go with BOUND_NP.


I disagree with this method of speeding things up. Yes, there are some individual cases where using BOUND_NP scope threads instead of PROCESS scope threads could eke out a performance advantage -- however the Pthreads library will schedule as much useful work as possible for PROCESS scope threads -- bearing in mind that it will ramp up the number of uthreads over time, so short-lived Pthreads may not trigger the creation of as many additional uthreads as expected.

I've no explanation for why you would see such a difference in CPU usage between PROCESS and BOUND_NP scope threads, unless you're running into some degenerate case in the Pthreads scheduling code. I'd have to be very familiar with the application to give a good explanation. I'd be curious to know whether you are actually seeing a performance benefit with the bound scope threads, or if that extra 50% CPU time is being wasted in something like lock contention. Is 50% more work getting done or are things running 33% faster, or is the room just getting 2% warmer? ;)
Thanks bcasavan for the clarifications. I'm curious as to your background on IRIX pthreads (current/former SGI programmer?).

It's only your last point I would disagree with: I've never seen PROCESS scope threads appear across more than one CPU. I understand that the function call

Code: Select all

C SYNOPSIS
#include <pthread.h>

int pthread_setconcurrency(int level);

int pthread_getconcurrency(void);

DESCRIPTION
Threads which are created with the PTHREAD_SCOPE_PROCESS attribute (which
is the default) [see pthread_attr_setscope()], are scheduled on a number
of kernel execution vehicles.  By default the number of execution
vehicles used is adjusted by the library as the application runs and is
called the concurrency level.  This is different from the traditional
notion of concurrency because it includes any threads blocked by the
application in the kernel (for example to do IO).  The library raises or
lowers the level to maintain a balance between user context switches and
CPU bandwidth.

is a hint for doing this, and supports your point. However, in practical application I always end up with just one pegged CPU.

Also, the default priority of SYSTEM pthreads seems to be at the bottom of the real-time priority band, which is above all normal user processes.

Anyway, practically speaking, it would be great to realize some sort of dual processor speed-up of mplayer. Any suggestions? :)

EDIT : one last question : When did pthread_barriers appear in IRIX? I just stumbled across them in the past month. I'm pretty sure they weren't there pre-6.5.21.
squeen wrote: Thanks bcasavan for the clarifications. I'm curious as to you background on IRIX pthreads (current/former SGI programmer?).


For almost four years I've been the primary Pthreads engineer at SGI. Yes, that means that the bustedness that happened at 6.5.18 was my fault. :( But fixing that and numerous other problems were my fault too. :) Hopefully things are much better now than when I picked up responsibility during 6.5.13.

squeen wrote: It's only your last point I would disagree with: I've never seen PROCESS scope threads appear across more than one CPU.


It may be due to the fact that the Pthreads library doesn't spawn new uthreads immediately, but waits until it has seen some history of runnable threads in its ready queue, then spawns additional uthreads. This prevents extremely short-lived threads (which are common in some types of programs) from invoking the rather expensive uthread creation only to disappear moments later. It can take several seconds for this to happen.

After your post, I wrote and ran the following program, and after a few seconds I saw approximately 200% CPU utilization, both on an O2K and an Octane (and 100% on an O2).

Code: Select all

#include <pthread.h>
#include <unistd.h>

volatile int done = 0;
pthread_t pt[2];

/* ARGSUSED */
void* ptfunc(void* notused) {
    pthread_detach(pthread_self());
    while (!done)
        ;
    pthread_exit(NULL);
    /* NOTREACHED */
}

int main(void) {
    pthread_create(&pt[0], NULL, ptfunc, NULL);
    pthread_create(&pt[1], NULL, ptfunc, NULL);

    sleep(60);
    done = 1;
    sleep(1); /* Give the threads a chance to exit */
    return 0;
}


You can use the pthread_setconcurrency(3P) call to manually provide a hint to the library as to the number of uthreads to use. When I insert a "pthread_setconcurrency(3)" before the pthread_create calls in the program above, I immediately see 200% CPU time instead of the delayed uthread creation behavior. (The value is 3 because there are two PROCESS scope threads created, plus the main thread, which is SYSTEM scope.)

squeen wrote: Also, the default priority of SYSTEM pthreads seems to be at the bottom of the real-time priority band, which is above all normal user processes.


Yikes. You have a knack for finding the one area of the library I'm not very familiar with. (Un?)fortunately I've never had to fix a bug surrounding thread priorities, so I'm not 100% sure why you see that behavior. It appears the priority always results from pthread_setschedparam, or is inherited from the creating thread.

squeen wrote: Anyway, practically speaking, it would be great to realize some sort of dual processor speed-up of mplayer. Any suggestions? :)


See above regarding pthread_setconcurrency(3P). If for some reason your threads are extremely short-lived (under a few seconds), they may not contribute enough weight to the library's run-queue "load average", and thus additional uthreads are not being created. That'd be my first guess. The second guess would be an incorrect concurrency setting (e.g. forgetting to take into account the main program thread), or accidentally setting the environment variable PT_ITC (which has the same effect as pthread_setconcurrency(3P), useful if you can't modify the source code). Or, it could just be a real lack of parallelism in the execution of the code, but with your previous comments about seeing 150% utilization with bound scope threads, I doubt that's the problem.

squeen wrote: EDIT : one last question : When did pthread_barriers appear in IRIX? I just stumbled across them in the past month. I'm pretty sure they weren't there pre-6.5.21.


That would be some work our realtime guys did... something I reviewed, but didn't write myself.

According to the revision logs, it looks like 6.5.23. If you use them, remember to wrap calls in _MIPS_SYMBOL_PRESENT tests (see /usr/include/optional_sym.h) so that binaries compiled on 6.5.23+ work on older releases. If you haven't read it before, check out the document titled "The Mandate of Application Compatibility in SGI IRIX 6.5" from SGI's Techpubs library.

Hope that helps,
Brent
bcasavan wrote: For almost four years I've been the primary Pthreads engineer at SGI.

It's an honor! Welcome to Nekochan, where old IRIXers go to fade away :) :( .
I'm pthreading right now (with and without REACT) and loving it. Thanks! All has been well since 6.5.21.

I'll try out your sample program shortly, but I have one pressing question -- would you expect the pthread_barrier call to have microsecond latency or tens-of-microseconds latency on an Onyx350, where each thread is SYSTEM scope and running on an isolated processor, non-preemptively? I am looking to speed up non-real-time performance, and when I used semaphores I was disappointed by a speed decrease when I doubled the threads (BOUND_NP, preemptive). The threads take about 500 usec each (2x = 1 millisec sequentially), but need to sync a couple of times during the execution (say every 4x125 usec). What I'd like to see is a 2x speed-up running them on 2 CPUs, versus the sequential result.

Lastly, anything new on the REACT horizon? (IRIX or Linux) :)
I'm not sure, as I'm definitely not a realtime guy. I do see that it involves making a system call on both the "waiter" and "waker" sides, so at a minimum you'll be bound by how long that takes.

I'll try to remember to ask the guy who implemented it (he's just around the corner from me) when he gets back into work tomorrow. PM me to remind me if I don't respond by tomorrow afternoon (US Central).

Brent
squeen wrote: Thanks bcasavan for the clarifications. I'm curious as to you background on IRIX pthreads (current/former SGI programmer?).


Boy, squeen, would you make a rotten Sherlock Holmes! Brent has been posting useful information in the sgi groups for years! A mighty fine day when SGI people of his calibre start to show up here :D
I've never claimed to have a clue.
My SGI is Holmes, Mycroft---not I!

Thanks for the info Brent. Let me know what the real-time fella says.
squeen wrote: I've never claimed to have a clue.
My SGI is Holmes, Mycroft---not I!


Look left. Have you seen your avatar recently? heh heh heh

sorry, couldn't resist :wink:
squeen,

Sorry for forgetting to post. The engineer who implemented the Pthreads barriers said that he expects latencies would be on the tens of microseconds level, though he has done no testing to actually measure this. There is room for latency improvement in these barriers, though I don't think we'd do so without serious customer demand.

In short, the Pthreads barriers were aimed at functionality with reasonable performance, whereas something like MPI barriers are aimed at the highest performance possible, and thus have latencies on the order of microseconds.

Clear as mud? :)

Brent
The version of mencoder included in the tardist refuses to encode MP3 audio. The most common DivX/AVI format seems to be an Xvid video track with an MP3 audio track in an AVI wrapper. Unfortunately, mencoder depends on LAME and complains that it wasn't compiled with LAME MP3 support.

Can you add support for mp3 audio encoding in a future rev of your tardist?