The collected works of bcasavan

I know this thread is long dead, but some of the information in this post is incorrect,
and I hope to shed some light on it.

First some background: On IRIX a process consists of one or more kernel execution vehicles, which we call "uthreads" (i.e. "user threads"). A normal non-threaded process has one and only one uthread. A Pthreaded application is composed of one or more Pthreads, which are assigned to and execute atop one or more uthreads. In a way you can think of a uthread as a virtual processor (in fact that's what they're called inside the library), and you can think of a Pthread as a virtual process -- the process executes on a processor, and something handles scheduling, context switching, and the like.

So, given that background...

squeen wrote: The PROCESS scope threads run in the same process as the one that launched them. But there is just one kernel entity for IRIX to schedule. Therefore you get concurrence (if one thread blocks the other keeps going) but not multi-cpu parallelism. This is known as Nx1 parallelism.


This isn't quite true. The IRIX Pthreads library dynamically creates as many uthreads as it can productively use to schedule individual Pthreads atop them. For PROCESS scope threads this means that as various Pthreads block, or as Pthreads are created and there are additional processors available in the system, the library will spawn additional uthreads to handle any Pthreads which are runnable. The library takes care of scheduling which Pthread is executing atop which uthread at any given moment, and the relationship between a Pthread and a uthread is not fixed -- they may switch associations at any time. In other words, it's just like how the kernel switches normal processes to run on different CPUs -- the relationship between a process and the CPU it is running on may change.

In other words, IRIX PROCESS scope threads follow an MxN threading model. "M" Pthreads running atop "N" uthreads.

squeen wrote: The SYSTEM scope attaches a kernel entity to each thread so IRIX can schedule them for CPU time independently. This will give you multiprocess parallelism across multiple CPUs. This is NxN parallelism, but it requires special user privileges since the kernel entity runs in the real-time priority band which is above that of most other processes except (usually!) the kernel itself.


Almost. When a SYSTEM scope thread is created on IRIX, the Pthreads library creates a uthread specifically for that Pthread, assigns the Pthread to that uthread, and never changes the relationship between them. That is, the uthread never executes any Pthread other than the one for which it was created, and the Pthread never runs on any uthread other than the one which was created for it. The exact same thing is true for BOUND_NP threads as well.
This is a 1:1 threading model.

The priority of kernel execution has nothing to do with why the CAP_SCHED_MGT capability is required in order to create SYSTEM scope threads. The only reason that CAP_SCHED_MGT is required is because the user may choose to alter the priority of that thread, and may just choose a priority which boosts it above other system threads, which can cause system lockups if great care isn't taken -- but it's all something that realtime programmers are familiar with and know how to deal with. But due to the delicate nature of such decisions, the additional capability is required.

SYSTEM scope threads are also scheduled onto CPUs only by the kernel, which has all the appropriate knowledge to handle realtime events. The Pthreads library does not handle scheduling of SYSTEM scope threads because it is not realtime aware in and of itself, and thus would be unsuitable for such tasks. Only by tying a uthread to a Pthread and letting the kernel take care of the details, essentially removing the library from all scheduling decisions for SYSTEM scope threads, can realtime scheduling work.

Which leads to...

squeen wrote: The BOUND_NP refers to the thread being "bound" to a kernel entity. It gives you parallelism across CPUs (NxN) but doesn't run at a real-time priority and therefore doesn't require special user privileges. The "NP" refers to "not portable" since this is IRIX only and not a POSIX standard.


This is mostly correct, though I don't really agree with calling it NxN. The only difference between a BOUND_NP thread and a SYSTEM thread is the ability to alter the thread's priority and other special abilities that CAP_SCHED_MGT grants to a thread. Otherwise they are completely identical. And thank you very much for pointing out what "NP" means -- that seems to be lost on most people until it's explained.

So why BOUND_NP scope threads if they're just "crippled" SYSTEM scope threads? There is a class of applications, mostly in HPC areas, that would like to tie a given thread to a given CPU (see dplace(1)) for performance reasons (e.g. cache warmth, memory locality), but which do not need scheduling management capabilities (e.g. setting thread priorities). The BOUND_NP scope thread fills this niche. The 1:1 binding of uthread to Pthread allows the application to statically set a CPU on which to run, and set up memory locality and other characteristics for that thread, something which could not be accomplished if the Pthreads library was constantly rearranging the assignment between Pthreads and uthreads, as it does for PROCESS scope threads.

For what it's worth, and if you care, Linux to my knowledge only has the equivalent of IRIX's BOUND_NP scope. Both the PROCESS and SYSTEM scope threads on Linux behave the same as IRIX's BOUND_NP scope.

squeen wrote: If you want to speed things up by using both of the CPUs on your Octane, I'd recommend the BOUND_NP priority level.

Also, I've directly noticed a difference between PROCESS and BOUND_NP when I run "top". In my app, the process shows about "100%" CPU usage using PROCESS scope but shows around "150%" CPU usage when I go with BOUND_NP.


I disagree with this method of speeding things up. Yes, there are some individual cases where using BOUND_NP scope threads instead of PROCESS scope threads could eke out a performance advantage. However, the Pthreads library will schedule as much useful work as possible for PROCESS scope threads -- bearing in mind that it ramps up the number of uthreads over time, so short-lived Pthreads may not trigger the creation of as many additional uthreads as expected.

I've no explanation for why you would see such a difference in CPU usage between PROCESS and BOUND_NP scope threads, unless you're running into some degenerate case in the Pthreads scheduling code. I'd have to be very familiar with the application to give a good explanation. I'd be curious to know whether you are actually seeing a performance benefit with the bound scope threads, or if that extra 50% CPU time is being wasted in something like lock contention. Is 50% more work getting done or are things running 33% faster, or is the room just getting 2% warmer? ;)

squeen wrote: Thanks bcasavan for the clarifications. I'm curious as to your background on IRIX pthreads (current/former SGI programmer?).


For almost four years I've been the primary Pthreads engineer at SGI. Yes, that means that the bustedness that happened at 6.5.18 was my fault. :( But fixing that and numerous other problems was my fault too. :) Hopefully things are much better now than when I picked up responsibility during 6.5.13.

squeen wrote: It's only your last point I would disagree with: I've never seen PROCESS scope threads appear across more than one CPU.


It may be due to the fact that the Pthreads library doesn't spawn new uthreads immediately, but waits until it has seen some history of runnable threads in its ready queue, then spawns additional uthreads. This prevents extremely short-lived threads (which are common in some types of programs) from invoking the rather expensive uthread creation only to disappear moments later. It can take several seconds for this to happen.

After your post, I wrote and ran the following program, and after a few seconds I saw approximately 200% CPU utilization, both on an O2K and an Octane (and 100% on an O2).

Code:

#include <pthread.h>
#include <unistd.h>

volatile int done = 0;
pthread_t pt[2];

/* ARGSUSED */
void* ptfunc(void* notused) {
    pthread_detach(pthread_self());
    while (!done);
    pthread_exit(NULL);
    /* NOTREACHED */
}

int main(void) {
    pthread_create(&pt[0], NULL, ptfunc, NULL);
    pthread_create(&pt[1], NULL, ptfunc, NULL);

    sleep(60);
    done = 1;
    sleep(1); /* Give the threads a chance to exit */
    return 0;
}


You can use the pthread_setconcurrency(3P) call to manually provide a hint to the library as to the number of uthreads to use. When I insert a "pthread_setconcurrency(3)" before the pthread_create calls in the program above, I immediately see 200% CPU time instead of the delayed uthread creation behavior. (The value is 3 because there are two PROCESS scope threads created, plus the main thread, which is SYSTEM scope.)

squeen wrote: Also, the default priority of SYSTEM pthreads seems to be at the bottom of the real-time priority band, which is above all normal user processes.


Yikes. You have a knack for finding the one area of the library I'm not very familiar with. (Un?)fortunately I've never had to fix a bug surrounding thread priorities, so I'm not 100% sure why you see that behavior. It appears the priority always results from pthread_setschedparam, or is inherited from the creating thread.

squeen wrote: Anyway, practically speaking, it would be great to realize some sort of dual processor speed-up of mplayer. Any suggestions? :)


See above regarding pthread_setconcurrency(3P). If for some reason your threads are extremely short-lived (under a few seconds), they may not contribute enough weight to the library's run-queue "load average", and thus additional uthreads are not being created. That'd be my first guess. The second guess would be an incorrect concurrency setting (e.g. forgetting to take into account the main program thread), or a mis-set PT_ITC environment variable (which has the same effect as pthread_setconcurrency(3P), and is useful if you can't modify the source code). Or, it could just be a real lack of parallelism in the execution of the code, but with your previous comments about seeing 150% utilization with bound scope threads, I doubt that's the problem.

squeen wrote: EDIT : one last question : When did pthread_barriers appear in IRIX? I just stumbled across them in the past month. I'm pretty sure they weren't there pre-6.5.21.


That would be some work our realtime guys did... something I reviewed, but didn't write myself.

According to the revision logs, it looks like 6.5.23. If you use them, remember to wrap calls in _MIPS_SYMBOL_PRESENT tests (see /usr/include/optional_sym.h) so that binaries compiled on 6.5.23+ work on older releases. If you haven't read it before, check out the document titled "The Mandate of Application Compatibility in SGI IRIX 6.5" from SGI's Techpubs library.

Hope that helps,
Brent

I'm not sure, as I'm definitely not a realtime guy. I do see that it involves making a system call on both the "waiter" and "waker" sides, so at a minimum you'll be bound by how long that takes.

I'll try to remember to ask the guy who implemented it (he's just around the corner from me) when he gets back into work tomorrow. PM me to remind me if I don't respond by tomorrow afternoon (US Central).

Brent

squeen,

Sorry for forgetting to post. The engineer who implemented the Pthreads barriers said that he expects latencies would be on the tens of microseconds level, though he has done no testing to actually measure this. There is room for latency improvement in these barriers, though I don't think we'd do so without serious customer demand.

In short, the Pthreads barriers were aimed at functionality with reasonable performance, whereas something like MPI barriers are aimed at the highest performance possible, and thus have latencies on the order of microseconds.

Clear as mud? :)

Brent