lewis wrote:
I tried your code before you edited it, and... nothing changed. I'm losing sight of what exactly is going on here. What you did makes a lot of sense, I really should have thought of using only one 32bit memory store like that :) Although you do have the bytes back to front. GL_ABGR_EXT is your friend. What do you mean by "R10k pipeline"? Are you relying on speculative memory fetches or something?
...
I wrote this with several principles in mind. First, lb/sb are slow instructions and would hurt instruction dispatch badly, so it is better to store one word after three fetches than to store individual bytes. Second, I used bitwise arithmetic instead of the ordinary kind for this. It makes for spaghetti code but is slightly faster. Third, I watched the compiler's assembly output, realigned and changed the C code, and gave the compiler some hints (I like it portable). After some tweaking I got it to emit the prefetches and the instruction pipelining I wanted (the "R10k/R12k ..."). The final version (the one here) worked well: it ran ~5 times faster than the first two versions (10000 frames (~5m at 30fps) in a stable 1m55s against a stable 9m).
Looks like the bottleneck might then be in the GL libraries themselves... I don't have your source tree, but if you don't mind sharing it I would do some profiling on it. I've become interested in this.
Btw, I am building with MIPSpro, which has a better optimiser.
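For anyone following along, the first point (one word store per three byte fetches) can be sketched roughly as below. This is not the actual routine from this thread; the function name `pack_rgb_words`, the packed 3-byte RGB source layout and the constant alpha value are assumptions for illustration only. The word is built so that on a big-endian MIPS its bytes land in memory as A, B, G, R, which is the order GL_ABGR_EXT expects.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (illustrative name): expand n packed RGB pixels
 * into n 32-bit words, doing one word store per pixel instead of
 * several byte stores. */
void pack_rgb_words(const uint8_t *src, uint32_t *dst, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        /* Three byte fetches... */
        uint32_t r = src[0];
        uint32_t g = src[1];
        uint32_t b = src[2];
        /* ...combined with shifts and ORs, then a single 32-bit store.
         * Alpha is assumed to be fully opaque (0xFF). */
        *dst++ = 0xFF000000u | (b << 16) | (g << 8) | r;
        src += 3;
    }
}
```

Whether the compiler then schedules the loads and emits prefetches the way you want still has to be checked in the assembly output.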
nvukovlj wrote:
I see you're using semaphores - are they being used just as binary semaphores, or can the value go above 1 ?
Also, is the semaphore being created meant to be used across processes, or within a process? Currently it is being initialised as a semaphore that can be shared amongst processes.
The semaphore switches at 30Hz, so it wouldn't incur a great penalty.
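To make the distinction in the question concrete, here is a minimal sketch of a process-private POSIX semaphore. It assumes the code uses `sem_init` from `<semaphore.h>`, which the thread does not confirm; the helper name `init_private_sem` is made up for illustration.

```c
#include <semaphore.h>

/* Hypothetical helper (illustrative name): set up a semaphore that is
 * only shared between threads of the calling process. */
int init_private_sem(sem_t *sem)
{
    /* Second argument (pshared) = 0: private to this process.
     * Third argument = 0: the first sem_wait() blocks until a sem_post(). */
    return sem_init(sem, 0, 0);
}
```

A nonzero second argument would make it shareable across processes, which also requires the `sem_t` to live in shared memory. Note that a POSIX semaphore is a counting semaphore, so it only behaves as a binary one if every `sem_post` is matched by a `sem_wait` before the next post.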