Hi all,
Sorry for not being so active on Nekoware builds the last couple of weeks, but i really wanted to take part in Schleusel's and Vegac's attempts in making MPlayer just a little bit faster, so i can watch neato movie stuff on my I2 Impact . It has cost me a lot of time, but boy, am i glad i spent it with them and MPlayer. I will show you what i did, so maybe you can learn from my trials of getting that app speed up it's framerate. The optimisation is not done yet, it still ongoing, and we only just begun searching out the possibilities of hardware colorspace conversion, but the methods behind the software optimisation part has now been understood. Here it comes. I hope Neko won't mind me breaking the record of longest post ever on Nekochan. Beer is in the mail, Pete
Since my first speedshop run, in the mplayer 1.0pre3 thread viewtopic.php?t=1374 i've read man pages of ssrun just to get myself acquainted with the most used options. Instead of 'ssrun -exp totaltime' or 'ssrun -exp usertime' i now do 'ssrun -exp fpcsampx' to get my Finegrained-ProgramCounter-SAMPling-with-4-bytes timing for all the routines which MPlayer is busy with. So lets do a standard MIPSPro 7.4.2 '-O3 -r10000 -mips4 -n32' build of MPlayer 1.0-pre4, run 'ssrun -exp fpcsampx ./mplayer' with a standard .avi file (Schleusel and i use "courtyard.avi" from a Call Of Duty Demo video (11.4MB), to be found on the net). Machine is an R10K@180MHz Origin200:
and so forth.... Note the total time, which is 64 seconds. One third of the program's time is used in the InverseDiscreteCosineTransform adition, another third in the colorspace converter yuv2rgb_c_24_rgb and the rest in the rest.
So Schleusel, Vegac and me sorta divided up the tasks. Schleusel tried some optimisations flags (-IPA) and porting details, Vegac had a look at the colorspace conversion and possible OpenGL O2/ICE speedups, and i got libavcodec/simple_idct.c :) simple_idct_add is just a small routine, consisting of two parts of each 8 'for' loops, which is basically the way how a double 1D-IDCT works. First the rows are transformed, and then the colums are transformed. The fransformed values are added to the already existing image, hence 'add'. For more info, read up on IDCT:
(1) http://skal.planet-d.net/coding/dct.html
(2) http://www-vs.informatik.uni-ulm.de/bib ... paper.html
(3) http://rnvs.informatik.tu-chemnitz.de/~ ... /IDCT.html
Here's the interesting part:
At first glance i thought, what a lot of branches/decisions! If this routine is to become fast, it has to get rid of all those branches. Careful observing shows that the 'if' statements can be removed safely, because the condition (col[8*x]) is true when col[8*x]!=0. But if col[8*x]==0 then nothing is added or subtraced to the coefficients inside the 'if' statement anyway, so the if statement is superfluous:
will become:
So now i have to do more instructions, but will it weigh up to the time spent in those conditions? Answer later.
When reading (1) it becomes clear that these are indeed matrix operations of which 4 of them are so called 'rotations' which involves cosines. The cosine coefficients are all in separate '#defines' and converted to integers, so that makes this routine a 'fast-integer 1D-IDCT' Now write out all the multiplications and coefficients:
=
=
So after a lot of wizardry, i'm left with some fairly symmetric multiplications. Now comes the clever part. Also from (1), A multiplication of the form:
can also be written as:
which saves you one expensive multiplication per 2x2 matrix multiplication. This sort of 2x2 matrix multiplication BTW is supposed to be called a Butterfly, because of the butterfly shape of the diagrammatic form.
This is the reason why i had to get rid of those 'if' branches in the beginning. I couldn't have written out those butterflies without those pieces inside an 'if'. In this case it is worthwile, but one always has to test these things carefully.
Enter the SIMD (Single Instruction on Multiple Data) instructions on the MIPS 4 Instruction set.
These instruction perform a multiplication and addition/subtraction in one go! They are only to be found on mips4 instruction sets, so you gotta compile with -mips4 to get them. Also they are only for floats, either singles and doubles, not for integers! :(
BUT, the R10000 is a pretty nifty piece of machinery. It is blessed with both an Integer ALU and floating point ALU and can issue 2 integer ops and 1 floating point ops in one clocktick. So what if we substitute a part of the calculation with floats instead of integers? Maybe the compiler can then weave these two "threads" together, with the added bonus of 'madd'/'nmsub' instructions for the floating point part. So i devised this:
All the calculation involving the 'b' coefficients are floats and all the 'a's are integers. Got some temp storage variables as well, since the R10000 has a large number of registers, so this never hurts.
The BUTTERFLY0 macro is for the floats, because that gives the compiler the chance to issue the madd instructions and the normal BUTTERFLY macro is for the integers, because of the saving of one multiplication. Oh and the order of macro's for the float series (b) is independent from the integer series (a), which makes it easy for the compiler/CPU to balance the load between the two ALU's
If you look carefully look at the assembly output ('c99 -S'), you can see the weaving of instructions. First the 'old' routine:
Normal
171 clockticks! Because all the ops are integers, the integer ALU gets extremely busy, thereby stalling the throughput.
Optimised
Tadaaa, 55 ticks! Amazing what a little software pipelining can do for your code!
Running speedshop again proves it:
Summary of statistical PC sampling data (fpcsampx)--
49298: Total samples
49.298: Accumulated time (secs.)
1.0: Time per sample (msecs.)
4: Sample bin width (bytes)
-------------------------------------------------------------------------
Function list, in descending order by time
-------------------------------------------------------------------------
[index] secs % cum.% samples function (dso: file, line)
[1] 20.239 41.1% 41.1% 20239 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 319)
[2] 7.177 14.6% 55.6% 7177 idctSparseColAdd (mplayer: simple_idct.c, 209)
[3] 4.454 9.0% 64.6% 4454 put_pixels8_c (mplayer: dsputil.c, 897)
[4] 2.382 4.8% 69.5% 2382 msmpeg4_decode_block (mplayer: msmpeg4.c, 1676)
[5] 1.948 4.0% 73.4% 1948 idctRowCondDC (mplayer: simple_idct.c, 104)
[6] 1.551 3.1% 76.6% 1551 simple_idct_put (mplayer: simple_idct.c, 313)
[7] 1.301 2.6% 79.2% 1301 put_pixels16_xy2_c (mplayer: dsputil.c, 897)
IDCT is a bit more scattered now, three routines instead of one, but added up (idctSparseColAdd +idctRowCondDC + simple_idct_put) gives 21.7% which is half the time of yuv2rgb_c_24_rgb(41.1%). Compared that with the starting speedshop run, this is a 50% reduction in time spent in IDCT! Looking at the total times. 64.0 versus 49.3 seconds is 23% speedup of the app. Whoa! Granted, it's only for this .avi file and this specific IDCT. Other codecs need other routines, but it's a start.
Well, hope you have read through it and picked up some ideas. Next time i'll be looking at yuv2rgb.c software wise.
The patch for libavcodec/simple_idct.c is now living at http://www.mechanics.citg.tudelft.nl/~e ... pre4.patch get it and try it. Schleusel and i will try to pester MPlayer CVS guru's to get us some libavcodec/mips subdirectory where we can store this.
%-)
Sorry for not being so active on Nekoware builds the last couple of weeks, but i really wanted to take part in Schleusel's and Vegac's attempts in making MPlayer just a little bit faster, so i can watch neato movie stuff on my I2 Impact . It has cost me a lot of time, but boy, am i glad i spent it with them and MPlayer. I will show you what i did, so maybe you can learn from my trials of getting that app speed up it's framerate. The optimisation is not done yet, it still ongoing, and we only just begun searching out the possibilities of hardware colorspace conversion, but the methods behind the software optimisation part has now been understood. Here it comes. I hope Neko won't mind me breaking the record of longest post ever on Nekochan. Beer is in the mail, Pete
Since my first speedshop run, in the mplayer 1.0pre3 thread viewtopic.php?t=1374 i've read man pages of ssrun just to get myself acquainted with the most used options. Instead of 'ssrun -exp totaltime' or 'ssrun -exp usertime' i now do 'ssrun -exp fpcsampx' to get my Finegrained-ProgramCounter-SAMPling-with-4-bytes timing for all the routines which MPlayer is busy with. So lets do a standard MIPSPro 7.4.2 '-O3 -r10000 -mips4 -n32' build of MPlayer 1.0-pre4, run 'ssrun -exp fpcsampx ./mplayer' with a standard .avi file (Schleusel and i use "courtyard.avi" from a Call Of Duty Demo video (11.4MB), to be found on the net). Machine is an R10K@180MHz Origin200:
Code:
ssrun -exp fpcsampx ./mplayer -vo null -vf format=rgb24 -nosound courtyard.avi
prof -lines mplayer.fpcsampx.#
prof -lines mplayer.fpcsampx.#
Code:
-------------------------------------------------------------------------
SpeedShop profile listing generated Thu Jul 22 09:54:04 2004
prof -lines mplayer.fpcsampx.m249164
mplayer (n32): Target program
fpcsampx: Experiment name
pc,4,1000,0:cu: Marching orders
R10000 / R10010: CPU / FPU
4: Number of CPUs
180: Clock frequency (MHz.)
Experiment notes--
From file mplayer.fpcsampx.m249164:
Caliper point 0 at target begin, PID 249164
/usr1/local/everdij/MPlayer-1.0pre4/mplayer -vo null -vf format=rgb24 -nosound courtyard.avi
Caliper point 1 at exit(0)
-------------------------------------------------------------------------
Summary of statistical PC sampling data (fpcsampx)--
64010: Total samples
64.010: Accumulated time (secs.)
1.0: Time per sample (msecs.)
4: Sample bin width (bytes)
-------------------------------------------------------------------------
Function list, in descending order by time
-------------------------------------------------------------------------
[index] secs % cum.% samples function (dso: file, line)
[1] 21.754 34.0% 34.0% 21754 simple_idct_add (mplayer: simple_idct.c, 399)
[2] 20.301 31.7% 65.7% 20301 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 319)
[3] 4.593 7.2% 72.9% 4593 put_pixels8_c (mplayer: dsputil.c, 897)
[4] 3.701 5.8% 78.7% 3701 simple_idct_put (mplayer: simple_idct.c, 389)
[5] 2.402 3.8% 82.4% 2402 msmpeg4_decode_block (mplayer: msmpeg4.c, 1676)
[6] 1.323 2.1% 84.5% 1323 put_pixels16_xy2_c (mplayer: dsputil.c, 897)
[7] 1.214 1.9% 86.4% 1214 memset (libc.so.1: stat.c, 32; compiled in bzero.s)
SpeedShop profile listing generated Thu Jul 22 09:54:04 2004
prof -lines mplayer.fpcsampx.m249164
mplayer (n32): Target program
fpcsampx: Experiment name
pc,4,1000,0:cu: Marching orders
R10000 / R10010: CPU / FPU
4: Number of CPUs
180: Clock frequency (MHz.)
Experiment notes--
From file mplayer.fpcsampx.m249164:
Caliper point 0 at target begin, PID 249164
/usr1/local/everdij/MPlayer-1.0pre4/mplayer -vo null -vf format=rgb24 -nosound courtyard.avi
Caliper point 1 at exit(0)
-------------------------------------------------------------------------
Summary of statistical PC sampling data (fpcsampx)--
64010: Total samples
64.010: Accumulated time (secs.)
1.0: Time per sample (msecs.)
4: Sample bin width (bytes)
-------------------------------------------------------------------------
Function list, in descending order by time
-------------------------------------------------------------------------
[index] secs % cum.% samples function (dso: file, line)
[1] 21.754 34.0% 34.0% 21754 simple_idct_add (mplayer: simple_idct.c, 399)
[2] 20.301 31.7% 65.7% 20301 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 319)
[3] 4.593 7.2% 72.9% 4593 put_pixels8_c (mplayer: dsputil.c, 897)
[4] 3.701 5.8% 78.7% 3701 simple_idct_put (mplayer: simple_idct.c, 389)
[5] 2.402 3.8% 82.4% 2402 msmpeg4_decode_block (mplayer: msmpeg4.c, 1676)
[6] 1.323 2.1% 84.5% 1323 put_pixels16_xy2_c (mplayer: dsputil.c, 897)
[7] 1.214 1.9% 86.4% 1214 memset (libc.so.1: stat.c, 32; compiled in bzero.s)
and so forth.... Note the total time, which is 64 seconds. One third of the program's time is used in the InverseDiscreteCosineTransform adition, another third in the colorspace converter yuv2rgb_c_24_rgb and the rest in the rest.
So Schleusel, Vegac and me sorta divided up the tasks. Schleusel tried some optimisations flags (-IPA) and porting details, Vegac had a look at the colorspace conversion and possible OpenGL O2/ICE speedups, and i got libavcodec/simple_idct.c :) simple_idct_add is just a small routine, consisting of two parts of each 8 'for' loops, which is basically the way how a double 1D-IDCT works. First the rows are transformed, and then the colums are transformed. The fransformed values are added to the already existing image, hence 'add'. For more info, read up on IDCT:
(1) http://skal.planet-d.net/coding/dct.html
(2) http://www-vs.informatik.uni-ulm.de/bib ... paper.html
(3) http://rnvs.informatik.tu-chemnitz.de/~ ... /IDCT.html
Here's the interesting part:
Code:
/* signed 16x16 -> 32 multiply add accumulate */
#define W1 22725 //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
#define W2 21407 //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
#define W3 19266 //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
#define W4 16383 //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
#define W5 12873 //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
#define W6 8867 //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
#define W7 4520 //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
#define ROW_SHIFT 11
#define COL_SHIFT 20 // 6
#define MAC16(rt, ra, rb) rt += (ra) * (rb)
/* signed 16x16 -> 32 multiply */
#define MUL16(rt, ra, rb) rt = (ra) * (rb)
static inline void idctSparseColAdd (uint8_t *dest, int line_size,
DCTELEM * col)
{
int a0, a1, a2, a3, b0, b1, b2, b3;
uint8_t *cm = cropTbl + MAX_NEG_CROP;
/* XXX: I did that only to give same values as previous code */
a0 = W4 * (col[8*0] + ((1<<(COL_SHIFT-1))/W4));
a1 = a0;
a2 = a0;
a3 = a0;
a0 += + W2*col[8*2];
a1 += + W6*col[8*2];
a2 += - W6*col[8*2];
a3 += - W2*col[8*2];
MUL16(b0, W1, col[8*1]);
MUL16(b1, W3, col[8*1]);
MUL16(b2, W5, col[8*1]);
MUL16(b3, W7, col[8*1]);
MAC16(b0, + W3, col[8*3]);
MAC16(b1, - W7, col[8*3]);
MAC16(b2, - W1, col[8*3]);
MAC16(b3, - W5, col[8*3]);
if(col[8*4]){
a0 += + W4*col[8*4];
a1 += - W4*col[8*4];
a2 += - W4*col[8*4];
a3 += + W4*col[8*4];
}
if (col[8*5]) {
MAC16(b0, + W5, col[8*5]);
MAC16(b1, - W1, col[8*5]);
MAC16(b2, + W7, col[8*5]);
MAC16(b3, + W3, col[8*5]);
}
if(col[8*6]){
a0 += + W6*col[8*6];
a1 += - W2*col[8*6];
a2 += + W2*col[8*6];
a3 += - W6*col[8*6];
}
if (col[8*7]) {
MAC16(b0, + W7, col[8*7]);
MAC16(b1, - W5, col[8*7]);
MAC16(b2, + W3, col[8*7]);
MAC16(b3, - W1, col[8*7]);
}
dest[0] = cm[dest[0] + ((a0 + b0) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a1 + b1) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a2 + b2) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a3 + b3) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a3 - b3) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a2 - b2) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a1 - b1) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a0 - b0) >> COL_SHIFT)];
}
#define W1 22725 //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
#define W2 21407 //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
#define W3 19266 //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
#define W4 16383 //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
#define W5 12873 //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
#define W6 8867 //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
#define W7 4520 //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
#define ROW_SHIFT 11
#define COL_SHIFT 20 // 6
#define MAC16(rt, ra, rb) rt += (ra) * (rb)
/* signed 16x16 -> 32 multiply */
#define MUL16(rt, ra, rb) rt = (ra) * (rb)
static inline void idctSparseColAdd (uint8_t *dest, int line_size,
DCTELEM * col)
{
int a0, a1, a2, a3, b0, b1, b2, b3;
uint8_t *cm = cropTbl + MAX_NEG_CROP;
/* XXX: I did that only to give same values as previous code */
a0 = W4 * (col[8*0] + ((1<<(COL_SHIFT-1))/W4));
a1 = a0;
a2 = a0;
a3 = a0;
a0 += + W2*col[8*2];
a1 += + W6*col[8*2];
a2 += - W6*col[8*2];
a3 += - W2*col[8*2];
MUL16(b0, W1, col[8*1]);
MUL16(b1, W3, col[8*1]);
MUL16(b2, W5, col[8*1]);
MUL16(b3, W7, col[8*1]);
MAC16(b0, + W3, col[8*3]);
MAC16(b1, - W7, col[8*3]);
MAC16(b2, - W1, col[8*3]);
MAC16(b3, - W5, col[8*3]);
if(col[8*4]){
a0 += + W4*col[8*4];
a1 += - W4*col[8*4];
a2 += - W4*col[8*4];
a3 += + W4*col[8*4];
}
if (col[8*5]) {
MAC16(b0, + W5, col[8*5]);
MAC16(b1, - W1, col[8*5]);
MAC16(b2, + W7, col[8*5]);
MAC16(b3, + W3, col[8*5]);
}
if(col[8*6]){
a0 += + W6*col[8*6];
a1 += - W2*col[8*6];
a2 += + W2*col[8*6];
a3 += - W6*col[8*6];
}
if (col[8*7]) {
MAC16(b0, + W7, col[8*7]);
MAC16(b1, - W5, col[8*7]);
MAC16(b2, + W3, col[8*7]);
MAC16(b3, - W1, col[8*7]);
}
dest[0] = cm[dest[0] + ((a0 + b0) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a1 + b1) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a2 + b2) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a3 + b3) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a3 - b3) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a2 - b2) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a1 - b1) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a0 - b0) >> COL_SHIFT)];
}
At first glance i thought, what a lot of branches/decisions! If this routine is to become fast, it has to get rid of all those branches. Careful observing shows that the 'if' statements can be removed safely, because the condition (col[8*x]) is true when col[8*x]!=0. But if col[8*x]==0 then nothing is added or subtraced to the coefficients inside the 'if' statement anyway, so the if statement is superfluous:
Code:
if(col[8*4]){
a0 += + W4*col[8*4];
a1 += - W4*col[8*4];
a2 += - W4*col[8*4];
a3 += + W4*col[8*4];
}
if (col[8*5]) {
MAC16(b0, + W5, col[8*5]);
MAC16(b1, - W1, col[8*5]);
MAC16(b2, + W7, col[8*5]);
MAC16(b3, + W3, col[8*5]);
}
if(col[8*6]){
a0 += + W6*col[8*6];
a1 += - W2*col[8*6];
a2 += + W2*col[8*6];
a3 += - W6*col[8*6];
}
if (col[8*7]) {
MAC16(b0, + W7, col[8*7]);
MAC16(b1, - W5, col[8*7]);
MAC16(b2, + W3, col[8*7]);
MAC16(b3, - W1, col[8*7]);
}
a0 += + W4*col[8*4];
a1 += - W4*col[8*4];
a2 += - W4*col[8*4];
a3 += + W4*col[8*4];
}
if (col[8*5]) {
MAC16(b0, + W5, col[8*5]);
MAC16(b1, - W1, col[8*5]);
MAC16(b2, + W7, col[8*5]);
MAC16(b3, + W3, col[8*5]);
}
if(col[8*6]){
a0 += + W6*col[8*6];
a1 += - W2*col[8*6];
a2 += + W2*col[8*6];
a3 += - W6*col[8*6];
}
if (col[8*7]) {
MAC16(b0, + W7, col[8*7]);
MAC16(b1, - W5, col[8*7]);
MAC16(b2, + W3, col[8*7]);
MAC16(b3, - W1, col[8*7]);
}
will become:
Code:
a0 += + W4*col[8*4];
a1 += - W4*col[8*4];
a2 += - W4*col[8*4];
a3 += + W4*col[8*4];
MAC16(b0, + W5, col[8*5]);
MAC16(b1, - W1, col[8*5]);
MAC16(b2, + W7, col[8*5]);
MAC16(b3, + W3, col[8*5]);
a0 += + W6*col[8*6];
a1 += - W2*col[8*6];
a2 += + W2*col[8*6];
a3 += - W6*col[8*6];
MAC16(b0, + W7, col[8*7]);
MAC16(b1, - W5, col[8*7]);
MAC16(b2, + W3, col[8*7]);
MAC16(b3, - W1, col[8*7]);
a1 += - W4*col[8*4];
a2 += - W4*col[8*4];
a3 += + W4*col[8*4];
MAC16(b0, + W5, col[8*5]);
MAC16(b1, - W1, col[8*5]);
MAC16(b2, + W7, col[8*5]);
MAC16(b3, + W3, col[8*5]);
a0 += + W6*col[8*6];
a1 += - W2*col[8*6];
a2 += + W2*col[8*6];
a3 += - W6*col[8*6];
MAC16(b0, + W7, col[8*7]);
MAC16(b1, - W5, col[8*7]);
MAC16(b2, + W3, col[8*7]);
MAC16(b3, - W1, col[8*7]);
So now i have to do more instructions, but will it weigh up to the time spent in those conditions? Answer later.
When reading (1) it becomes clear that these are indeed matrix operations of which 4 of them are so called 'rotations' which involves cosines. The cosine coefficients are all in separate '#defines' and converted to integers, so that makes this routine a 'fast-integer 1D-IDCT' Now write out all the multiplications and coefficients:
Code:
a0 = W4 * (col[8*0] + ((1<<(COL_SHIFT-1))/W4));
a1 = a0;
a2 = a0;
a3 = a0;
a0 += + W2*col[8*2];
a1 += + W6*col[8*2];
a2 += - W6*col[8*2];
a3 += - W2*col[8*2];
MUL16(b0, W1, col[8*1]);
MUL16(b1, W3, col[8*1]);
MUL16(b2, W5, col[8*1]);
MUL16(b3, W7, col[8*1]);
MAC16(b0, + W3, col[8*3]);
MAC16(b1, - W7, col[8*3]);
MAC16(b2, - W1, col[8*3]);
MAC16(b3, - W5, col[8*3]);
a0 += + W4*col[8*4];
a1 += - W4*col[8*4];
a2 += - W4*col[8*4];
a3 += + W4*col[8*4];
MAC16(b0, + W5, col[8*5]);
MAC16(b1, - W1, col[8*5]);
MAC16(b2, + W7, col[8*5]);
MAC16(b3, + W3, col[8*5]);
a0 += + W6*col[8*6];
a1 += - W2*col[8*6];
a2 += + W2*col[8*6];
a3 += - W6*col[8*6];
MAC16(b0, + W7, col[8*7]);
MAC16(b1, - W5, col[8*7]);
MAC16(b2, + W3, col[8*7]);
MAC16(b3, - W1, col[8*7]);
a1 = a0;
a2 = a0;
a3 = a0;
a0 += + W2*col[8*2];
a1 += + W6*col[8*2];
a2 += - W6*col[8*2];
a3 += - W2*col[8*2];
MUL16(b0, W1, col[8*1]);
MUL16(b1, W3, col[8*1]);
MUL16(b2, W5, col[8*1]);
MUL16(b3, W7, col[8*1]);
MAC16(b0, + W3, col[8*3]);
MAC16(b1, - W7, col[8*3]);
MAC16(b2, - W1, col[8*3]);
MAC16(b3, - W5, col[8*3]);
a0 += + W4*col[8*4];
a1 += - W4*col[8*4];
a2 += - W4*col[8*4];
a3 += + W4*col[8*4];
MAC16(b0, + W5, col[8*5]);
MAC16(b1, - W1, col[8*5]);
MAC16(b2, + W7, col[8*5]);
MAC16(b3, + W3, col[8*5]);
a0 += + W6*col[8*6];
a1 += - W2*col[8*6];
a2 += + W2*col[8*6];
a3 += - W6*col[8*6];
MAC16(b0, + W7, col[8*7]);
MAC16(b1, - W5, col[8*7]);
MAC16(b2, + W3, col[8*7]);
MAC16(b3, - W1, col[8*7]);
=
Code:
a0 = W4 * col[8*0] + (1<<(COL_SHIFT-1));
a1 = a0;
a2 = a0;
a3 = a0;
a0 += W4*col[8*4];
a1 -= W4*col[8*4];
a2 -= W4*col[8*4];
a3 += W4*col[8*4];
a0 += col[8*2]*W2;
a1 += col[8*2]*W6;
a2 -= col[8*2]*W6;
a3 -= col[8*2]*W2;
a0 += col[8*6]*W6;
a1 -= col[8*6]*w2;
a2 += col[8*6]*W2;
a3 -= col[8*6]*W6;
b0 = col[8*1]*W1;
b1 = col[8*1]*W3;
b2 = col[8*1]*W5;
b3 = col[8*1]*W7;
b0 += col[8*3]*W3;
b1 -= col[8*3]*W7;
b2 -= col[8*3]*W1;
b3 -= col[8*3]*W5;
b0 += col[8*5]*W5;
b1 -= col[8*5]*W1;
b2 += col[8*5]*W7;
b3 += col[8*5]*W3;
b0 += col[8*7]*W7;
b1 -= col[8*7]*W5;
b2 += col[8*7]*W3;
b3 -= col[8*7]*W1;
a1 = a0;
a2 = a0;
a3 = a0;
a0 += W4*col[8*4];
a1 -= W4*col[8*4];
a2 -= W4*col[8*4];
a3 += W4*col[8*4];
a0 += col[8*2]*W2;
a1 += col[8*2]*W6;
a2 -= col[8*2]*W6;
a3 -= col[8*2]*W2;
a0 += col[8*6]*W6;
a1 -= col[8*6]*w2;
a2 += col[8*6]*W2;
a3 -= col[8*6]*W6;
b0 = col[8*1]*W1;
b1 = col[8*1]*W3;
b2 = col[8*1]*W5;
b3 = col[8*1]*W7;
b0 += col[8*3]*W3;
b1 -= col[8*3]*W7;
b2 -= col[8*3]*W1;
b3 -= col[8*3]*W5;
b0 += col[8*5]*W5;
b1 -= col[8*5]*W1;
b2 += col[8*5]*W7;
b3 += col[8*5]*W3;
b0 += col[8*7]*W7;
b1 -= col[8*7]*W5;
b2 += col[8*7]*W3;
b3 -= col[8*7]*W1;
=
Code:
int d0,d2=col[8*2],d4=W4*col[8*4],d6=col[8*6];
int d1=col[8*1],d3=col[8*3],d5=col[8*5],d7=col[8*7];
a0 = W4 * col[8*0] + (1<<(COL_SHIFT-1));
a1 = a0;
a0 += d4;
a1 -= d4;
a3 = a0;
a2 = a1;
a0 += d2*W2 + d6*W6;
a1 += d2*W6 - d6*W2;
a2 +=-d2*W6 + d6*W2;
a3 +=-d2*W2 - d6*W6;
b0 = d1*W1 + d7*W7;
b3 = d1*W7 - d7*W1;
b2 = d1*W5 + d7*W3;
b1 = d1*W3 - d7*W5;
b0 += d3*W3 + d5*W5;
b3 +=-d3*W5 + d5*W3;
b1 +=-d3*W7 - d5*W1;
b2 +=-d3*W1 + d5*W7;
int d1=col[8*1],d3=col[8*3],d5=col[8*5],d7=col[8*7];
a0 = W4 * col[8*0] + (1<<(COL_SHIFT-1));
a1 = a0;
a0 += d4;
a1 -= d4;
a3 = a0;
a2 = a1;
a0 += d2*W2 + d6*W6;
a1 += d2*W6 - d6*W2;
a2 +=-d2*W6 + d6*W2;
a3 +=-d2*W2 - d6*W6;
b0 = d1*W1 + d7*W7;
b3 = d1*W7 - d7*W1;
b2 = d1*W5 + d7*W3;
b1 = d1*W3 - d7*W5;
b0 += d3*W3 + d5*W5;
b3 +=-d3*W5 + d5*W3;
b1 +=-d3*W7 - d5*W1;
b2 +=-d3*W1 + d5*W7;
So after a lot of wizardry, i'm left with some fairly symmetric multiplications. Now comes the clever part. Also from (1), A multiplication of the form:
Code:
t0 = W0 * d0 + W1 * d1;
t1 = W0 * d1 - W1 * d0;
t1 = W0 * d1 - W1 * d0;
can also be written as:
Code:
int tmp = W0 * (d0 + d1);
t0 = tmp + (W1 - W0) * d1;
t1 = tmp - (W1 + W0) * d0;
t0 = tmp + (W1 - W0) * d1;
t1 = tmp - (W1 + W0) * d0;
which saves you one expensive multiplication per 2x2 matrix multiplication. This sort of 2x2 matrix multiplication BTW is supposed to be called a Butterfly, because of the butterfly shape of the diagrammatic form.
This is the reason why i had to get rid of those 'if' branches in the beginning. I couldn't have written out those butterflies without those pieces inside an 'if'. In this case it is worthwile, but one always has to test these things carefully.
Enter the SIMD (Single Instruction on Multiple Data) instructions on the MIPS 4 Instruction set.
Code:
madd ==> a = a + b*c
nmsub ==> a = a - b*c
nmsub ==> a = a - b*c
These instruction perform a multiplication and addition/subtraction in one go! They are only to be found on mips4 instruction sets, so you gotta compile with -mips4 to get them. Also they are only for floats, either singles and doubles, not for integers! :(
BUT, the R10000 is a pretty nifty piece of machinery. It is blessed with both an Integer ALU and floating point ALU and can issue 2 integer ops and 1 floating point ops in one clocktick. So what if we substitute a part of the calculation with floats instead of integers? Maybe the compiler can then weave these two "threads" together, with the added bonus of 'madd'/'nmsub' instructions for the floating point part. So i devised this:
Code:
#define BUTTERFLY(t0,t1,W0,W1,d0,d1) \
do { \
int tmp = W0 * (d0 + d1); \
t0 = tmp + (W1 - W0) * d1; \
t1 = tmp - (W1 + W0) * d0; \
} while (0)
#define BUTTERFLY0(t0,t1,W0,W1,d0,d1) \
do { \
t0 = W0 * d0 + W1 * d1; \
t1 = W0 * d1 - W1 * d0; \
} while (0)
#define BUTTERFLYADD0(t0,t1,W0,W1,d0,d1)\
do { \
t0 += W0 * d0 + W1 * d1; \
t1 += W0 * d1 - W1 * d0; \
} while (0)
static inline void idctSparseColAdd (uint8_t *dest, int line_size,
DCTELEM * col)
{
int a0, a1, a2, a3;
float b0, b1, b2, b3;
uint8_t *cm = cropTbl + MAX_NEG_CROP;
int d0,d4=W4*col[8*4];
float d1=col[8*1],d3=col[8*3],d5=col[8*5],d7=col[8*7];
/* XXX: I did that only to give same values as previous code */
a0 = W4 * col[8*0] + (1<<(COL_SHIFT-1));
a1 = a0;
BUTTERFLY0(b0,b3,d1,d7,W1,W7);
a0 += d4;
a1 -= d4;
BUTTERFLY0(b2,b1,d1,d7,W5,W3);
a3 = a0;
a2 = a1;
BUTTERFLYADD0(b3,b0,d3,d5,-W5,W3);
BUTTERFLY(d0,d4,col[8*2],col[8*6],W2,W6);
BUTTERFLYADD0(b1,b2,d3,d5,-W7,-W1);
a0 += d0;
a1 += d4;
a2 -= d4;
a3 -= d0;
dest[0] = cm[dest[0] + ((a0 + (int)b0) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a1 + (int)b1) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a2 + (int)b2) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a3 + (int)b3) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a3 - (int)b3) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a2 - (int)b2) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a1 - (int)b1) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a0 - (int)b0) >> COL_SHIFT)];
}
do { \
int tmp = W0 * (d0 + d1); \
t0 = tmp + (W1 - W0) * d1; \
t1 = tmp - (W1 + W0) * d0; \
} while (0)
#define BUTTERFLY0(t0,t1,W0,W1,d0,d1) \
do { \
t0 = W0 * d0 + W1 * d1; \
t1 = W0 * d1 - W1 * d0; \
} while (0)
#define BUTTERFLYADD0(t0,t1,W0,W1,d0,d1)\
do { \
t0 += W0 * d0 + W1 * d1; \
t1 += W0 * d1 - W1 * d0; \
} while (0)
static inline void idctSparseColAdd (uint8_t *dest, int line_size,
DCTELEM * col)
{
int a0, a1, a2, a3;
float b0, b1, b2, b3;
uint8_t *cm = cropTbl + MAX_NEG_CROP;
int d0,d4=W4*col[8*4];
float d1=col[8*1],d3=col[8*3],d5=col[8*5],d7=col[8*7];
/* XXX: I did that only to give same values as previous code */
a0 = W4 * col[8*0] + (1<<(COL_SHIFT-1));
a1 = a0;
BUTTERFLY0(b0,b3,d1,d7,W1,W7);
a0 += d4;
a1 -= d4;
BUTTERFLY0(b2,b1,d1,d7,W5,W3);
a3 = a0;
a2 = a1;
BUTTERFLYADD0(b3,b0,d3,d5,-W5,W3);
BUTTERFLY(d0,d4,col[8*2],col[8*6],W2,W6);
BUTTERFLYADD0(b1,b2,d3,d5,-W7,-W1);
a0 += d0;
a1 += d4;
a2 -= d4;
a3 -= d0;
dest[0] = cm[dest[0] + ((a0 + (int)b0) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a1 + (int)b1) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a2 + (int)b2) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a3 + (int)b3) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a3 - (int)b3) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a2 - (int)b2) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a1 - (int)b1) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a0 - (int)b0) >> COL_SHIFT)];
}
All the calculation involving the 'b' coefficients are floats and all the 'a's are integers. Got some temp storage variables as well, since the R10000 has a large number of registers, so this never hurts.
The BUTTERFLY0 macro is for the floats, because that gives the compiler the chance to issue the madd instructions and the normal BUTTERFLY macro is for the integers, because of the saving of one multiplication. Oh and the order of macro's for the float series (b) is independent from the integer series (a), which makes it easy for the compiler/CPU to balance the load between the two ALU's
If you look carefully look at the assembly output ('c99 -S'), you can see the weaving of instructions. First the 'old' routine:
Normal
Code:
# Program Unit: idctSparseColAdd
.ent idctSparseColAdd
idctSparseColAdd: # 0x340
.dynsym idctSparseColAdd sto_default
.frame $sp, 80, $31
# lgra_spill_temp_9 = 0
# lgra_spill_temp_10 = 8
# lgra_spill_temp_11 = 16
# lgra_spill_temp_12 = 24
# lra_spill_temp_13 = 32
# lra_spill_temp_14 = 40
# lra_spill_temp_15 = 48
# lra_spill_temp_16 = 56
# lra_spill_temp_17 = 64
# lra_spill_temp_18 = 72
.loc 1 255 1
# 251 }
# 252
# 253 static inline void idctSparseColAdd (uint8_t *dest, int line_size,
# 254 DCTELEM * col)
# 255 {
.BB1.idctSparseColAdd: # 0x340
#<freq>
#<freq> BB:1 frequency = 1.00000 (heuristic)
#<freq>
addiu $sp,$sp,-80 # [0]
sd $16,0($sp) # [1] lgra_spill_temp_9
.loc 1 265 9
# 261 a1 = a0;
# 262 a2 = a0;
# 263 a3 = a0;
# 264
# 265 a0 += + W2*col[8*2];
lh $16,32($6) # [0] id:201
addiu $10,$0,21407 # [0]
mult $16,$10 # [2]
.loc 1 266 9
# 266 a1 += + W6*col[8*2];
addiu $12,$0,8867 # [1]
.loc 1 255 1
sd $19,24($sp) # [2] lgra_spill_temp_12
.loc 1 265 9
mflo $19 # [8]
nop # [2]
nop # [2]
.loc 1 266 9
.
.
.
.loc 1 322 9
lbu $19,0($18) # [163] id:229
sra $24,$24,20 # [160]
.loc 1 323 1
# 323 }
ld $16,0($sp) # [164] lgra_spill_temp_9
.loc 1 322 9
addu $19,$19,$24 # [165]
addu $17,$17,$19 # [166]
.loc 1 323 1
ld $19,24($sp) # [165] lgra_spill_temp_12
.loc 1 322 9
lbu $17,384($17) # [168] id:230 cropTbl+0x0
sb $17,0($18) # [169] id:231
.loc 1 323 1
ld $17,8($sp) # [170] lgra_spill_temp_10
ld $18,16($sp) # [171] lgra_spill_temp_11
jr $31 # [162]
addiu $sp,$sp,80 # [162]
.end idctSparseColAdd
.section .text
.ent idctSparseColAdd
idctSparseColAdd: # 0x340
.dynsym idctSparseColAdd sto_default
.frame $sp, 80, $31
# lgra_spill_temp_9 = 0
# lgra_spill_temp_10 = 8
# lgra_spill_temp_11 = 16
# lgra_spill_temp_12 = 24
# lra_spill_temp_13 = 32
# lra_spill_temp_14 = 40
# lra_spill_temp_15 = 48
# lra_spill_temp_16 = 56
# lra_spill_temp_17 = 64
# lra_spill_temp_18 = 72
.loc 1 255 1
# 251 }
# 252
# 253 static inline void idctSparseColAdd (uint8_t *dest, int line_size,
# 254 DCTELEM * col)
# 255 {
.BB1.idctSparseColAdd: # 0x340
#<freq>
#<freq> BB:1 frequency = 1.00000 (heuristic)
#<freq>
addiu $sp,$sp,-80 # [0]
sd $16,0($sp) # [1] lgra_spill_temp_9
.loc 1 265 9
# 261 a1 = a0;
# 262 a2 = a0;
# 263 a3 = a0;
# 264
# 265 a0 += + W2*col[8*2];
lh $16,32($6) # [0] id:201
addiu $10,$0,21407 # [0]
mult $16,$10 # [2]
.loc 1 266 9
# 266 a1 += + W6*col[8*2];
addiu $12,$0,8867 # [1]
.loc 1 255 1
sd $19,24($sp) # [2] lgra_spill_temp_12
.loc 1 265 9
mflo $19 # [8]
nop # [2]
nop # [2]
.loc 1 266 9
.
.
.
.loc 1 322 9
lbu $19,0($18) # [163] id:229
sra $24,$24,20 # [160]
.loc 1 323 1
# 323 }
ld $16,0($sp) # [164] lgra_spill_temp_9
.loc 1 322 9
addu $19,$19,$24 # [165]
addu $17,$17,$19 # [166]
.loc 1 323 1
ld $19,24($sp) # [165] lgra_spill_temp_12
.loc 1 322 9
lbu $17,384($17) # [168] id:230 cropTbl+0x0
sb $17,0($18) # [169] id:231
.loc 1 323 1
ld $17,8($sp) # [170] lgra_spill_temp_10
ld $18,16($sp) # [171] lgra_spill_temp_11
jr $31 # [162]
addiu $sp,$sp,80 # [162]
.end idctSparseColAdd
.section .text
171 clockticks! Because all the ops are integers, the integer ALU gets extremely busy, thereby stalling the throughput.
Optimised
Code:
# Program Unit: idctSparseColAdd
.ent idctSparseColAdd
idctSparseColAdd: # 0x200
.dynsym idctSparseColAdd sto_default
.frame $sp, 0, $31
.loc 1 204 1
# 200 }
# 201
# 202 static inline void idctSparseColAdd (uint8_t *dest, int line_size,
# 203 DCTELEM * col)
# 204 {
.BB1.idctSparseColAdd: # 0x200
#<freq>
#<freq> BB:1 frequency = 1.00000 (heuristic)
#<freq>
.loc 1 209 32
# 205 int a0, a1, a2, a3;
# 206 float b0, b1, b2, b3;
# 207 uint8_t *cm = cropTbl + MAX_NEG_CROP;
# 208 int d0,d4=W4*col[8*4];
# 209 float d1=col[8*1],d3=col[8*3],d5=col[8*5],d7=col[8*7];
lh $8,80($6) # [0] id:169
.loc 1 209 20
lh $10,48($6) # [1] id:168
.loc 1 209 32
mtc1 $8,$f0 # [2]
.loc 1 209 8
lh $12,16($6) # [2] id:167
.loc 1 209 20
mtc1 $10,$f7 # [3]
.loc 1 224 9
# 220 a3 = a0;
# 221 a2 = a1;
# 222
# 223 BUTTERFLYADD0(b3,b0,d3,d5,-W5,W3);
# 224 BUTTERFLY(d0,d4,col[8*2],col[8*6],W2,W6);
lh $1,32($6) # [3] id:172
addiu $9,$0,30274 # [1]
.loc 1 209 44
lh $11,112($6) # [4] id:170
.loc 1 209 32
cvt.s.w $f0,$f0 # [5]
.loc 1 224 9
mult $1,$9 # [5]
.loc 1 209 8
mtc1 $12,$f1 # [4]
.
subu $15,$15,$25 # [13]
.loc 1 234 9
madd.s $f9,$f9,$f0,$f6 # [18] <== Float op [18], combined with integer op [18]
.loc 1 232 9
trunc.w.s $f13,$f13 # [21]
.loc 1 217 9
subu $2,$2,$3 # [18] <== Integer op [18], combined with float op [18]
.loc 1 236 9
# 235 dest += line_size;
# 236 dest[0] = cm[dest[0] + ((a2 + (int)b2) >> COL_SHIFT)];
mul.s $f6,$f7,$f6 # [19] <== Float op [19], combined with integer op [19]
.loc 1 217 9
lui $3,8 # [15]
addu $2,$15,$2 # [19] <== Integer op [19], combined with float op [19]
.loc 1 224 9
addu $9,$24,$9 # [18] <== Integer op [18], combined with float op [18]
.loc 1 232 9
mfc1 $7,$f13 # [23]
.loc 1 217 9
addu $2,$2,$3 # [20] <== Integer op [20], combined with float op [20]
.loc 1 236 9
nmsub.s $f6,$f6,$f0,$f11 # [20] <== Float op [20], combined with integer op [20]
.loc 1 227 9
.
lbu $8,384($8) # [48] id:193 cropTbl+0x0
.loc 1 245 9
# 245 dest += line_size;
addu $2,$5,$9 # [45]
.loc 1 244 9
sb $8,0($9) # [49] id:194
.loc 1 246 9
# 246 dest[0] = cm[dest[0] + ((a0 - (int)b0) >> COL_SHIFT)];
subu $4,$6,$7 # [46]
lbu $3,0($2) # [50] id:195
sra $4,$4,20 # [47]
addu $3,$3,$4 # [52]
addu $1,$1,$3 # [53]
lbu $1,384($1) # [54] id:196 cropTbl+0x0
.loc 1 247 1
# 247 }
jr $31 # [48]
.loc 1 246 9
sb $1,0($2) # [55] id:197
.end idctSparseColAdd
.section .text
.align 6
.ent idctSparseColAdd
idctSparseColAdd: # 0x200
.dynsym idctSparseColAdd sto_default
.frame $sp, 0, $31
.loc 1 204 1
# 200 }
# 201
# 202 static inline void idctSparseColAdd (uint8_t *dest, int line_size,
# 203 DCTELEM * col)
# 204 {
.BB1.idctSparseColAdd: # 0x200
#<freq>
#<freq> BB:1 frequency = 1.00000 (heuristic)
#<freq>
.loc 1 209 32
# 205 int a0, a1, a2, a3;
# 206 float b0, b1, b2, b3;
# 207 uint8_t *cm = cropTbl + MAX_NEG_CROP;
# 208 int d0,d4=W4*col[8*4];
# 209 float d1=col[8*1],d3=col[8*3],d5=col[8*5],d7=col[8*7];
lh $8,80($6) # [0] id:169
.loc 1 209 20
lh $10,48($6) # [1] id:168
.loc 1 209 32
mtc1 $8,$f0 # [2]
.loc 1 209 8
lh $12,16($6) # [2] id:167
.loc 1 209 20
mtc1 $10,$f7 # [3]
.loc 1 224 9
# 220 a3 = a0;
# 221 a2 = a1;
# 222
# 223 BUTTERFLYADD0(b3,b0,d3,d5,-W5,W3);
# 224 BUTTERFLY(d0,d4,col[8*2],col[8*6],W2,W6);
lh $1,32($6) # [3] id:172
addiu $9,$0,30274 # [1]
.loc 1 209 44
lh $11,112($6) # [4] id:170
.loc 1 209 32
cvt.s.w $f0,$f0 # [5]
.loc 1 224 9
mult $1,$9 # [5]
.loc 1 209 8
mtc1 $12,$f1 # [4]
.
subu $15,$15,$25 # [13]
.loc 1 234 9
madd.s $f9,$f9,$f0,$f6 # [18] <== Float op [18], combined with integer op [18]
.loc 1 232 9
trunc.w.s $f13,$f13 # [21]
.loc 1 217 9
subu $2,$2,$3 # [18] <== Integer op [18], combined with float op [18]
.loc 1 236 9
# 235 dest += line_size;
# 236 dest[0] = cm[dest[0] + ((a2 + (int)b2) >> COL_SHIFT)];
mul.s $f6,$f7,$f6 # [19] <== Float op [19], combined with integer op [19]
.loc 1 217 9
lui $3,8 # [15]
addu $2,$15,$2 # [19] <== Integer op [19], combined with float op [19]
.loc 1 224 9
addu $9,$24,$9 # [18] <== Integer op [18], combined with float op [18]
.loc 1 232 9
mfc1 $7,$f13 # [23]
.loc 1 217 9
addu $2,$2,$3 # [20] <== Integer op [20], combined with float op [20]
.loc 1 236 9
nmsub.s $f6,$f6,$f0,$f11 # [20] <== Float op [20], combined with integer op [20]
.loc 1 227 9
.
lbu $8,384($8) # [48] id:193 cropTbl+0x0
.loc 1 245 9
# 245 dest += line_size;
addu $2,$5,$9 # [45]
.loc 1 244 9
sb $8,0($9) # [49] id:194
.loc 1 246 9
# 246 dest[0] = cm[dest[0] + ((a0 - (int)b0) >> COL_SHIFT)];
subu $4,$6,$7 # [46]
lbu $3,0($2) # [50] id:195
sra $4,$4,20 # [47]
addu $3,$3,$4 # [52]
addu $1,$1,$3 # [53]
lbu $1,384($1) # [54] id:196 cropTbl+0x0
.loc 1 247 1
# 247 }
jr $31 # [48]
.loc 1 246 9
sb $1,0($2) # [55] id:197
.end idctSparseColAdd
.section .text
.align 6
Tadaaa, 55 ticks! Amazing what a little software pipelining can do for your code!
Running speedshop again proves it:
Code:
Summary of statistical PC sampling data (fpcsampx)--
49298: Total samples
49.298: Accumulated time (secs.)
1.0: Time per sample (msecs.)
4: Sample bin width (bytes)
-------------------------------------------------------------------------
Function list, in descending order by time
-------------------------------------------------------------------------
[index] secs % cum.% samples function (dso: file, line)
[1] 20.239 41.1% 41.1% 20239 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 319)
[2] 7.177 14.6% 55.6% 7177 idctSparseColAdd (mplayer: simple_idct.c, 209)
[3] 4.454 9.0% 64.6% 4454 put_pixels8_c (mplayer: dsputil.c, 897)
[4] 2.382 4.8% 69.5% 2382 msmpeg4_decode_block (mplayer: msmpeg4.c, 1676)
[5] 1.948 4.0% 73.4% 1948 idctRowCondDC (mplayer: simple_idct.c, 104)
[6] 1.551 3.1% 76.6% 1551 simple_idct_put (mplayer: simple_idct.c, 313)
[7] 1.301 2.6% 79.2% 1301 put_pixels16_xy2_c (mplayer: dsputil.c, 897)
IDCT is a bit more scattered now, three routines instead of one, but added up (idctSparseColAdd +idctRowCondDC + simple_idct_put) gives 21.7% which is half the time of yuv2rgb_c_24_rgb(41.1%). Compared that with the starting speedshop run, this is a 50% reduction in time spent in IDCT! Looking at the total times. 64.0 versus 49.3 seconds is 23% speedup of the app. Whoa! Granted, it's only for this .avi file and this specific IDCT. Other codecs need other routines, but it's a start.
Well, hope you have read through it and picked up some ideas. Next time i'll be looking at yuv2rgb.c software wise.
The patch for libavcodec/simple_idct.c is now living at http://www.mechanics.citg.tudelft.nl/~e ... pre4.patch get it and try it. Schleusel and i will try to pester MPlayer CVS guru's to get us some libavcodec/mips subdirectory where we can store this.
%-)