SGI: Development

MPlayer 1.0pre4 IDCT Optimisation (LONG POST!)

Hi all,

Sorry for not being so active on Nekoware builds the last couple of weeks, but i really wanted to take part in Schleusel's and Vegac's attempts in making MPlayer just a little bit faster, so i can watch neato movie stuff on my I2 Impact ;) . It has cost me a lot of time, but boy, am i glad i spent it with them and MPlayer. I will show you what i did, so maybe you can learn from my trials of getting that app speed up it's framerate. The optimisation is not done yet, it still ongoing, and we only just begun searching out the possibilities of hardware colorspace conversion, but the methods behind the software optimisation part has now been understood. Here it comes. I hope Neko won't mind me breaking the record of longest post ever on Nekochan. Beer is in the mail, Pete :P

Since my first speedshop run, in the mplayer 1.0pre3 thread viewtopic.php?t=1374 i've read man pages of ssrun just to get myself acquainted with the most used options. Instead of 'ssrun -exp totaltime' or 'ssrun -exp usertime' i now do 'ssrun -exp fpcsampx' to get my Finegrained-ProgramCounter-SAMPling-with-4-bytes timing for all the routines which MPlayer is busy with. So lets do a standard MIPSPro 7.4.2 '-O3 -r10000 -mips4 -n32' build of MPlayer 1.0-pre4, run 'ssrun -exp fpcsampx ./mplayer' with a standard .avi file (Schleusel and i use "courtyard.avi" from a Call Of Duty Demo video (11.4MB), to be found on the net). Machine is an R10K@180MHz Origin200:

Code:
ssrun -exp fpcsampx ./mplayer -vo null -vf format=rgb24 -nosound courtyard.avi
prof -lines mplayer.fpcsampx.#


Code:
-------------------------------------------------------------------------
SpeedShop profile listing generated Thu Jul 22 09:54:04 2004

prof -lines mplayer.fpcsampx.m249164

mplayer (n32): Target program
fpcsampx: Experiment name
pc,4,1000,0:cu: Marching orders
R10000 / R10010: CPU / FPU
4: Number of CPUs
180: Clock frequency (MHz.)
Experiment notes--
From file mplayer.fpcsampx.m249164:
Caliper point 0 at target begin, PID 249164
/usr1/local/everdij/MPlayer-1.0pre4/mplayer -vo null -vf format=rgb24 -nosound courtyard.avi
Caliper point 1 at exit(0)
-------------------------------------------------------------------------
Summary of statistical PC sampling data (fpcsampx)--
64010: Total samples
64.010: Accumulated time (secs.)
1.0: Time per sample (msecs.)
4: Sample bin width (bytes)
-------------------------------------------------------------------------
Function list, in descending order by time
-------------------------------------------------------------------------
[index]      secs    %    cum.%   samples  function (dso: file, line)

[1]    21.754  34.0%  34.0%     21754  simple_idct_add (mplayer: simple_idct.c, 399)
[2]    20.301  31.7%  65.7%     20301  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 319)
[3]     4.593   7.2%  72.9%      4593  put_pixels8_c (mplayer: dsputil.c, 897)
[4]     3.701   5.8%  78.7%      3701  simple_idct_put (mplayer: simple_idct.c, 389)
[5]     2.402   3.8%  82.4%      2402  msmpeg4_decode_block (mplayer: msmpeg4.c, 1676)
[6]     1.323   2.1%  84.5%      1323  put_pixels16_xy2_c (mplayer: dsputil.c, 897)
[7]     1.214   1.9%  86.4%      1214  memset (libc.so.1: stat.c, 32; compiled in bzero.s)


and so forth.... Note the total time, which is 64 seconds. One third of the program's time is used in the InverseDiscreteCosineTransform adition, another third in the colorspace converter yuv2rgb_c_24_rgb and the rest in the rest.
So Schleusel, Vegac and me sorta divided up the tasks. Schleusel tried some optimisations flags (-IPA) and porting details, Vegac had a look at the colorspace conversion and possible OpenGL O2/ICE speedups, and i got libavcodec/simple_idct.c :) simple_idct_add is just a small routine, consisting of two parts of each 8 'for' loops, which is basically the way how a double 1D-IDCT works. First the rows are transformed, and then the colums are transformed. The fransformed values are added to the already existing image, hence 'add'. For more info, read up on IDCT:

(1) http://skal.planet-d.net/coding/dct.html
(2) http://www-vs.informatik.uni-ulm.de/bib ... paper.html
(3) http://rnvs.informatik.tu-chemnitz.de/~ ... /IDCT.html

Here's the interesting part:
Code:
/* signed 16x16 -> 32 multiply add accumulate */
#define W1  22725  //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
#define W2  21407  //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
#define W3  19266  //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
#define W4  16383  //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
#define W5  12873  //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
#define W6  8867   //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
#define W7  4520   //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
#define ROW_SHIFT 11
#define COL_SHIFT 20 // 6

#define MAC16(rt, ra, rb) rt += (ra) * (rb)

/* signed 16x16 -> 32 multiply */
#define MUL16(rt, ra, rb) rt = (ra) * (rb)

static inline void idctSparseColAdd (uint8_t *dest, int line_size,
DCTELEM * col)
{
int a0, a1, a2, a3, b0, b1, b2, b3;
uint8_t *cm = cropTbl + MAX_NEG_CROP;

/* XXX: I did that only to give same values as previous code */
a0 = W4 * (col[8*0] + ((1<<(COL_SHIFT-1))/W4));
a1 = a0;
a2 = a0;
a3 = a0;

a0 +=  + W2*col[8*2];
a1 +=  + W6*col[8*2];
a2 +=  - W6*col[8*2];
a3 +=  - W2*col[8*2];

MUL16(b0, W1, col[8*1]);
MUL16(b1, W3, col[8*1]);
MUL16(b2, W5, col[8*1]);
MUL16(b3, W7, col[8*1]);

MAC16(b0, + W3, col[8*3]);
MAC16(b1, - W7, col[8*3]);
MAC16(b2, - W1, col[8*3]);
MAC16(b3, - W5, col[8*3]);

if(col[8*4]){
a0 += + W4*col[8*4];
a1 += - W4*col[8*4];
a2 += - W4*col[8*4];
a3 += + W4*col[8*4];
}

if (col[8*5]) {
MAC16(b0, + W5, col[8*5]);
MAC16(b1, - W1, col[8*5]);
MAC16(b2, + W7, col[8*5]);
MAC16(b3, + W3, col[8*5]);
}

if(col[8*6]){
a0 += + W6*col[8*6];
a1 += - W2*col[8*6];
a2 += + W2*col[8*6];
a3 += - W6*col[8*6];
}

if (col[8*7]) {
MAC16(b0, + W7, col[8*7]);
MAC16(b1, - W5, col[8*7]);
MAC16(b2, + W3, col[8*7]);
MAC16(b3, - W1, col[8*7]);
}

dest[0] = cm[dest[0] + ((a0 + b0) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a1 + b1) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a2 + b2) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a3 + b3) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a3 - b3) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a2 - b2) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a1 - b1) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a0 - b0) >> COL_SHIFT)];
}


At first glance i thought, what a lot of branches/decisions! If this routine is to become fast, it has to get rid of all those branches. Careful observing shows that the 'if' statements can be removed safely, because the condition (col[8*x]) is true when col[8*x]!=0. But if col[8*x]==0 then nothing is added or subtraced to the coefficients inside the 'if' statement anyway, so the if statement is superfluous:

Code:
if(col[8*4]){
a0 += + W4*col[8*4];
a1 += - W4*col[8*4];
a2 += - W4*col[8*4];
a3 += + W4*col[8*4];
}

if (col[8*5]) {
MAC16(b0, + W5, col[8*5]);
MAC16(b1, - W1, col[8*5]);
MAC16(b2, + W7, col[8*5]);
MAC16(b3, + W3, col[8*5]);
}

if(col[8*6]){
a0 += + W6*col[8*6];
a1 += - W2*col[8*6];
a2 += + W2*col[8*6];
a3 += - W6*col[8*6];
}

if (col[8*7]) {
MAC16(b0, + W7, col[8*7]);
MAC16(b1, - W5, col[8*7]);
MAC16(b2, + W3, col[8*7]);
MAC16(b3, - W1, col[8*7]);
}


will become:
Code:
a0 += + W4*col[8*4];
a1 += - W4*col[8*4];
a2 += - W4*col[8*4];
a3 += + W4*col[8*4];
MAC16(b0, + W5, col[8*5]);
MAC16(b1, - W1, col[8*5]);
MAC16(b2, + W7, col[8*5]);
MAC16(b3, + W3, col[8*5]);
a0 += + W6*col[8*6];
a1 += - W2*col[8*6];
a2 += + W2*col[8*6];
a3 += - W6*col[8*6];
MAC16(b0, + W7, col[8*7]);
MAC16(b1, - W5, col[8*7]);
MAC16(b2, + W3, col[8*7]);
MAC16(b3, - W1, col[8*7]);


So now i have to do more instructions, but will it weigh up to the time spent in those conditions? Answer later.

When reading (1) it becomes clear that these are indeed matrix operations of which 4 of them are so called 'rotations' which involves cosines. The cosine coefficients are all in separate '#defines' and converted to integers, so that makes this routine a 'fast-integer 1D-IDCT' Now write out all the multiplications and coefficients:
Code:
a0 = W4 * (col[8*0] + ((1<<(COL_SHIFT-1))/W4));
a1 = a0;
a2 = a0;
a3 = a0;

a0 +=  + W2*col[8*2];
a1 +=  + W6*col[8*2];
a2 +=  - W6*col[8*2];
a3 +=  - W2*col[8*2];

MUL16(b0, W1, col[8*1]);
MUL16(b1, W3, col[8*1]);
MUL16(b2, W5, col[8*1]);
MUL16(b3, W7, col[8*1]);

MAC16(b0, + W3, col[8*3]);
MAC16(b1, - W7, col[8*3]);
MAC16(b2, - W1, col[8*3]);
MAC16(b3, - W5, col[8*3]);
a0 += + W4*col[8*4];
a1 += - W4*col[8*4];
a2 += - W4*col[8*4];
a3 += + W4*col[8*4];
MAC16(b0, + W5, col[8*5]);
MAC16(b1, - W1, col[8*5]);
MAC16(b2, + W7, col[8*5]);
MAC16(b3, + W3, col[8*5]);
a0 += + W6*col[8*6];
a1 += - W2*col[8*6];
a2 += + W2*col[8*6];
a3 += - W6*col[8*6];
MAC16(b0, + W7, col[8*7]);
MAC16(b1, - W5, col[8*7]);
MAC16(b2, + W3, col[8*7]);
MAC16(b3, - W1, col[8*7]);

=
Code:
a0  = W4 * col[8*0] + (1<<(COL_SHIFT-1));
a1  = a0;
a2  = a0;
a3  = a0;

a0 += W4*col[8*4];
a1 -= W4*col[8*4];
a2 -= W4*col[8*4];
a3 += W4*col[8*4];

a0 += col[8*2]*W2;
a1 += col[8*2]*W6;
a2 -= col[8*2]*W6;
a3 -= col[8*2]*W2;
a0 += col[8*6]*W6;
a1 -= col[8*6]*w2;
a2 += col[8*6]*W2;
a3 -= col[8*6]*W6;

b0  = col[8*1]*W1;
b1  = col[8*1]*W3;
b2  = col[8*1]*W5;
b3  = col[8*1]*W7;
b0 += col[8*3]*W3;
b1 -= col[8*3]*W7;
b2 -= col[8*3]*W1;
b3 -= col[8*3]*W5;
b0 += col[8*5]*W5;
b1 -= col[8*5]*W1;
b2 += col[8*5]*W7;
b3 += col[8*5]*W3;
b0 += col[8*7]*W7;
b1 -= col[8*7]*W5;
b2 += col[8*7]*W3;
b3 -= col[8*7]*W1;

=
Code:
int d0,d2=col[8*2],d4=W4*col[8*4],d6=col[8*6];
int d1=col[8*1],d3=col[8*3],d5=col[8*5],d7=col[8*7];

a0  = W4 * col[8*0] + (1<<(COL_SHIFT-1));
a1  = a0;
a0 += d4;
a1 -= d4;
a3  = a0;
a2  = a1;

a0 += d2*W2 + d6*W6;
a1 += d2*W6 - d6*W2;
a2 +=-d2*W6 + d6*W2;
a3 +=-d2*W2 - d6*W6;

b0  = d1*W1 + d7*W7;
b3  = d1*W7 - d7*W1;
b2  = d1*W5 + d7*W3;
b1  = d1*W3 - d7*W5;

b0 += d3*W3 + d5*W5;
b3 +=-d3*W5 + d5*W3;
b1 +=-d3*W7 - d5*W1;
b2 +=-d3*W1 + d5*W7;


So after a lot of wizardry, i'm left with some fairly symmetric multiplications. Now comes the clever part. Also from (1), A multiplication of the form:
Code:
t0 = W0 * d0 + W1 * d1;
t1 = W0 * d1 - W1 * d0;

can also be written as:
Code:
int tmp = W0 * (d0 + d1);
t0 = tmp + (W1 - W0) * d1;
t1 = tmp - (W1 + W0) * d0;

which saves you one expensive multiplication per 2x2 matrix multiplication. This sort of 2x2 matrix multiplication BTW is supposed to be called a Butterfly, because of the butterfly shape of the diagrammatic form.

This is the reason why i had to get rid of those 'if' branches in the beginning. I couldn't have written out those butterflies without those pieces inside an 'if'. In this case it is worthwile, but one always has to test these things carefully.

Enter the SIMD (Single Instruction on Multiple Data) instructions on the MIPS 4 Instruction set.
Code:
madd  ==>  a = a + b*c
nmsub ==>  a = a - b*c

These instruction perform a multiplication and addition/subtraction in one go! They are only to be found on mips4 instruction sets, so you gotta compile with -mips4 to get them. Also they are only for floats, either singles and doubles, not for integers! :(

BUT, the R10000 is a pretty nifty piece of machinery. It is blessed with both an Integer ALU and floating point ALU and can issue 2 integer ops and 1 floating point ops in one clocktick. So what if we substitute a part of the calculation with floats instead of integers? Maybe the compiler can then weave these two "threads" together, with the added bonus of 'madd'/'nmsub' instructions for the floating point part. So i devised this:
Code:
#define BUTTERFLY(t0,t1,W0,W1,d0,d1)    \
do {                                    \
int tmp = W0 * (d0 + d1);           \
t0 = tmp + (W1 - W0) * d1;          \
t1 = tmp - (W1 + W0) * d0;          \
} while (0)

#define BUTTERFLY0(t0,t1,W0,W1,d0,d1)   \
do {                                    \
t0 = W0 * d0 + W1 * d1;             \
t1 = W0 * d1 - W1 * d0;             \
} while (0)

#define BUTTERFLYADD0(t0,t1,W0,W1,d0,d1)\
do {                                    \
t0 += W0 * d0 + W1 * d1;            \
t1 += W0 * d1 - W1 * d0;            \
} while (0)

static inline void idctSparseColAdd (uint8_t *dest, int line_size,
DCTELEM * col)
{
int a0, a1, a2, a3;
float b0, b1, b2, b3;
uint8_t *cm = cropTbl + MAX_NEG_CROP;
int d0,d4=W4*col[8*4];
float d1=col[8*1],d3=col[8*3],d5=col[8*5],d7=col[8*7];

/* XXX: I did that only to give same values as previous code */

a0 = W4 * col[8*0] + (1<<(COL_SHIFT-1));
a1 = a0;
BUTTERFLY0(b0,b3,d1,d7,W1,W7);

a0 += d4;
a1 -= d4;
BUTTERFLY0(b2,b1,d1,d7,W5,W3);
a3 = a0;
a2 = a1;
BUTTERFLYADD0(b3,b0,d3,d5,-W5,W3);
BUTTERFLY(d0,d4,col[8*2],col[8*6],W2,W6);
BUTTERFLYADD0(b1,b2,d3,d5,-W7,-W1);

a0 += d0;
a1 += d4;
a2 -= d4;
a3 -= d0;

dest[0] = cm[dest[0] + ((a0 + (int)b0) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a1 + (int)b1) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a2 + (int)b2) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a3 + (int)b3) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a3 - (int)b3) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a2 - (int)b2) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a1 - (int)b1) >> COL_SHIFT)];
dest += line_size;
dest[0] = cm[dest[0] + ((a0 - (int)b0) >> COL_SHIFT)];
}

All the calculation involving the 'b' coefficients are floats and all the 'a's are integers. Got some temp storage variables as well, since the R10000 has a large number of registers, so this never hurts.
The BUTTERFLY0 macro is for the floats, because that gives the compiler the chance to issue the madd instructions and the normal BUTTERFLY macro is for the integers, because of the saving of one multiplication. Oh and the order of macro's for the float series (b) is independent from the integer series (a), which makes it easy for the compiler/CPU to balance the load between the two ALU's

If you look carefully look at the assembly output ('c99 -S'), you can see the weaving of instructions. First the 'old' routine:
Normal
Code:
# Program Unit: idctSparseColAdd
.ent    idctSparseColAdd
idctSparseColAdd:       # 0x340
.dynsym idctSparseColAdd        sto_default
.frame  $sp, 80, $31
# lgra_spill_temp_9 = 0
# lgra_spill_temp_10 = 8
# lgra_spill_temp_11 = 16
# lgra_spill_temp_12 = 24
# lra_spill_temp_13 = 32
# lra_spill_temp_14 = 40
# lra_spill_temp_15 = 48
# lra_spill_temp_16 = 56
# lra_spill_temp_17 = 64
# lra_spill_temp_18 = 72
.loc    1 255 1
# 251  }
# 252
# 253  static inline void idctSparseColAdd (uint8_t *dest, int line_size,
# 254                                       DCTELEM * col)
# 255  {
.BB1.idctSparseColAdd:  # 0x340
#<freq>
#<freq> BB:1 frequency = 1.00000 (heuristic)
#<freq>
addiu $sp,$sp,-80               # [0]
sd $16,0($sp)                   # [1]  lgra_spill_temp_9
.loc    1 265 9
# 261          a1 = a0;
# 262          a2 = a0;
# 263          a3 = a0;
# 264
# 265          a0 +=  + W2*col[8*2];
lh $16,32($6)                   # [0]  id:201
addiu $10,$0,21407              # [0]
mult $16,$10                    # [2]
.loc    1 266 9
# 266          a1 +=  + W6*col[8*2];
addiu $12,$0,8867               # [1]
.loc    1 255 1
sd $19,24($sp)                  # [2]  lgra_spill_temp_12
.loc    1 265 9
mflo $19                        # [8]
nop                             # [2]
nop                             # [2]
.loc    1 266 9
.
.
.
.loc    1 322 9
lbu $19,0($18)                  # [163]  id:229
sra $24,$24,20                  # [160]
.loc    1 323 1
# 323  }
ld $16,0($sp)                   # [164]  lgra_spill_temp_9
.loc    1 322 9
addu $19,$19,$24                # [165]
addu $17,$17,$19                # [166]
.loc    1 323 1
ld $19,24($sp)                  # [165]  lgra_spill_temp_12
.loc    1 322 9
lbu $17,384($17)                # [168]  id:230 cropTbl+0x0
sb $17,0($18)                   # [169]  id:231
.loc    1 323 1
ld $17,8($sp)                   # [170]  lgra_spill_temp_10
ld $18,16($sp)                  # [171]  lgra_spill_temp_11
jr $31                          # [162]
addiu $sp,$sp,80                # [162]
.end    idctSparseColAdd
.section .text


171 clockticks! Because all the ops are integers, the integer ALU gets extremely busy, thereby stalling the throughput.

Optimised
Code:
# Program Unit: idctSparseColAdd
.ent    idctSparseColAdd
idctSparseColAdd:       # 0x200
.dynsym idctSparseColAdd        sto_default
.frame  $sp, 0, $31
.loc    1 204 1
# 200  }
# 201
# 202  static inline void idctSparseColAdd (uint8_t *dest, int line_size,
# 203                                       DCTELEM * col)
# 204  {
.BB1.idctSparseColAdd:  # 0x200
#<freq>
#<freq> BB:1 frequency = 1.00000 (heuristic)
#<freq>
.loc    1 209 32
# 205          int a0, a1, a2, a3;
# 206          float b0, b1, b2, b3;
# 207          uint8_t *cm = cropTbl + MAX_NEG_CROP;
# 208          int d0,d4=W4*col[8*4];
# 209          float d1=col[8*1],d3=col[8*3],d5=col[8*5],d7=col[8*7];
lh $8,80($6)                    # [0]  id:169
.loc    1 209 20
lh $10,48($6)                   # [1]  id:168
.loc    1 209 32
mtc1 $8,$f0                     # [2]
.loc    1 209 8
lh $12,16($6)                   # [2]  id:167
.loc    1 209 20
mtc1 $10,$f7                    # [3]
.loc    1 224 9
# 220          a3 = a0;
# 221          a2 = a1;
# 222
# 223          BUTTERFLYADD0(b3,b0,d3,d5,-W5,W3);
# 224          BUTTERFLY(d0,d4,col[8*2],col[8*6],W2,W6);
lh $1,32($6)                    # [3]  id:172
addiu $9,$0,30274               # [1]
.loc    1 209 44
lh $11,112($6)                  # [4]  id:170
.loc    1 209 32
cvt.s.w $f0,$f0                 # [5]
.loc    1 224 9
mult $1,$9                      # [5]
.loc    1 209 8
mtc1 $12,$f1                    # [4]
.
subu $15,$15,$25                # [13]
.loc    1 234 9
madd.s $f9,$f9,$f0,$f6          # [18]                             <== Float op [18], combined with integer op [18]
.loc    1 232 9
trunc.w.s $f13,$f13             # [21]
.loc    1 217 9
subu $2,$2,$3                   # [18]                             <== Integer op [18], combined with float op [18]
.loc    1 236 9
# 235          dest += line_size;
# 236          dest[0] = cm[dest[0] + ((a2 + (int)b2) >> COL_SHIFT)];
mul.s $f6,$f7,$f6               # [19]                             <== Float op [19], combined with integer op [19]
.loc    1 217 9
lui $3,8                        # [15]
addu $2,$15,$2                  # [19]                             <== Integer op [19], combined with float op [19]
.loc    1 224 9
addu $9,$24,$9                  # [18]                             <== Integer op [18], combined with float op [18]
.loc    1 232 9
mfc1 $7,$f13                    # [23]
.loc    1 217 9
addu $2,$2,$3                   # [20]                             <== Integer op [20], combined with float op [20]
.loc    1 236 9
nmsub.s $f6,$f6,$f0,$f11        # [20]                             <== Float op [20], combined with integer op [20]
.loc    1 227 9
.

lbu $8,384($8)                  # [48]  id:193 cropTbl+0x0
.loc    1 245 9
# 245          dest += line_size;
addu $2,$5,$9                   # [45]
.loc    1 244 9
sb $8,0($9)                     # [49]  id:194
.loc    1 246 9
# 246          dest[0] = cm[dest[0] + ((a0 - (int)b0) >> COL_SHIFT)];
subu $4,$6,$7                   # [46]
lbu $3,0($2)                    # [50]  id:195
sra $4,$4,20                    # [47]
addu $3,$3,$4                   # [52]
addu $1,$1,$3                   # [53]
lbu $1,384($1)                  # [54]  id:196 cropTbl+0x0
.loc    1 247 1
# 247  }
jr $31                          # [48]
.loc    1 246 9
sb $1,0($2)                     # [55]  id:197
.end    idctSparseColAdd
.section .text
.align 6


Tadaaa, 55 ticks! Amazing what a little software pipelining can do for your code!


Running speedshop again proves it:
Code:

Summary of statistical PC sampling data (fpcsampx)--
49298: Total samples
49.298: Accumulated time (secs.)
1.0: Time per sample (msecs.)
4: Sample bin width (bytes)
-------------------------------------------------------------------------
Function list, in descending order by time
-------------------------------------------------------------------------
[index]      secs    %    cum.%   samples  function (dso: file, line)

[1]    20.239  41.1%  41.1%     20239  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 319)
[2]     7.177  14.6%  55.6%      7177  idctSparseColAdd (mplayer: simple_idct.c, 209)
[3]     4.454   9.0%  64.6%      4454  put_pixels8_c (mplayer: dsputil.c, 897)
[4]     2.382   4.8%  69.5%      2382  msmpeg4_decode_block (mplayer: msmpeg4.c, 1676)
[5]     1.948   4.0%  73.4%      1948  idctRowCondDC (mplayer: simple_idct.c, 104)
[6]     1.551   3.1%  76.6%      1551  simple_idct_put (mplayer: simple_idct.c, 313)
[7]     1.301   2.6%  79.2%      1301  put_pixels16_xy2_c (mplayer: dsputil.c, 897)



IDCT is a bit more scattered now, three routines instead of one, but added up (idctSparseColAdd +idctRowCondDC + simple_idct_put) gives 21.7% which is half the time of yuv2rgb_c_24_rgb(41.1%). Compared that with the starting speedshop run, this is a 50% reduction in time spent in IDCT! Looking at the total times. 64.0 versus 49.3 seconds is 23% speedup of the app. Whoa! Granted, it's only for this .avi file and this specific IDCT. Other codecs need other routines, but it's a start.

Well, hope you have read through it and picked up some ideas. Next time i'll be looking at yuv2rgb.c software wise.

The patch for libavcodec/simple_idct.c is now living at http://www.mechanics.citg.tudelft.nl/~e ... pre4.patch get it and try it. Schleusel and i will try to pester MPlayer CVS guru's to get us some libavcodec/mips subdirectory where we can store this.

%-)
The MIPS CPU architecture is a pretty awesome piece of engineering when programmed correctly - this just goes to show it!
Sheer genius! -- I'm envious of your talent and time. :)

Thanks for sharing, there's more than a few good optimization tips there!

Also, don't let this good programming get lost! Build a super nekoware mplayer when you are through (patches and all)!
Damn that's cool. I read the whole thing, and I have to say I am very impressed. Go go go for the mplayer optimisation boys!

Shame I still have no TRAM for my Octane... but then I don't play movies on it. I play them on my old crappy 366 MHz PII Laptop (the FAMOUS one), and in general I have to say that mplayer is one awesome piece of software.

Keep at it guys.

_________________
覇気元
Eroteme.org
dex and I have both begun messing with yuv2rgb as well. He managed a couple percentage speed up on it, and I managed a few more percentage on top of that. Slowly but surely this beast is speeding up.

A side-note, when running with the gl2 video output, CRM and GL calls are actually the slowest piece of software at this point...of course this means we need to find ways around this - presumably an optimized 32bit RGBA conversion (currently we've been focusing on 24bit yuv to rgb conversion) would allow O2 to take the textures in native format and not have to convert them each frame...
...congratullations Dexter1 and Vegac!, very impressive optimization work!; I'm a lot sure that with the help and investigations of both, the maximum performance level for MPlayer can be achieved on MIPS.
Very anxious to try the NekoWare MPlayer as soon as possible.

...Keep on the good work, both! \;)/
great idea!!

especially the yuv2rgb is one of the top 3 slowdowns.
i'm anxious, too!

_________________
r-a-c.de
I vote this the most exciting thread!

Maybe I can write a simple Motif front-end for the final product....hmmm
squeen wrote:
I vote this the most exciting thread!


Completely agree with you, Squeen! ;)

squeen wrote:
Maybe I can write a simple Motif front-end for the final product....hmmm


...Oh yeah!, a nice GUI could be really great! \:D/
dexter1 wrote:
Beer is in the mail, Pete :P


I should be buying the beer for all this hard work. Incredible job guys - I can't wait to give it a try.

_________________
私のホバークラフト は鰻が一杯です。
squeen wrote:
I vote this the most exciting thread!

Maybe I can write a simple Motif front-end for the final product....hmmm


There is a gui for mplayer but you need to specify it during compiling by using --enable-gui during ./configure . Then, to turn on GUI mode, you have to execute the gmplayer binary. It uses gtk+ 1.2 from what i recall.

_________________
"Ever tried. Ever failed. No matter. Try Again. Fail again. Fail better"
Awesome work!

You should get ahold of Brandon to see if he has any hardware colorspace conversion suggestions. I recall he spent a lot of time and research working on that component of his IRIX DivX player.
So this morning I began a CRM optimized video out plugin for mplayer...

A test movie we've been using (640x480, courtyard.avi from Call of Duty) went from playing in 49 seconds, to 29 seconds - a 20 second speedup!

I need to borrow a bit from lewis' plugin and multithread it - that should make it a good bit faster (hopefully keeping from waiting on the glDrawPixels etc. calls as much). It's far from finished, it only seems to work on O2's as it uses the YCrCb_422_SGIX pixel type. So hey, atleast those of us with O2's may get a speedup, right?
Beautiful stuff, dexter!

Vegac, I wouldn't really sweat the pthreads for O2s - glDrawPixels() is very fast.

Gonna go apply that IDCT patch to my source tree - I'll probably be able to play DVDs at full speed, finally!
Awesome work guys, very fantastic thread.

Now we just need to search that individual that gets all those win codecs running :twisted: :twisted:

Matthias

_________________
Life is what happens while we are making other plans