Nekonomicon - MPlayer 1.0pre4 IDCT Optimisation (LONG POST!)

dexter1
Moderator
Who joined Feb. 20, 2003, 7:57 a.m.
and authored 1875 notes

Wrote the following at July 22, 2004, 6:26 p.m...

Hi all,

Sorry for not being so active on Nekoware builds the last couple of weeks, but i really wanted to take part in Schleusel's and Vegac's attempts in making MPlayer just a little bit faster, so i can watch neato movie stuff on my I2 Impact

. It has cost me a lot of time, but boy, am i glad i spent it with them and MPlayer. I will show you what i did, so maybe you can learn from my trials of getting that app speed up it's framerate. The optimisation is not done yet, it still ongoing, and we only just begun searching out the possibilities of hardware colorspace conversion, but the methods behind the software optimisation part has now been understood. Here it comes. I hope Neko won't mind me breaking the record of longest post ever on Nekochan. Beer is in the mail, Pete

Since my first speedshop run, in the mplayer 1.0pre3 thread viewtopic.php?t=1374 i've read man pages of ssrun just to get myself acquainted with the most used options. Instead of 'ssrun -exp totaltime' or 'ssrun -exp usertime' i now do 'ssrun -exp fpcsampx' to get my Finegrained-ProgramCounter-SAMPling-with-4-bytes timing for all the routines which MPlayer is busy with. So lets do a standard MIPSPro 7.4.2 '-O3 -r10000 -mips4 -n32' build of MPlayer 1.0-pre4, run 'ssrun -exp fpcsampx ./mplayer' with a standard .avi file (Schleusel and i use "courtyard.avi" from a Call Of Duty Demo video (11.4MB), to be found on the net). Machine is an R10K@180MHz Origin200:

Code:

  ssrun -exp fpcsampx ./mplayer -vo null -vf format=rgb24 -nosound courtyard.avi
  
  prof -lines mplayer.fpcsampx.#

Code:

  -------------------------------------------------------------------------
  
  SpeedShop profile listing generated Thu Jul 22 09:54:04 2004
  
  prof -lines mplayer.fpcsampx.m249164
  
  mplayer (n32): Target program
  
  fpcsampx: Experiment name
  
  pc,4,1000,0:cu: Marching orders
  
  R10000 / R10010: CPU / FPU
  
  4: Number of CPUs
  
  180: Clock frequency (MHz.)
  
  Experiment notes--
  
  From file mplayer.fpcsampx.m249164:
  
  Caliper point 0 at target begin, PID 249164
  
  /usr1/local/everdij/MPlayer-1.0pre4/mplayer -vo null -vf format=rgb24 -nosound courtyard.avi
  
  Caliper point 1 at exit(0)
  
  -------------------------------------------------------------------------
  
  Summary of statistical PC sampling data (fpcsampx)--
  
  64010: Total samples
  
  64.010: Accumulated time (secs.)
  
  1.0: Time per sample (msecs.)
  
  4: Sample bin width (bytes)
  
  -------------------------------------------------------------------------
  
  Function list, in descending order by time
  
  -------------------------------------------------------------------------
  
  [index]      secs    %    cum.%   samples  function (dso: file, line)
  
  [1]    21.754  34.0%  34.0%     21754  simple_idct_add (mplayer: simple_idct.c, 399)
  
  [2]    20.301  31.7%  65.7%     20301  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 319)
  
  [3]     4.593   7.2%  72.9%      4593  put_pixels8_c (mplayer: dsputil.c, 897)
  
  [4]     3.701   5.8%  78.7%      3701  simple_idct_put (mplayer: simple_idct.c, 389)
  
  [5]     2.402   3.8%  82.4%      2402  msmpeg4_decode_block (mplayer: msmpeg4.c, 1676)
  
  [6]     1.323   2.1%  84.5%      1323  put_pixels16_xy2_c (mplayer: dsputil.c, 897)
  
  [7]     1.214   1.9%  86.4%      1214  memset (libc.so.1: stat.c, 32; compiled in bzero.s)

and so forth.... Note the total time, which is 64 seconds. One third of the program's time is used in the InverseDiscreteCosineTransform adition, another third in the colorspace converter yuv2rgb_c_24_rgb and the rest in the rest.
So Schleusel, Vegac and me sorta divided up the tasks. Schleusel tried some optimisations flags (-IPA) and porting details, Vegac had a look at the colorspace conversion and possible OpenGL O2/ICE speedups, and i got libavcodec/simple_idct.c :) simple_idct_add is just a small routine, consisting of two parts of each 8 'for' loops, which is basically the way how a double 1D-IDCT works. First the rows are transformed, and then the colums are transformed. The fransformed values are added to the already existing image, hence 'add'. For more info, read up on IDCT:

(1) http://skal.planet-d.net/coding/dct.html
(2) http://www-vs.informatik.uni-ulm.de/bib ... paper.html
(3) http://rnvs.informatik.tu-chemnitz.de/~ ... /IDCT.html

Here's the interesting part:

Code:

  /* signed 16x16 -> 32 multiply add accumulate */
  
  #define W1  22725  //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
  
  #define W2  21407  //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
  
  #define W3  19266  //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
  
  #define W4  16383  //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
  
  #define W5  12873  //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
  
  #define W6  8867   //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
  
  #define W7  4520   //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
  
  #define ROW_SHIFT 11
  
  #define COL_SHIFT 20 // 6
  
  #define MAC16(rt, ra, rb) rt += (ra) * (rb)
  
  /* signed 16x16 -> 32 multiply */
  
  #define MUL16(rt, ra, rb) rt = (ra) * (rb)
  
  static inline void idctSparseColAdd (uint8_t *dest, int line_size,
  
  DCTELEM * col)
  
  {
  
  int a0, a1, a2, a3, b0, b1, b2, b3;
  
  uint8_t *cm = cropTbl + MAX_NEG_CROP;
  
  /* XXX: I did that only to give same values as previous code */
  
  a0 = W4 * (col[8*0] + ((1<<(COL_SHIFT-1))/W4));
  
  a1 = a0;
  
  a2 = a0;
  
  a3 = a0;
  
  a0 +=  + W2*col[8*2];
  
  a1 +=  + W6*col[8*2];
  
  a2 +=  - W6*col[8*2];
  
  a3 +=  - W2*col[8*2];
  
  MUL16(b0, W1, col[8*1]);
  
  MUL16(b1, W3, col[8*1]);
  
  MUL16(b2, W5, col[8*1]);
  
  MUL16(b3, W7, col[8*1]);
  
  MAC16(b0, + W3, col[8*3]);
  
  MAC16(b1, - W7, col[8*3]);
  
  MAC16(b2, - W1, col[8*3]);
  
  MAC16(b3, - W5, col[8*3]);
  
  if(col[8*4]){
  
  a0 += + W4*col[8*4];
  
  a1 += - W4*col[8*4];
  
  a2 += - W4*col[8*4];
  
  a3 += + W4*col[8*4];
  
  }
  
  if (col[8*5]) {
  
  MAC16(b0, + W5, col[8*5]);
  
  MAC16(b1, - W1, col[8*5]);
  
  MAC16(b2, + W7, col[8*5]);
  
  MAC16(b3, + W3, col[8*5]);
  
  }
  
  if(col[8*6]){
  
  a0 += + W6*col[8*6];
  
  a1 += - W2*col[8*6];
  
  a2 += + W2*col[8*6];
  
  a3 += - W6*col[8*6];
  
  }
  
  if (col[8*7]) {
  
  MAC16(b0, + W7, col[8*7]);
  
  MAC16(b1, - W5, col[8*7]);
  
  MAC16(b2, + W3, col[8*7]);
  
  MAC16(b3, - W1, col[8*7]);
  
  }
  
  dest[0] = cm[dest[0] + ((a0 + b0) >> COL_SHIFT)];
  
  dest += line_size;
  
  dest[0] = cm[dest[0] + ((a1 + b1) >> COL_SHIFT)];
  
  dest += line_size;
  
  dest[0] = cm[dest[0] + ((a2 + b2) >> COL_SHIFT)];
  
  dest += line_size;
  
  dest[0] = cm[dest[0] + ((a3 + b3) >> COL_SHIFT)];
  
  dest += line_size;
  
  dest[0] = cm[dest[0] + ((a3 - b3) >> COL_SHIFT)];
  
  dest += line_size;
  
  dest[0] = cm[dest[0] + ((a2 - b2) >> COL_SHIFT)];
  
  dest += line_size;
  
  dest[0] = cm[dest[0] + ((a1 - b1) >> COL_SHIFT)];
  
  dest += line_size;
  
  dest[0] = cm[dest[0] + ((a0 - b0) >> COL_SHIFT)];
  
  }

At first glance i thought, what a lot of branches/decisions! If this routine is to become fast, it has to get rid of all those branches. Careful observing shows that the 'if' statements can be removed safely, because the condition (col[8*x]) is true when col[8*x]!=0. But if col[8*x]==0 then nothing is added or subtraced to the coefficients inside the 'if' statement anyway, so the if statement is superfluous:

Code:

  if(col[8*4]){
  
  a0 += + W4*col[8*4];
  
  a1 += - W4*col[8*4];
  
  a2 += - W4*col[8*4];
  
  a3 += + W4*col[8*4];
  
  }
  
  if (col[8*5]) {
  
  MAC16(b0, + W5, col[8*5]);
  
  MAC16(b1, - W1, col[8*5]);
  
  MAC16(b2, + W7, col[8*5]);
  
  MAC16(b3, + W3, col[8*5]);
  
  }
  
  if(col[8*6]){
  
  a0 += + W6*col[8*6];
  
  a1 += - W2*col[8*6];
  
  a2 += + W2*col[8*6];
  
  a3 += - W6*col[8*6];
  
  }
  
  if (col[8*7]) {
  
  MAC16(b0, + W7, col[8*7]);
  
  MAC16(b1, - W5, col[8*7]);
  
  MAC16(b2, + W3, col[8*7]);
  
  MAC16(b3, - W1, col[8*7]);
  
  }

will become:

Code:

  a0 += + W4*col[8*4];
  
  a1 += - W4*col[8*4];
  
  a2 += - W4*col[8*4];
  
  a3 += + W4*col[8*4];
  
  MAC16(b0, + W5, col[8*5]);
  
  MAC16(b1, - W1, col[8*5]);
  
  MAC16(b2, + W7, col[8*5]);
  
  MAC16(b3, + W3, col[8*5]);
  
  a0 += + W6*col[8*6];
  
  a1 += - W2*col[8*6];
  
  a2 += + W2*col[8*6];
  
  a3 += - W6*col[8*6];
  
  MAC16(b0, + W7, col[8*7]);
  
  MAC16(b1, - W5, col[8*7]);
  
  MAC16(b2, + W3, col[8*7]);
  
  MAC16(b3, - W1, col[8*7]);

So now i have to do more instructions, but will it weigh up to the time spent in those conditions? Answer later.

When reading (1) it becomes clear that these are indeed matrix operations of which 4 of them are so called 'rotations' which involves cosines. The cosine coefficients are all in separate '#defines' and converted to integers, so that makes this routine a 'fast-integer 1D-IDCT' Now write out all the multiplications and coefficients:

Code:

  a0 = W4 * (col[8*0] + ((1<<(COL_SHIFT-1))/W4));
  
  a1 = a0;
  
  a2 = a0;
  
  a3 = a0;
  
  a0 +=  + W2*col[8*2];
  
  a1 +=  + W6*col[8*2];
  
  a2 +=  - W6*col[8*2];
  
  a3 +=  - W2*col[8*2];
  
  MUL16(b0, W1, col[8*1]);
  
  MUL16(b1, W3, col[8*1]);
  
  MUL16(b2, W5, col[8*1]);
  
  MUL16(b3, W7, col[8*1]);
  
  MAC16(b0, + W3, col[8*3]);
  
  MAC16(b1, - W7, col[8*3]);
  
  MAC16(b2, - W1, col[8*3]);
  
  MAC16(b3, - W5, col[8*3]);
  
  a0 += + W4*col[8*4];
  
  a1 += - W4*col[8*4];
  
  a2 += - W4*col[8*4];
  
  a3 += + W4*col[8*4];
  
  MAC16(b0, + W5, col[8*5]);
  
  MAC16(b1, - W1, col[8*5]);
  
  MAC16(b2, + W7, col[8*5]);
  
  MAC16(b3, + W3, col[8*5]);
  
  a0 += + W6*col[8*6];
  
  a1 += - W2*col[8*6];
  
  a2 += + W2*col[8*6];
  
  a3 += - W6*col[8*6];
  
  MAC16(b0, + W7, col[8*7]);
  
  MAC16(b1, - W5, col[8*7]);
  
  MAC16(b2, + W3, col[8*7]);
  
  MAC16(b3, - W1, col[8*7]);

=

Code:

  a0  = W4 * col[8*0] + (1<<(COL_SHIFT-1));
  
  a1  = a0;
  
  a2  = a0;
  
  a3  = a0;
  
  a0 += W4*col[8*4];
  
  a1 -= W4*col[8*4];
  
  a2 -= W4*col[8*4];
  
  a3 += W4*col[8*4];
  
  a0 += col[8*2]*W2;
  
  a1 += col[8*2]*W6;
  
  a2 -= col[8*2]*W6;
  
  a3 -= col[8*2]*W2;
  
  a0 += col[8*6]*W6;
  
  a1 -= col[8*6]*w2;
  
  a2 += col[8*6]*W2;
  
  a3 -= col[8*6]*W6;
  
  b0  = col[8*1]*W1;
  
  b1  = col[8*1]*W3;
  
  b2  = col[8*1]*W5;
  
  b3  = col[8*1]*W7;
  
  b0 += col[8*3]*W3;
  
  b1 -= col[8*3]*W7;
  
  b2 -= col[8*3]*W1;
  
  b3 -= col[8*3]*W5;
  
  b0 += col[8*5]*W5;
  
  b1 -= col[8*5]*W1;
  
  b2 += col[8*5]*W7;
  
  b3 += col[8*5]*W3;
  
  b0 += col[8*7]*W7;
  
  b1 -= col[8*7]*W5;
  
  b2 += col[8*7]*W3;
  
  b3 -= col[8*7]*W1;

=

Code:

  int d0,d2=col[8*2],d4=W4*col[8*4],d6=col[8*6];
  
  int d1=col[8*1],d3=col[8*3],d5=col[8*5],d7=col[8*7];
  
  a0  = W4 * col[8*0] + (1<<(COL_SHIFT-1));
  
  a1  = a0;
  
  a0 += d4;
  
  a1 -= d4;
  
  a3  = a0;
  
  a2  = a1;
  
  a0 += d2*W2 + d6*W6;
  
  a1 += d2*W6 - d6*W2;
  
  a2 +=-d2*W6 + d6*W2;
  
  a3 +=-d2*W2 - d6*W6;
  
  b0  = d1*W1 + d7*W7;
  
  b3  = d1*W7 - d7*W1;
  
  b2  = d1*W5 + d7*W3;
  
  b1  = d1*W3 - d7*W5;
  
  b0 += d3*W3 + d5*W5;
  
  b3 +=-d3*W5 + d5*W3;
  
  b1 +=-d3*W7 - d5*W1;
  
  b2 +=-d3*W1 + d5*W7;

So after a lot of wizardry, i'm left with some fairly symmetric multiplications. Now comes the clever part. Also from (1), A multiplication of the form:

Code:

  t0 = W0 * d0 + W1 * d1;
  
  t1 = W0 * d1 - W1 * d0;

can also be written as:

Code:

  int tmp = W0 * (d0 + d1);
  
  t0 = tmp + (W1 - W0) * d1;
  
  t1 = tmp - (W1 + W0) * d0;

which saves you one expensive multiplication per 2x2 matrix multiplication. This sort of 2x2 matrix multiplication BTW is supposed to be called a Butterfly, because of the butterfly shape of the diagrammatic form.

This is the reason why i had to get rid of those 'if' branches in the beginning. I couldn't have written out those butterflies without those pieces inside an 'if'. In this case it is worthwile, but one always has to test these things carefully.

Enter the SIMD (Single Instruction on Multiple Data) instructions on the MIPS 4 Instruction set.

Code:

  madd  ==>  a = a + b*c
  
  nmsub ==>  a = a - b*c

These instruction perform a multiplication and addition/subtraction in one go! They are only to be found on mips4 instruction sets, so you gotta compile with -mips4 to get them. Also they are only for floats, either singles and doubles, not for integers! :(

BUT, the R10000 is a pretty nifty piece of machinery. It is blessed with both an Integer ALU and floating point ALU and can issue 2 integer ops and 1 floating point ops in one clocktick. So what if we substitute a part of the calculation with floats instead of integers? Maybe the compiler can then weave these two "threads" together, with the added bonus of 'madd'/'nmsub' instructions for the floating point part. So i devised this:

Code:

  #define BUTTERFLY(t0,t1,W0,W1,d0,d1)    \
  
  do {                                    \
  
  int tmp = W0 * (d0 + d1);           \
  
  t0 = tmp + (W1 - W0) * d1;          \
  
  t1 = tmp - (W1 + W0) * d0;          \
  
  } while (0)
  
  #define BUTTERFLY0(t0,t1,W0,W1,d0,d1)   \
  
  do {                                    \
  
  t0 = W0 * d0 + W1 * d1;             \
  
  t1 = W0 * d1 - W1 * d0;             \
  
  } while (0)
  
  #define BUTTERFLYADD0(t0,t1,W0,W1,d0,d1)\
  
  do {                                    \
  
  t0 += W0 * d0 + W1 * d1;            \
  
  t1 += W0 * d1 - W1 * d0;            \
  
  } while (0)
  
  static inline void idctSparseColAdd (uint8_t *dest, int line_size,
  
  DCTELEM * col)
  
  {
  
  int a0, a1, a2, a3;
  
  float b0, b1, b2, b3;
  
  uint8_t *cm = cropTbl + MAX_NEG_CROP;
  
  int d0,d4=W4*col[8*4];
  
  float d1=col[8*1],d3=col[8*3],d5=col[8*5],d7=col[8*7];
  
  /* XXX: I did that only to give same values as previous code */
  
  a0 = W4 * col[8*0] + (1<<(COL_SHIFT-1));
  
  a1 = a0;
  
  BUTTERFLY0(b0,b3,d1,d7,W1,W7);
  
  a0 += d4;
  
  a1 -= d4;
  
  BUTTERFLY0(b2,b1,d1,d7,W5,W3);
  
  a3 = a0;
  
  a2 = a1;
  
  BUTTERFLYADD0(b3,b0,d3,d5,-W5,W3);
  
  BUTTERFLY(d0,d4,col[8*2],col[8*6],W2,W6);
  
  BUTTERFLYADD0(b1,b2,d3,d5,-W7,-W1);
  
  a0 += d0;
  
  a1 += d4;
  
  a2 -= d4;
  
  a3 -= d0;
  
  dest[0] = cm[dest[0] + ((a0 + (int)b0) >> COL_SHIFT)];
  
  dest += line_size;
  
  dest[0] = cm[dest[0] + ((a1 + (int)b1) >> COL_SHIFT)];
  
  dest += line_size;
  
  dest[0] = cm[dest[0] + ((a2 + (int)b2) >> COL_SHIFT)];
  
  dest += line_size;
  
  dest[0] = cm[dest[0] + ((a3 + (int)b3) >> COL_SHIFT)];
  
  dest += line_size;
  
  dest[0] = cm[dest[0] + ((a3 - (int)b3) >> COL_SHIFT)];
  
  dest += line_size;
  
  dest[0] = cm[dest[0] + ((a2 - (int)b2) >> COL_SHIFT)];
  
  dest += line_size;
  
  dest[0] = cm[dest[0] + ((a1 - (int)b1) >> COL_SHIFT)];
  
  dest += line_size;
  
  dest[0] = cm[dest[0] + ((a0 - (int)b0) >> COL_SHIFT)];
  
  }

All the calculation involving the 'b' coefficients are floats and all the 'a's are integers. Got some temp storage variables as well, since the R10000 has a large number of registers, so this never hurts.
The BUTTERFLY0 macro is for the floats, because that gives the compiler the chance to issue the madd instructions and the normal BUTTERFLY macro is for the integers, because of the saving of one multiplication. Oh and the order of macro's for the float series (b) is independent from the integer series (a), which makes it easy for the compiler/CPU to balance the load between the two ALU's

If you look carefully look at the assembly output ('c99 -S'), you can see the weaving of instructions. First the 'old' routine:
Normal

Code:

  # Program Unit: idctSparseColAdd
  
  .ent    idctSparseColAdd
  
  idctSparseColAdd:       # 0x340
  
  .dynsym idctSparseColAdd        sto_default
  
  .frame  $sp, 80, $31
  
  # lgra_spill_temp_9 = 0
  
  # lgra_spill_temp_10 = 8
  
  # lgra_spill_temp_11 = 16
  
  # lgra_spill_temp_12 = 24
  
  # lra_spill_temp_13 = 32
  
  # lra_spill_temp_14 = 40
  
  # lra_spill_temp_15 = 48
  
  # lra_spill_temp_16 = 56
  
  # lra_spill_temp_17 = 64
  
  # lra_spill_temp_18 = 72
  
  .loc    1 255 1
  
  # 251  }
  
  # 252
  
  # 253  static inline void idctSparseColAdd (uint8_t *dest, int line_size,
  
  # 254                                       DCTELEM * col)
  
  # 255  {
  
  .BB1.idctSparseColAdd:  # 0x340
  
  #<freq>
  
  #<freq> BB:1 frequency = 1.00000 (heuristic)
  
  #<freq>
  
  addiu $sp,$sp,-80               # [0]
  
  sd $16,0($sp)                   # [1]  lgra_spill_temp_9
  
  .loc    1 265 9
  
  # 261          a1 = a0;
  
  # 262          a2 = a0;
  
  # 263          a3 = a0;
  
  # 264
  
  # 265          a0 +=  + W2*col[8*2];
  
  lh $16,32($6)                   # [0]  id:201
  
  addiu $10,$0,21407              # [0]
  
  mult $16,$10                    # [2]
  
  .loc    1 266 9
  
  # 266          a1 +=  + W6*col[8*2];
  
  addiu $12,$0,8867               # [1]
  
  .loc    1 255 1
  
  sd $19,24($sp)                  # [2]  lgra_spill_temp_12
  
  .loc    1 265 9
  
  mflo $19                        # [8]
  
  nop                             # [2]
  
  nop                             # [2]
  
  .loc    1 266 9
  
  .
  
  .
  
  .
  
  .loc    1 322 9
  
  lbu $19,0($18)                  # [163]  id:229
  
  sra $24,$24,20                  # [160]
  
  .loc    1 323 1
  
  # 323  }
  
  ld $16,0($sp)                   # [164]  lgra_spill_temp_9
  
  .loc    1 322 9
  
  addu $19,$19,$24                # [165]
  
  addu $17,$17,$19                # [166]
  
  .loc    1 323 1
  
  ld $19,24($sp)                  # [165]  lgra_spill_temp_12
  
  .loc    1 322 9
  
  lbu $17,384($17)                # [168]  id:230 cropTbl+0x0
  
  sb $17,0($18)                   # [169]  id:231
  
  .loc    1 323 1
  
  ld $17,8($sp)                   # [170]  lgra_spill_temp_10
  
  ld $18,16($sp)                  # [171]  lgra_spill_temp_11
  
  jr $31                          # [162]
  
  addiu $sp,$sp,80                # [162]
  
  .end    idctSparseColAdd
  
  .section .text

171 clockticks! Because all the ops are integers, the integer ALU gets extremely busy, thereby stalling the throughput.

Optimised

Code:

  # Program Unit: idctSparseColAdd
  
  .ent    idctSparseColAdd
  
  idctSparseColAdd:       # 0x200
  
  .dynsym idctSparseColAdd        sto_default
  
  .frame  $sp, 0, $31
  
  .loc    1 204 1
  
  # 200  }
  
  # 201
  
  # 202  static inline void idctSparseColAdd (uint8_t *dest, int line_size,
  
  # 203                                       DCTELEM * col)
  
  # 204  {
  
  .BB1.idctSparseColAdd:  # 0x200
  
  #<freq>
  
  #<freq> BB:1 frequency = 1.00000 (heuristic)
  
  #<freq>
  
  .loc    1 209 32
  
  # 205          int a0, a1, a2, a3;
  
  # 206          float b0, b1, b2, b3;
  
  # 207          uint8_t *cm = cropTbl + MAX_NEG_CROP;
  
  # 208          int d0,d4=W4*col[8*4];
  
  # 209          float d1=col[8*1],d3=col[8*3],d5=col[8*5],d7=col[8*7];
  
  lh $8,80($6)                    # [0]  id:169
  
  .loc    1 209 20
  
  lh $10,48($6)                   # [1]  id:168
  
  .loc    1 209 32
  
  mtc1 $8,$f0                     # [2]
  
  .loc    1 209 8
  
  lh $12,16($6)                   # [2]  id:167
  
  .loc    1 209 20
  
  mtc1 $10,$f7                    # [3]
  
  .loc    1 224 9
  
  # 220          a3 = a0;
  
  # 221          a2 = a1;
  
  # 222
  
  # 223          BUTTERFLYADD0(b3,b0,d3,d5,-W5,W3);
  
  # 224          BUTTERFLY(d0,d4,col[8*2],col[8*6],W2,W6);
  
  lh $1,32($6)                    # [3]  id:172
  
  addiu $9,$0,30274               # [1]
  
  .loc    1 209 44
  
  lh $11,112($6)                  # [4]  id:170
  
  .loc    1 209 32
  
  cvt.s.w $f0,$f0                 # [5]
  
  .loc    1 224 9
  
  mult $1,$9                      # [5]
  
  .loc    1 209 8
  
  mtc1 $12,$f1                    # [4]
  
  .
  
  subu $15,$15,$25                # [13]
  
  .loc    1 234 9
  
  madd.s $f9,$f9,$f0,$f6          # [18]                             <== Float op [18], combined with integer op [18]
  
  .loc    1 232 9
  
  trunc.w.s $f13,$f13             # [21]
  
  .loc    1 217 9
  
  subu $2,$2,$3                   # [18]                             <== Integer op [18], combined with float op [18]
  
  .loc    1 236 9
  
  # 235          dest += line_size;
  
  # 236          dest[0] = cm[dest[0] + ((a2 + (int)b2) >> COL_SHIFT)];
  
  mul.s $f6,$f7,$f6               # [19]                             <== Float op [19], combined with integer op [19]
  
  .loc    1 217 9
  
  lui $3,8                        # [15]
  
  addu $2,$15,$2                  # [19]                             <== Integer op [19], combined with float op [19]
  
  .loc    1 224 9
  
  addu $9,$24,$9                  # [18]                             <== Integer op [18], combined with float op [18]
  
  .loc    1 232 9
  
  mfc1 $7,$f13                    # [23]
  
  .loc    1 217 9
  
  addu $2,$2,$3                   # [20]                             <== Integer op [20], combined with float op [20]
  
  .loc    1 236 9
  
  nmsub.s $f6,$f6,$f0,$f11        # [20]                             <== Float op [20], combined with integer op [20]
  
  .loc    1 227 9
  
  .
  
  lbu $8,384($8)                  # [48]  id:193 cropTbl+0x0
  
  .loc    1 245 9
  
  # 245          dest += line_size;
  
  addu $2,$5,$9                   # [45]
  
  .loc    1 244 9
  
  sb $8,0($9)                     # [49]  id:194
  
  .loc    1 246 9
  
  # 246          dest[0] = cm[dest[0] + ((a0 - (int)b0) >> COL_SHIFT)];
  
  subu $4,$6,$7                   # [46]
  
  lbu $3,0($2)                    # [50]  id:195
  
  sra $4,$4,20                    # [47]
  
  addu $3,$3,$4                   # [52]
  
  addu $1,$1,$3                   # [53]
  
  lbu $1,384($1)                  # [54]  id:196 cropTbl+0x0
  
  .loc    1 247 1
  
  # 247  }
  
  jr $31                          # [48]
  
  .loc    1 246 9
  
  sb $1,0($2)                     # [55]  id:197
  
  .end    idctSparseColAdd
  
  .section .text
  
  .align 6

Tadaaa, 55 ticks! Amazing what a little software pipelining can do for your code!

Running speedshop again proves it:

Code:

  
  Summary of statistical PC sampling data (fpcsampx)--
  
  49298: Total samples
  
  49.298: Accumulated time (secs.)
  
  1.0: Time per sample (msecs.)
  
  4: Sample bin width (bytes)
  
  -------------------------------------------------------------------------
  
  Function list, in descending order by time
  
  -------------------------------------------------------------------------
  
  [index]      secs    %    cum.%   samples  function (dso: file, line)
  
  [1]    20.239  41.1%  41.1%     20239  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 319)
  
  [2]     7.177  14.6%  55.6%      7177  idctSparseColAdd (mplayer: simple_idct.c, 209)
  
  [3]     4.454   9.0%  64.6%      4454  put_pixels8_c (mplayer: dsputil.c, 897)
  
  [4]     2.382   4.8%  69.5%      2382  msmpeg4_decode_block (mplayer: msmpeg4.c, 1676)
  
  [5]     1.948   4.0%  73.4%      1948  idctRowCondDC (mplayer: simple_idct.c, 104)
  
  [6]     1.551   3.1%  76.6%      1551  simple_idct_put (mplayer: simple_idct.c, 313)
  
  [7]     1.301   2.6%  79.2%      1301  put_pixels16_xy2_c (mplayer: dsputil.c, 897)

IDCT is a bit more scattered now, three routines instead of one, but added up (idctSparseColAdd +idctRowCondDC + simple_idct_put) gives 21.7% which is half the time of yuv2rgb_c_24_rgb(41.1%). Compared that with the starting speedshop run, this is a 50% reduction in time spent in IDCT! Looking at the total times. 64.0 versus 49.3 seconds is 23% speedup of the app. Whoa! Granted, it's only for this .avi file and this specific IDCT. Other codecs need other routines, but it's a start.

Well, hope you have read through it and picked up some ideas. Next time i'll be looking at yuv2rgb.c software wise.

The patch for libavcodec/simple_idct.c is now living at http://www.mechanics.citg.tudelft.nl/~e ... pre4.patch get it and try it. Schleusel and i will try to pester MPlayer CVS guru's to get us some libavcodec/mips subdirectory where we can store this.

%-)

Dr. Dave
Old Salt
Who joined Feb. 13, 2004, 11:37 p.m.
and authored 1680 notes

Wrote the following at July 22, 2004, 9:39 p.m...

The MIPS CPU architecture is a pretty awesome piece of engineering when programmed correctly - this just goes to show it!

squeen
Administrator
Who joined May 9, 2003, 6:10 a.m.
and authored 2549 notes

Wrote the following at July 23, 2004, 4:02 a.m...

Sheer genius! -- I'm envious of your talent and time.

Thanks for sharing, there's more than a few good optimization tips there!

Also, don't let this good programming get lost! Build a super nekoware mplayer when you are through (patches and all)!

Hakimoto
Moderator
Who joined March 30, 2003, 5:29 a.m.
and authored 2246 notes

Wrote the following at July 23, 2004, 5:18 a.m...

Damn that's cool. I read the whole thing, and I have to say I am very impressed. Go go go for the mplayer optimisation boys!

Shame I still have no TRAM for my Octane... but then I don't play movies on it. I play them on my old crappy 366 MHz PII Laptop (the FAMOUS one), and in general I have to say that mplayer is one awesome piece of software.

Keep at it guys.

_________________
覇気元
Eroteme.org

vegac
Addict
Who joined Jan. 23, 2003, 12:34 p.m.
and authored 646 notes

Wrote the following at July 23, 2004, 9:07 a.m...

dex and I have both begun messing with yuv2rgb as well. He managed a couple percentage speed up on it, and I managed a few more percentage on top of that. Slowly but surely this beast is speeding up.

A side-note, when running with the gl2 video output, CRM and GL calls are actually the slowest piece of software at this point...of course this means we need to find ways around this - presumably an optimized 32bit RGBA conversion (currently we've been focusing on 24bit yuv to rgb conversion) would allow O2 to take the textures in native format and not have to convert them each frame...

GeneratriX
Old Salt
Who joined Oct. 21, 2003, 2:07 a.m.
and authored 3202 notes

Wrote the following at July 23, 2004, 10:32 a.m...

...congratullations Dexter1 and Vegac!, very impressive optimization work!; I'm a lot sure that with the help and investigations of both, the maximum performance level for MPlayer can be achieved on MIPS.
Very anxious to try the NekoWare MPlayer as soon as possible.

...Keep on the good work, both! \;)/

foetz
Old Salt
Who joined April 14, 2003, 4:34 a.m.
and authored 3609 notes

Wrote the following at July 23, 2004, 3:27 p.m...

great idea!!

especially the yuv2rgb is one of the top 3 slowdowns.
i'm anxious, too!

_________________
r-a-c.de

squeen
Administrator
Who joined May 9, 2003, 6:10 a.m.
and authored 2549 notes

Wrote the following at July 23, 2004, 7:57 p.m...

I vote this the most exciting thread!

Maybe I can write a simple Motif front-end for the final product....hmmm

GeneratriX
Old Salt
Who joined Oct. 21, 2003, 2:07 a.m.
and authored 3202 notes

Wrote the following at July 23, 2004, 8:16 p.m...

squeen wrote:

I vote this the most exciting thread!

Completely agree with you, Squeen!

squeen wrote:

Maybe I can write a simple Motif front-end for the final product....hmmm

...Oh yeah!, a nice GUI could be really great! \:D/

nekonoko
Site Admin
Who joined Jan. 23, 2003, 2:31 a.m.
and authored 5418 notes

Wrote the following at July 24, 2004, 2:17 p.m...

dexter1 wrote:

Beer is in the mail, Pete

I should be buying the beer for all this hard work. Incredible job guys - I can't wait to give it a try.

_________________
私のホバークラフトは鰻が一杯です。

Dubhthach
Addict
Who joined Oct. 7, 2003, 8:16 a.m.
and authored 771 notes

Wrote the following at July 25, 2004, 10:20 a.m...

squeen wrote:

I vote this the most exciting thread!

Maybe I can write a simple Motif front-end for the final product....hmmm

There is a gui for mplayer but you need to specify it during compiling by using --enable-gui during ./configure . Then, to turn on GUI mode, you have to execute the gmplayer binary. It uses gtk+ 1.2 from what i recall.

_________________
"Ever tried. Ever failed. No matter. Try Again. Fail again. Fail better"

colin
Addict
Who joined Feb. 7, 2004, 6:43 p.m.
and authored 429 notes

Wrote the following at July 25, 2004, 6:07 p.m...

Awesome work!

You should get ahold of Brandon to see if he has any hardware colorspace conversion suggestions. I recall he spent a lot of time and research working on that component of his IRIX DivX player.

vegac
Addict
Who joined Jan. 23, 2003, 12:34 p.m.
and authored 646 notes

Wrote the following at July 26, 2004, 2:30 p.m...

So this morning I began a CRM optimized video out plugin for mplayer...

A test movie we've been using (640x480, courtyard.avi from Call of Duty) went from playing in 49 seconds, to 29 seconds - a 20 second speedup!

I need to borrow a bit from lewis' plugin and multithread it - that should make it a good bit faster (hopefully keeping from waiting on the glDrawPixels etc. calls as much). It's far from finished, it only seems to work on O2's as it uses the YCrCb_422_SGIX pixel type. So hey, atleast those of us with O2's may get a speedup, right?

lewis
Addict
Who joined Nov. 27, 2003, 1:30 p.m.
and authored 546 notes

Wrote the following at July 28, 2004, 2:41 p.m...

Beautiful stuff, dexter!

Vegac, I wouldn't really sweat the pthreads for O2s - glDrawPixels() is very fast.

Gonna go apply that IDCT patch to my source tree - I'll probably be able to play DVDs at full speed, finally!

Brombear
Addict
Who joined Oct. 5, 2003, 8:42 a.m.
and authored 726 notes

Wrote the following at July 28, 2004, 9:36 p.m...

Awesome work guys, very fantastic thread.

Now we just need to search that individual that gets all those win codecs running :twisted:

Matthias

_________________
Life is what happens while we are making other plans

SGI: Development

MPlayer 1.0pre4 IDCT Optimisation (LONG POST!)