Nekonomicon - dexter1

dexter1
Moderator
From Zoetermeer, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of a few tardists at Nov. 4, 2003, 12:17 a.m...

schleusel wrote: I renamed the tardist and edited the above announcement. Could someone please change the blog entry accordingly?

Just did that.

dexter1
Moderator
From Zoetermeer, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of iconbar at Nov. 20, 2003, 4:22 a.m...

Love the app! (i just installed it from sources on my O2 IRIX 6.5.16m mpro7313)

But there is a problem with xlock screensavers and password locking.

When i revive the machine and want to enter my password to unlock the screen, the screensaver restarts whenever i enter the first character. At first i was stumped, but then i saw iconbar blitting away at the very bottom. So i killed xlock manually and got my desktop back.

Exactly how one affects another i'm not sure, but try running an iconbar yourself, and set screensaver to lock the screen after one minute to see the effect (hopefully)

dexter1
Moderator
From Zoetermeer, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of iconbar at Nov. 23, 2003, 1:58 a.m...

Thanks Squeen for the fix! Haven't tried it yet, cause i'm at home...

Actually i'm gonna try this on my Crimson at home because of my holiday next week. The monster reacted favourably to my self fritted cables and i have a fresh disk ready to install IRIX 6.2 and iconbar. Expect screenshots

dexter1
Moderator
From Zoetermeer, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of 600MHz O2 Is Up And Running!! at Nov. 24, 2003, 2:15 a.m...

Yes, read:
http://www.ff-net.demon.nl/papa/overclo ... eite3.html

The Indy R5K@150MHz and the O2 R5K@180MHz are identical in appearance. Gemm swapped them and both machines were happy with each other's CPU.
BTW, the 180 MHz Indy cpu has a voltage regulator feeding the CPU at 3.6 volt, which the 150 MHz one lacks.

Triox has pioneered Indy overclocking, together with my feeble attempts at overclocking 5 Volt R4K's. We are now focusing on modifying the EEPROM to overclock the R5K@150MHz to 200 MHz. If the O2 can take that, the Indy should work as well.

dexter1
Moderator
From Zoetermeer, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of iconbar at Nov. 30, 2003, 12:42 p.m...

Done: http://www.nekochan.net/wiki/gallery/album21/crimea1

I took the latest CVS last Friday evening. In all the excitement i forgot to test the password locking. #-o Will do that tomorrow on my O2 at work asap.

Cheers

Update: the screenlocking is fixed indeed, thanks so much!

dexter1
Moderator
From Zoetermeer, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of iconbar at Dec. 1, 2003, 3:58 a.m...

.... only thing is that now the icons reappear after i unlock the screensaver by typing in my password. This doesn't happen when the screensaver doesn't lock.

The only way to fix it is to click on the icons and minimize them, which brings them back into the iconbar. A bit annoying if you have to do that at each screensaver unlock.

But i am now convinced that this app will stay on my machine!

dexter1
Moderator
From Zoetermeer, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of iconbar at Dec. 1, 2003, 6:30 a.m...

No this is irix 6.5.19m on my O2 at work. My irix 6.2 Crimson RE is at home.

I've been able to reproduce the problem. It's a multiple desktop problem, but has nothing to do with xlock.
Apparently 'ov' does lots of dirty tricks. Here's how i can reproduce it:

Make two (or three) desks in ov and start iconbar in desk 1. You see the icons on the ov display, but they are not showing on your desktop! If you switch back and forth with the second desk the icons appear as soon as you go back to desk 1.

Hmm. I thought Lisa had issues as well with multiple desktops. I believe she said that if you click on an iconized window from another desktop it will appear on your current one.

dexter1
Moderator
From Voorburg, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of rss-0.7.6 at Dec. 31, 2003, 1:34 a.m...

Say Squeen,

How did you manage to circumvent the nagging about ImageMagick 5.5.1 when configuring? The freeware distro is 5.4.x, which configure doesn't like. And did you build it with gcc or MIPSPro?

dexter1
Moderator
From Voorburg, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of rss-0.7.6 at Dec. 31, 2003, 6:55 a.m...

Sheesh, the horrors MIPSPro compiler developers put us through to get our apps compiled...

Thanks for the answer, i'll attempt an optimized 0.7.6 with MIPSPro and put some of the notexture stuff from Lisa back in it. I'll also try a static build with libbz2, libpng and, who knows, libimagemagick as well.

Just being masochistic today. Wonder why i'm not out there igniting fireworks

dexter1
Moderator
Who joined Feb. 20, 2003, 7:57 a.m.
and authored 1878 notes

Wrote on the subject of Network install of Irix, possible with OS X? at Dec. 31, 2003, 1:15 p.m...

ducks wrote:

I couldn't do a full netinstall, because csh on FreeBSD is not exactly the same as SGI's

Use pdksh instead of csh, i've done several installs, even irix 4.0.1,5.3 and 6.2! I usually follow this recipe for Linux:

Code:

  sysctl -w net.ipv4.ip_no_pmtu_disc=1
  sysctl -w net.ipv4.ip_local_port_range="2048 32767"

  bootptab:
  pippa:ht=1:sm=255.255.255.0:gw=192.168.9.5:ha=080069022996:ip=192.168.9.7

  inetd.conf:
  shell   stream  tcp     nowait  root    /usr/sbin/tcpd  in.rshd -L
  login   stream  tcp     nowait  root    /usr/sbin/tcpd  in.rlogind
  tftp    dgram   udp     wait    root    /usr/sbin/in.tftpd      in.tftpd -p -vv
  bootps  dgram   udp     wait    root    /usr/sbin/bootpd        bootpd

  passwd:
  guest:x:500:100:,,,:/home/guest:/bin/sh

  shadow:
  guest:$1$Wsf0bAiT$Ua/QWYtP6k98G7R8uqQJH/:12329:0:99999:7:::

  hosts:
  192.168.9.7             pippa.sol pippa

  hosts.allow:
  ALL:localhost
  ALL:192.168.9.1
  ALL:192.168.9.7
  ALL:pippa.sol

  /home/guest/.rhosts
  localhost frank guest
  pippa root frank guest

  copy install cd to disk first.

  setenv notape 1
  boot -f bootp()neo:/home2/irix53/stand/fx.IP12 --x

  setenv notape 1
  setenv tapedevice bootp()neo:/home2/irix53/dist/sa
  boot -f $tapedevice(sash.IP12) --m

which contains the necessary instructions to netboot my Personal IRIS with IRIX 5.3. Along with 'ln -s /bin/pdksh /bin/sh'

dexter1
Moderator
From Zoetermeer, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of 2004! at Dec. 31, 2003, 3:59 p.m...

Happy new year to all of you!

:hathat49: :silly: :smilecolros: :drinking:

dexter1
Moderator
From Voorburg, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of rss-0.7.6 at Jan. 2, 2004, 12:18 p.m...

As a matter of fact, yes.

I have built ImageMagick 5.5.7-15 statically without perl, managed to get rss-glx 0.7.6 compiled with MIPSPro 7.3.1.3m, and applied the patch from Lisa's 0.7.4. Only three things needed to be fixed for it to compile on MIPSPro:

Code:

   --- oglc_src/FirePart.h.save    Fri Jan  2 14:22:27 2004
   +++ oglc_src/FirePart.h Fri Jan  2 14:23:38 2004
   @@ -122,7 +122,7 @@
           Particle *p = TblP;     //+1;
    
           n = 0;
   -       float da = pow (FIREDA, dt);
   +       float da = pow ((float)FIREDA, dt);
           //float ds = pow (FIREDS, dt);
           //SVector3D v;
    
   @@ -162,7 +162,7 @@
                                   p->s.x = size;
                                   p->s.y = size;
                                   p->s.z = size;
   -                               p->a = alpha * pow (FIREDA, (t - LastPartTime));
   +                               p->a = alpha * pow ((float)FIREDA, (t - LastPartTime));
                                   p->s *= 0.5f + nrnd (0.5);
                           }
                   } else
   --- reallyslick/cpp_src/skyrocket_smoke.cpp.save        Fri Jan  2 14:31:13 2004
   +++ reallyslick/cpp_src/skyrocket_smoke.cpp     Fri Jan  2 14:31:44 2004
   @@ -17,6 +17,7 @@
     */
    
    #include <stdlib.h>
   +#include <stdio.h>
    #include <GL/gl.h>
    #include <GL/glu.h>
    
   --- reallyslick/cpp_src/skyrocket_world.cpp.save        Fri Jan  2 14:31:30 2004
   +++ reallyslick/cpp_src/skyrocket_world.cpp     Fri Jan  2 14:31:58 2004
   @@ -16,6 +16,7 @@
     * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
     */
    
   +#include <stdio.h>
    #include <math.h>
    #include <GL/gl.h>
    #include <GL/glu.h>

The code was built on my O2 at work, and i haven't had a chance to package it and give it a spin. Will do that as soon as i get to work next Monday. I'm attempting to compile it on my Crimson though. So hold on...

dexter1
Moderator
From Voorburg, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of rss-0.7.6 at Jan. 6, 2004, 1:06 a.m...

Well, all looks good, except for cyclone of course. Lisa's patch was for lattice to run without textures, so everything looks cool, except for sound. I'll bench my stuff along with Squeen's and see if it really makes a difference. And because everybody has an Octane, i'll make it a -mips4 -Ofast=ip30 -IPA build along with a -mips3 build with more general optimisations.

dexter1
Moderator
From Voorburg, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of rss-0.7.6 at Jan. 6, 2004, 4:13 a.m...

It seems that for the lattice, Lisa has added an option -T if you want textures. I'm not sure if that means you get textures only with -T or that you can specify a texture. I'm gonna try this on an XZ Indy to see if that makes a difference. Otherwise we have to hack a bit more or build an unpatched rss build next to a patched one.

dexter1
Moderator
From Zoetermeer, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of XVM and FX problems at Jan. 10, 2004, 1:54 p.m...

Looks like some disk labels have been screwed up.

Best bet is to rerun fx in expert mode ('fx -x') and write a default SGI label onto the disks, then select (o)ption drive in the partition menu.

As for XVM, don't know. Haven't played yet, because XLV is still my preferred way of striping disks, because i know the procedure best.

dexter1
Moderator
From Voorburg, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of rss-0.7.6 at Jan. 10, 2004, 4:03 p.m...

Neko, i've upped two tardists for the MIPSPro build of RSS GLX 0.7.6, with Lisa's patch and my compile patch, in incoming.
The mips3 is a general -Ofast -mips3 -IPA build for all machines
The mips4 is a -Ofast=ip30 -mips4 -IPA build optimized for Octanes, but should run on every R5K and up.
The only prereq is libbz2, which the package checks for.

Let me know if there are any probs. Oh and Cyclone is still broken

dexter1
Moderator
Who joined Feb. 20, 2003, 6:57 a.m.
and authored 2062 notes

Wrote on the subject of Guidelines for "hinv"-Thread at Jan. 12, 2004, 5:09 a.m...

You have a valid point. I will try to reedit that particular posting.

dexter1
Moderator
From Voorburg, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of Making a bootable IRIX DVD at Feb. 29, 2004, 12:44 p.m...

canavan wrote: (2) what's the size limit of EFS anyway?

8 GB, so this will fit nicely on a double layer DVD

dexter1
Moderator
From Voorburg, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of rss-0.7.6 at March 1, 2004, 2:30 p.m...

dexter1 wrote: Oh and Cyclone is still broken

Fixed it!

Take a look at reallyslick/cpp_src/cyclone.cpp:443

Code:

   glColor3f (r, g, b);
   glPushMatrix ();
   glLoadIdentity ();
   glTranslatef (xyz[0], xyz[1], xyz[2]);
   glRotatef (tiltAngle, crossVec[0], crossVec[1], crossVec[2]);
   glRotatef (spinAngle, 0, 1, 0);
   glTranslatef (width * cyWidth, 0, 0);
   if (dStretch)
       glScalef (1.0f, 1.0f, scale);
   glCallList (1);
   glPopMatrix ();

which is part of the main screen update routine: glLoadIdentity resets the current matrix to the identity, the translate/rotate/scale calls then build up the transform, and glCallList executes the display list that holds the primitives to be shown. Note the first glRotatef, which has tiltAngle as its first argument. For some reason this glRotatef causes a no-show on SGI's native display, but as soon as i set the display to a Linux machine with OpenGL support (Matrox) it did work!?!?!
Huh?
After several hours of dissecting, it turns out that tiltAngle and crossVec are products of a machine-code optimised x86 routine with a C++ counterpart for 'other' CPUs like our MIPS. Instead of agonising over assembly i took the easy way out and commented out only that first glRotatef. Voilà

My Crimson sweating out a Twister:

http://www.nekochan.net/wiki/gallery/album21/crimea3

I'll retest it on my O2 tomorrow and if i'm happy i'll pop the corrected tardists (mips3 and mips4) onto Neko's server.

Cheerio

dexter1
Moderator
From Voorburg, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of mips3 pledge at March 12, 2004, 3:27 p.m...

Hello,

i've just uploaded a 75 MB tardist of Qt 3.3.1 to my university server:
http://www.mechanics.citg.tudelft.nl/~e ... s3.tardist
I have uploaded it to neko's ftp server as well. Basically it's a full MIPSPro build in mips3, no dependencies with freeware, and both single and multi threaded libraries exist. It's all in the nekoware format, so you can build the stuff yourself.

FWIW i've built it on a Challenge S irix 6.5.20m with MIPSPro 7.4.1 and POSIX/MIPSPro patches. A small patch to /usr/include/stdlib.h was necessary to build qmake, which is included. From there on it's a breeze

Well, apart from creating the idb file, which took me several hours

More to come ! (KDE 3.2.1)

dexter1
Moderator
From Voorburg, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of mips3 pledge at March 12, 2004, 4:12 p.m...

That's ok. I'm a patient man and my wine bottle is not empty yet :drinking:

dexter1
Moderator
From Voorburg, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of mips3 pledge at March 17, 2004, 7:20 a.m...

In the qt331 tardist i included that exact strtoll as a patch to /usr/include/stdlib.h

strtoll is actually a C99 function, but it is not part of the C++ standard (yet)
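For anyone wondering what the missing function is used for, here is a minimal sketch of a strtoll call with the usual error checks. The helper name parse_ll is made up for illustration, and the sketch assumes a <stdlib.h> that already declares strtoll (on the setup described above that is what the header patch is for); it is not the patch itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse a base-10 64-bit integer with strtoll.
 * Returns 1 on success and stores the value in *out, 0 on any error.
 * Assumes <stdlib.h> exposes a declaration equivalent to
 *   long long strtoll(const char *nptr, char **endptr, int base);  */
int parse_ll(const char *text, long long *out)
{
    char *end = NULL;
    long long value;

    assert(text != NULL && out != NULL);
    errno = 0;
    value = strtoll(text, &end, 10);
    if (end == text || *end != '\0' || errno == ERANGE)
        return 0;   /* no digits, trailing junk, or out of range */
    *out = value;
    return 1;
}
```

A value such as "9000000000" does not fit in a 32-bit long, which is the kind of input that makes code reach for strtoll instead of strtol.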

dexter1
Moderator
Who joined Feb. 20, 2003, 6:57 a.m.
and authored 2062 notes

Wrote on the subject of mips3 pledge at March 22, 2004, 7:11 a.m...

foetz wrote:

got the qt pack but qmake is missing!!!! the whole folder!

Ah woops, the bin/qmake was a link to qmake/qmake. Should have spotted that one. You only need the qmake binary? Or do you also need the qmake dir with object files?

dexter1
Moderator
Who joined Feb. 20, 2003, 6:57 a.m.
and authored 2062 notes

Wrote on the subject of mips3 pledge at March 26, 2004, 7:14 a.m...

I'm currently fixing some symbolic link deficiencies in my qt tardist package.

Be patient; it's a virtue.

dexter1
Moderator
Who joined Feb. 20, 2003, 6:57 a.m.
and authored 2062 notes

Wrote on the subject of mips3 pledge at March 27, 2004, 2:16 a.m...

I've uploaded the new qt331-mips3.tardist onto neko's server and on my mirror http://www.mechanics.citg.tudelft.nl/~e ... s3.tardist

It fixes the absence of qmake and the symbolic links to nonexistent includes, and it now includes the phrasebooks and the templates directory. As Whiter said, that directory does contain two header files with brackets in their names, which chokes the entire swpkg build. I had an idea of fixing that with a postop script/command, but abandoned it. I deleted the brackets from the names and i'll leave it at that, until i have a bright idea.

dexter1
Moderator
Who joined Feb. 20, 2003, 6:57 a.m.
and authored 2062 notes

Wrote on the subject of WARNING - could not mpin memory [when I run lmdd] at March 29, 2004, 1:24 a.m...

I never ran lmdd, but i always use diskperf which does the job just fine. Just gave lmdd a spin and it looks like more of a general timing I/O benchmark program than a disk performance analyzer tool.

IMO ditch lmdd usage for disk performance numbers and go for diskperf instead.

dexter1
Moderator
From Zoetermeer, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of Linux booting on a Octane at June 14, 2004, 3:42 a.m...

Orakel wrote: One of the most interesting usages (ie. Challenge S as firewall) is not possible because GIO64 is not supported, hence no Phobos.

For the record, Challenge S uses GIO32bis, not GIO64. The Mezzanine board is indeed not supported (yet).
Challenge M uses GIO64, BTW.

dexter1
Moderator
From Zoetermeer, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of tightvnc at June 28, 2004, 11:59 p.m...

Unixmuseum, that is quite enough. No personal attacks please.

If someone's idea of system security is sitting behind a NAT and leaving every BSD protocol open, that is their choice. My Crimson at home is also behind a NAT, but i teach myself to compile the latest OpenSSL/OpenSSH and log in only via that. Not only is it good practice to do so, but you also get to know the Crimson quirks of compiling that code, and in general it increases the work needed for an occasional hacker who succeeds in breaking through my firewall.

dexter1
Moderator
From Zoetermeer, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of tightvnc at June 30, 2004, 2:41 a.m...

Yes, i have read the thread.

Look; putting irony or even sarcasm in posts is fine, as long as you use smileys or an obvious joke to express the irony or sarcasm itself. Your post doesn't even contain one. How am i supposed to know then, which remark is irony and which one is not?

Please people, use the smileys! That's what they are there for...

Sorry if my response was a bit harsh, but the conversation was getting very offtopic, and in my view Orakel didn't deserve your reply, which only widens the gap between your points of view.

Also I have received flak for other (non)moderation decisions in the past. That's ok, i can take it. Please please PM me to state your complaints, so i can moderate better in the future.

dexter1
Moderator
From Voorburg, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of hi man i am looking for you but there are to many genious... at July 3, 2004, 6:46 a.m...

Shtoink wrote: I'm no genius, just a lacky...

Fetch me some beer then... :twisted:

dexter1
Moderator
From Zoetermeer, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of MPlayer 1.0pre3 tardist at July 3, 2004, 11:51 a.m...

With Schleusel's patches to mplayer 1.0pre4 i have succeeded in running a speedshop trace. My machine is I2 HI+TRAM 195MHz 384 MB, 6.5.22m+patches, MIPSPro 7.4.2m+patches. Flags were '-O3 -r10000 -mips4 -n32'. Sample file was a fansub of Psychic Academy episode 4.
BTW, when compiling mplayer, do not strip the resulting executable! Then:

Code:

   ssrun -v -exp usertime ./mplayer -vo gl2 -vf format=RGB24 -nosound -benchmark your.avi

which results in a file called mplayer.usertime.somenumbers
Then run prof:

Code:

   prof mplayer.usertime.somenumbers > out.txt

And this is what comes out:

Code:

   -------------------------------------------------------------------------
   SpeedShop profile listing generated Sat Jul  3 20:07:29 2004

   prof mplayer.usertime.m17232

   mplayer (n32): Target program
   usertime: Experiment name
   ut:cu: Marching orders
   R10000 / R10010: CPU / FPU
   1: Number of CPUs
   195: Clock frequency (MHz.)
   Experiment notes--
   From file mplayer.usertime.m17232:
   Caliper point 0 at target begin, PID 17232
   /usr2/local/src/MPlayer-1.0pre4/mplayer -nosound -benchmark psychic_academy_ep04.avi
   Caliper point 1 at exit(0)
   -------------------------------------------------------------------------
   Summary of statistical callstack sampling data (usertime)--
   494: Total Samples
   0: Samples with incomplete traceback
   14.820: Accumulated Time (secs.)
   30.0: Sample interval (msecs.)
   -------------------------------------------------------------------------
   Function list, in descending order by exclusive time
   -------------------------------------------------------------------------
   [index]  excl.secs excl.%   cum.%  incl.secs incl.%    samples  procedure  (dso: file, line)

   [14]      3.810  25.7%   25.7%      3.810  25.7%        127  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 313)
   [20]      2.490  16.8%   42.5%      2.490  16.8%         83  simple_idct_add (mplayer: simple_idct.c, 399)
   [21]      1.530  10.3%   52.8%      1.530  10.3%         51  __ioctl (libc.so.1: stat.c, 32; compiled in ioctl.s)
   [23]      1.410   9.5%   62.3%      1.410   9.5%         47  __glMgrWaitForDMAWrite (libGLcore.so: mgras_pxdma.c, 368)
   [29]      0.930   6.3%   68.6%      0.930   6.3%         31  put_pixels16_l2 (mplayer: dsputil.c, 67)
   [30]      0.810   5.5%   74.1%      0.810   5.5%         27  __glMgrim_Finish (libGLcore.so: mgras_modes.c, 60)
   [33]      0.420   2.8%   76.9%      0.420   2.8%         14  simple_idct_put (mplayer: simple_idct.c, 389)
   [34]      0.420   2.8%   79.8%      0.420   2.8%         14  yuv2rgb_c_24_bgr (mplayer: yuv2rgb.c, 332)
   [32]      0.390   2.6%   82.4%      0.450   3.0%         15  msmpeg4_decode_block (mplayer: msmpeg4.c, 1676)
   [35]      0.300   2.0%   84.4%      0.300   2.0%         10  memset (libc.so.1: stat.c, 32; compiled in bzero.s)
   [7]       0.240   1.6%   86.0%      4.530  30.6%        151  MPV_decode_mb (mplayer: mpegvideo.c, 3093)
   [39]      0.180   1.2%   87.2%      0.180   1.2%          6  simple_idct (mplayer: simple_idct.c, 409)
   [42]      0.150   1.0%   88.3%      0.150   1.0%          5  __write (libc.so.1: flush.c, 58; compiled in write.s)
   [49]      0.120   0.8%   89.1%      0.120   0.8%          4  __read (libc.so.1: malloc.c, 907; compiled in read.s)
   [50]      0.120   0.8%   89.9%      0.120   0.8%          4  ff_h263_update_motion_val (mplayer: h263.c, 614)
   [31]      0.090   0.6%   90.5%      0.690   4.7%         23  msmpeg4v34_decode_mb (mplayer: msmpeg4.c, 1582)
   [63]      0.090   0.6%   91.1%      0.090   0.6%          3  h263_pred_motion (mplayer: h263.c, 1573)
   [64]      0.090   0.6%   91.7%      0.090   0.6%          3  put_no_rnd_pixels16_xy2_c (mplayer: dsputil.c, 897)
   [65]      0.090   0.6%   92.3%      0.090   0.6%          3  _BSD_getime (libc.so.1: flush.c, 58; compiled in BSD_getime.s)
   [28]      0.060   0.4%   92.7%      1.110   7.5%         37  mpeg_motion (mplayer: mpegvideo.c, 2464)
   .
   snip

Amazing!

More than 25% is spent in yuv2rgb, the colorspace conversion. The inverse discrete cosine transform is #2. Looks like we can kick some butt by:
1) Writing a faster colorspace converter routine. MIPS asm, SGI_color_matrix, your momma on a calculator, anything seems better than this one.
2) Speeding up the idct routine. If i can get the SCSL libraries installed, there's a good chance they have some sped-up fast fourier transform routines. Also complib for somewhat older machines is an option.
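To give an idea of what item 1 is about, here is a minimal sketch of a per-pixel integer colorspace conversion. It assumes video-range BT.601 YCbCr input and uses the widely published 8-bit fixed-point coefficients (298, 409, 100, 208, 516); the function names are made up for illustration and this is not the yuv2rgb_c_24_rgb routine from MPlayer, nor a claim about how it works internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp an intermediate value to the 0..255 range of an 8-bit channel. */
static uint8_t clamp_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Convert one video-range BT.601 YCbCr pixel to RGB with 8-bit fixed point.
 * All coefficients are the floating point factors multiplied by 256. */
void yuv_to_rgb24(uint8_t y, uint8_t u, uint8_t v, uint8_t rgb[3])
{
    int c, d, e;

    assert(rgb != NULL);
    c = 298 * ((int)y - 16) + 128;   /* 1.164 * 256, plus rounding term */
    d = (int)u - 128;                /* Cb offset */
    e = (int)v - 128;                /* Cr offset */

    rgb[0] = clamp_u8((c + 409 * e) / 256);            /* R */
    rgb[1] = clamp_u8((c - 100 * d - 208 * e) / 256);  /* G */
    rgb[2] = clamp_u8((c + 516 * d) / 256);            /* B */
}
```

A routine like this still does three multiplies per channel for every pixel, so the usual next steps are lookup tables or letting the graphics hardware do the matrix multiply.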

project!

dexter1
Moderator
From Voorburg, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of firefox and thunderbird icons at July 12, 2004, 5:57 a.m...

jan-jaap wrote: firefox has tag 0x100013, you have to tag your /usr/local/firefox/firefox (or wherever it lives):
cd /usr/local/firefox && tag 0x100013 firefox

Unfortunately, the firefox in /usr/local/firefox/firefox is not a binary but a shell script, so tagging doesn't work. Maybe i'll try a "glob" later tonight...

dexter1
Moderator
From Voorburg, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of firefox and thunderbird icons at July 12, 2004, 10:32 a.m...

Gaaack

I feel like a newbie... You're right, Neko.

i probably typed #0x100013 instead of 0x100013. Can't think of any better excuse

dexter1
Moderator
From Zoetermeer, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of Blender 2.33 is here !!! at July 22, 2004, 3:01 p.m...

ChiaHos wrote: Unfortunately we've hit another snag: The current CVS compiles the solid/qhull collision detection libs, which seem to only want to compile with MipsPro 7.4 (unless somebody knows how to process lines like "#include <cmath>" with MipsPro 7.3.1.3m?).

This has been covered in an old thread about Octave builds on IRIX:
viewtopic.php?t=710

The trick is to get separate CC-isoheaders to augment your MIPSPro 7.3.1.3m. There are also patches from SGI for this, but they are behind support contracts.

Unfortunately the tarball mentioned in the above thread is gone, but, i found an Octave reference:
http://wiki.octave.org/wiki.pl?PaulKienzleIrixConf

And here is the CC-isoheaders link:
http://octave.sourceforge.net/MIPS73-isoheaders.tar.gz

dexter1
Moderator
Who joined Feb. 20, 2003, 7:57 a.m.
and authored 1875 notes

Wrote on the subject of MPlayer 1.0pre4 IDCT Optimisation (LONG POST!) at July 22, 2004, 6:26 p.m...

Hi all,

Sorry for not being so active on Nekoware builds the last couple of weeks, but i really wanted to take part in Schleusel's and Vegac's attempts at making MPlayer just a little bit faster, so i can watch neato movie stuff on my I2 Impact. It has cost me a lot of time, but boy, am i glad i spent it with them and MPlayer. I will show you what i did, so maybe you can learn from my trials of getting that app to speed up its framerate. The optimisation is not done yet, it is still ongoing, and we have only just begun searching out the possibilities of hardware colorspace conversion, but the methods behind the software optimisation part are now understood. Here it comes. I hope Neko won't mind me breaking the record of longest post ever on Nekochan. Beer is in the mail, Pete

Since my first speedshop run, in the mplayer 1.0pre3 thread viewtopic.php?t=1374, i've read the man pages of ssrun to get myself acquainted with the most used options. Instead of 'ssrun -exp totaltime' or 'ssrun -exp usertime' i now do 'ssrun -exp fpcsampx' to get my Finegrained-ProgramCounter-SAMPling-with-4-bytes timing for all the routines which MPlayer is busy with. So let's do a standard MIPSPro 7.4.2 '-O3 -r10000 -mips4 -n32' build of MPlayer 1.0-pre4 and run 'ssrun -exp fpcsampx ./mplayer' with a standard .avi file. Schleusel and i use "courtyard.avi" from a Call Of Duty demo video (11.4 MB), to be found on the net. Machine is an R10K@180MHz Origin200:

Code:

  ssrun -exp fpcsampx ./mplayer -vo null -vf format=rgb24 -nosound courtyard.avi
  prof -lines mplayer.fpcsampx.#

Code:

  -------------------------------------------------------------------------
  SpeedShop profile listing generated Thu Jul 22 09:54:04 2004
  prof -lines mplayer.fpcsampx.m249164
  mplayer (n32): Target program
  fpcsampx: Experiment name
  pc,4,1000,0:cu: Marching orders
  R10000 / R10010: CPU / FPU
  4: Number of CPUs
  180: Clock frequency (MHz.)
  Experiment notes--
  From file mplayer.fpcsampx.m249164:
  Caliper point 0 at target begin, PID 249164
  /usr1/local/everdij/MPlayer-1.0pre4/mplayer -vo null -vf format=rgb24 -nosound courtyard.avi
  Caliper point 1 at exit(0)
  -------------------------------------------------------------------------
  Summary of statistical PC sampling data (fpcsampx)--
  64010: Total samples
  64.010: Accumulated time (secs.)
  1.0: Time per sample (msecs.)
  4: Sample bin width (bytes)
  -------------------------------------------------------------------------
  Function list, in descending order by time
  -------------------------------------------------------------------------
  [index]      secs    %    cum.%   samples  function (dso: file, line)
  [1]    21.754  34.0%  34.0%     21754  simple_idct_add (mplayer: simple_idct.c, 399)
  [2]    20.301  31.7%  65.7%     20301  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 319)
  [3]     4.593   7.2%  72.9%      4593  put_pixels8_c (mplayer: dsputil.c, 897)
  [4]     3.701   5.8%  78.7%      3701  simple_idct_put (mplayer: simple_idct.c, 389)
  [5]     2.402   3.8%  82.4%      2402  msmpeg4_decode_block (mplayer: msmpeg4.c, 1676)
  [6]     1.323   2.1%  84.5%      1323  put_pixels16_xy2_c (mplayer: dsputil.c, 897)
  [7]     1.214   1.9%  86.4%      1214  memset (libc.so.1: stat.c, 32; compiled in bzero.s)

and so forth.... Note the total time, which is 64 seconds. One third of the program's time is used in the InverseDiscreteCosineTransform addition, another third in the colorspace converter yuv2rgb_c_24_rgb and the rest in the rest.
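The percentages in the listing follow directly from the sample counts, and the same numbers give a rough upper bound on what optimising one routine can buy. The sketch below is only back-of-the-envelope arithmetic (Amdahl's law); the function names are made up for illustration and nothing here is part of SpeedShop.

```c
#include <assert.h>

/* Share of the run spent in one function, as a percentage of all samples. */
double sample_share_percent(long samples, long total_samples)
{
    assert(total_samples > 0 && samples >= 0 && samples <= total_samples);
    return 100.0 * (double)samples / (double)total_samples;
}

/* Estimated run time if one function becomes `speedup` times faster while
 * everything else stays the same (Amdahl's law). */
double estimated_total_secs(double total_secs, double func_secs, double speedup)
{
    assert(speedup > 0.0 && func_secs >= 0.0 && func_secs <= total_secs);
    return total_secs - func_secs + func_secs / speedup;
}
```

With the figures above, sample_share_percent(21754, 64010) comes out at roughly 34%, and estimated_total_secs(64.010, 21.754, 2.0) at roughly 53.1 seconds: making simple_idct_add twice as fast saves about 11 of the 64 seconds, and even an infinitely fast version would still leave about 42.3 seconds.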
So Schleusel, Vegac and i sorta divided up the tasks. Schleusel tried some optimisation flags (-IPA) and porting details, Vegac had a look at the colorspace conversion and possible OpenGL O2/ICE speedups, and i got libavcodec/simple_idct.c :) simple_idct_add is just a small routine, consisting of two 'for' loops of 8 iterations each, which is basically how a double 1D-IDCT works. First the rows are transformed, and then the columns are transformed. The transformed values are added to the already existing image, hence 'add'. For more info, read up on IDCT:

(1) http://skal.planet-d.net/coding/dct.html
(2) http://www-vs.informatik.uni-ulm.de/bib ... paper.html
(3) http://rnvs.informatik.tu-chemnitz.de/~ ... /IDCT.html
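The rows-then-columns idea can be sketched with a plain floating-point reference transform. This is a textbook 8-point inverse DCT applied twice, not the integer code from simple_idct.c; the function names are made up, and the cosine values are written out as literals only so that the sketch needs no math library.

```c
#include <assert.h>

/* cos(k*pi/16) for k = 0..8; other angles follow by symmetry. */
static const double COS16[9] = {
    1.0, 0.98078528040, 0.92387953251, 0.83146961230,
    0.70710678119, 0.55557023302, 0.38268343237,
    0.19509032202, 0.0
};

/* Return cos(k*pi/16) for any non-negative integer k. */
static double cos_pi16(int k)
{
    assert(k >= 0);
    k %= 32;                          /* the period 2*pi is 32 steps */
    if (k > 16)
        k = 32 - k;                   /* cos(2*pi - x) = cos(x) */
    return (k > 8) ? -COS16[16 - k]   /* cos(pi - x) = -cos(x) */
                   : COS16[k];
}

/* 8-point 1D inverse DCT on data[0], data[stride], ..., done in place. */
static void idct_1d(double *data, int stride)
{
    double out[8];
    int x, u;

    for (x = 0; x < 8; x++) {
        double sum = 0.0;
        for (u = 0; u < 8; u++) {
            double cu = (u == 0) ? COS16[4] : 1.0;   /* 1/sqrt(2) for DC */
            sum += cu * data[u * stride] * cos_pi16((2 * x + 1) * u);
        }
        out[x] = sum / 2.0;
    }
    for (x = 0; x < 8; x++)
        data[x * stride] = out[x];
}

/* 2D inverse DCT of an 8x8 block as two 1D passes: rows, then columns. */
void idct_8x8(double block[64])
{
    int i;

    for (i = 0; i < 8; i++)
        idct_1d(block + 8 * i, 1);    /* row i: consecutive elements */
    for (i = 0; i < 8; i++)
        idct_1d(block + i, 8);        /* column i: every 8th element */
}
```

A block with only the DC coefficient set to 8 comes out as a flat block of ones, which is a quick sanity check for the scaling.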

Here's the interesting part:

Code:

  /* signed 16x16 -> 32 multiply add accumulate */
  #define W1  22725  //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
  #define W2  21407  //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
  #define W3  19266  //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
  #define W4  16383  //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
  #define W5  12873  //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
  #define W6  8867   //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
  #define W7  4520   //cos(i*M_PI/16)*sqrt(2)*(1<<14) + 0.5
  #define ROW_SHIFT 11
  #define COL_SHIFT 20 // 6

  #define MAC16(rt, ra, rb) rt += (ra) * (rb)

  /* signed 16x16 -> 32 multiply */
  #define MUL16(rt, ra, rb) rt = (ra) * (rb)

  static inline void idctSparseColAdd (uint8_t *dest, int line_size,
                                       DCTELEM * col)
  {
      int a0, a1, a2, a3, b0, b1, b2, b3;
      uint8_t *cm = cropTbl + MAX_NEG_CROP;

      /* XXX: I did that only to give same values as previous code */
      a0 = W4 * (col[8*0] + ((1<<(COL_SHIFT-1))/W4));
      a1 = a0;
      a2 = a0;
      a3 = a0;

      a0 +=  + W2*col[8*2];
      a1 +=  + W6*col[8*2];
      a2 +=  - W6*col[8*2];
      a3 +=  - W2*col[8*2];

      MUL16(b0, W1, col[8*1]);
      MUL16(b1, W3, col[8*1]);
      MUL16(b2, W5, col[8*1]);
      MUL16(b3, W7, col[8*1]);

      MAC16(b0, + W3, col[8*3]);
      MAC16(b1, - W7, col[8*3]);
      MAC16(b2, - W1, col[8*3]);
      MAC16(b3, - W5, col[8*3]);

      if(col[8*4]){
          a0 += + W4*col[8*4];
          a1 += - W4*col[8*4];
          a2 += - W4*col[8*4];
          a3 += + W4*col[8*4];
      }

      if (col[8*5]) {
          MAC16(b0, + W5, col[8*5]);
          MAC16(b1, - W1, col[8*5]);
          MAC16(b2, + W7, col[8*5]);
          MAC16(b3, + W3, col[8*5]);
      }

      if(col[8*6]){
          a0 += + W6*col[8*6];
          a1 += - W2*col[8*6];
          a2 += + W2*col[8*6];
          a3 += - W6*col[8*6];
      }

      if (col[8*7]) {
          MAC16(b0, + W7, col[8*7]);
          MAC16(b1, - W5, col[8*7]);
          MAC16(b2, + W3, col[8*7]);
          MAC16(b3, - W1, col[8*7]);
      }

      dest[0] = cm[dest[0] + ((a0 + b0) >> COL_SHIFT)];
      dest += line_size;
      dest[0] = cm[dest[0] + ((a1 + b1) >> COL_SHIFT)];
      dest += line_size;
      dest[0] = cm[dest[0] + ((a2 + b2) >> COL_SHIFT)];
      dest += line_size;
      dest[0] = cm[dest[0] + ((a3 + b3) >> COL_SHIFT)];
      dest += line_size;
      dest[0] = cm[dest[0] + ((a3 - b3) >> COL_SHIFT)];
      dest += line_size;
      dest[0] = cm[dest[0] + ((a2 - b2) >> COL_SHIFT)];
      dest += line_size;
      dest[0] = cm[dest[0] + ((a1 - b1) >> COL_SHIFT)];
      dest += line_size;
      dest[0] = cm[dest[0] + ((a0 - b0) >> COL_SHIFT)];
  }

At first glance i thought, what a lot of branches/decisions! If this routine is to become fast, it has to get rid of all those branches. Careful observation shows that the 'if' statements can be removed safely: the condition (col[8*x]) is true when col[8*x]!=0, but if col[8*x]==0 then every product inside the 'if' is zero, so nothing is added to or subtracted from the coefficients anyway. The 'if' statement is superfluous:

Code:

  if(col[8*4]){
  
  a0 += + W4*col[8*4];
  
  a1 += - W4*col[8*4];
  
  a2 += - W4*col[8*4];
  
  a3 += + W4*col[8*4];
  
  }
  
  if (col[8*5]) {
  
  MAC16(b0, + W5, col[8*5]);
  
  MAC16(b1, - W1, col[8*5]);
  
  MAC16(b2, + W7, col[8*5]);
  
  MAC16(b3, + W3, col[8*5]);
  
  }
  
  if(col[8*6]){
  
  a0 += + W6*col[8*6];
  
  a1 += - W2*col[8*6];
  
  a2 += + W2*col[8*6];
  
  a3 += - W6*col[8*6];
  
  }
  
  if (col[8*7]) {
  
  MAC16(b0, + W7, col[8*7]);
  
  MAC16(b1, - W5, col[8*7]);
  
  MAC16(b2, + W3, col[8*7]);
  
  MAC16(b3, - W1, col[8*7]);
  
  }

will become:

Code:

  a0 += + W4*col[8*4];
  
  a1 += - W4*col[8*4];
  
  a2 += - W4*col[8*4];
  
  a3 += + W4*col[8*4];
  
  MAC16(b0, + W5, col[8*5]);
  
  MAC16(b1, - W1, col[8*5]);
  
  MAC16(b2, + W7, col[8*5]);
  
  MAC16(b3, + W3, col[8*5]);
  
  a0 += + W6*col[8*6];
  
  a1 += - W2*col[8*6];
  
  a2 += + W2*col[8*6];
  
  a3 += - W6*col[8*6];
  
  MAC16(b0, + W7, col[8*7]);
  
  MAC16(b1, - W5, col[8*7]);
  
  MAC16(b2, + W3, col[8*7]);
  
  MAC16(b3, - W1, col[8*7]);

So now i have to execute more instructions, but will that cost more than the time saved on those conditions? Answer later.
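To convince yourself that dropping the 'if' is safe, here is a minimal sketch that does the same accumulate with and without the test. The names acc_guarded and acc_unguarded are made up for this example, they are not in simple_idct.c. Both versions should give identical results for every input, zero included:

```c
#include <stdint.h>

#define W4 16383  /* same constant as in the listing above */

/* Original style: skip the multiply when the coefficient is zero. */
static int acc_guarded(int a, int16_t c)
{
    if (c)
        a += W4 * c;
    return a;
}

/* Branch-free style: always multiply; a zero coefficient adds zero. */
static int acc_unguarded(int a, int16_t c)
{
    a += W4 * c;
    return a;
}
```

The only thing that changes is the amount of work done for zero coefficients, not the result.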

When reading (1) it becomes clear that these are indeed matrix operations, 4 of which are so-called 'rotations' involving cosines. The cosine coefficients are all in separate '#defines' and converted to integers, which makes this routine a 'fast-integer 1D-IDCT'. Now write out all the multiplications and coefficients:

Code:

  a0 = W4 * (col[8*0] + ((1<<(COL_SHIFT-1))/W4));
  
  a1 = a0;
  
  a2 = a0;
  
  a3 = a0;
  
  a0 +=  + W2*col[8*2];
  
  a1 +=  + W6*col[8*2];
  
  a2 +=  - W6*col[8*2];
  
  a3 +=  - W2*col[8*2];
  
  MUL16(b0, W1, col[8*1]);
  
  MUL16(b1, W3, col[8*1]);
  
  MUL16(b2, W5, col[8*1]);
  
  MUL16(b3, W7, col[8*1]);
  
  MAC16(b0, + W3, col[8*3]);
  
  MAC16(b1, - W7, col[8*3]);
  
  MAC16(b2, - W1, col[8*3]);
  
  MAC16(b3, - W5, col[8*3]);
  
  a0 += + W4*col[8*4];
  
  a1 += - W4*col[8*4];
  
  a2 += - W4*col[8*4];
  
  a3 += + W4*col[8*4];
  
  MAC16(b0, + W5, col[8*5]);
  
  MAC16(b1, - W1, col[8*5]);
  
  MAC16(b2, + W7, col[8*5]);
  
  MAC16(b3, + W3, col[8*5]);
  
  a0 += + W6*col[8*6];
  
  a1 += - W2*col[8*6];
  
  a2 += + W2*col[8*6];
  
  a3 += - W6*col[8*6];
  
  MAC16(b0, + W7, col[8*7]);
  
  MAC16(b1, - W5, col[8*7]);
  
  MAC16(b2, + W3, col[8*7]);
  
  MAC16(b3, - W1, col[8*7]);

=

Code:

  a0  = W4 * col[8*0] + (1<<(COL_SHIFT-1));
  
  a1  = a0;
  
  a2  = a0;
  
  a3  = a0;
  
  a0 += W4*col[8*4];
  
  a1 -= W4*col[8*4];
  
  a2 -= W4*col[8*4];
  
  a3 += W4*col[8*4];
  
  a0 += col[8*2]*W2;
  
  a1 += col[8*2]*W6;
  
  a2 -= col[8*2]*W6;
  
  a3 -= col[8*2]*W2;
  
  a0 += col[8*6]*W6;
  
  a1 -= col[8*6]*W2;
  
  a2 += col[8*6]*W2;
  
  a3 -= col[8*6]*W6;
  
  b0  = col[8*1]*W1;
  
  b1  = col[8*1]*W3;
  
  b2  = col[8*1]*W5;
  
  b3  = col[8*1]*W7;
  
  b0 += col[8*3]*W3;
  
  b1 -= col[8*3]*W7;
  
  b2 -= col[8*3]*W1;
  
  b3 -= col[8*3]*W5;
  
  b0 += col[8*5]*W5;
  
  b1 -= col[8*5]*W1;
  
  b2 += col[8*5]*W7;
  
  b3 += col[8*5]*W3;
  
  b0 += col[8*7]*W7;
  
  b1 -= col[8*7]*W5;
  
  b2 += col[8*7]*W3;
  
  b3 -= col[8*7]*W1;

=

Code:

  int d0,d2=col[8*2],d4=W4*col[8*4],d6=col[8*6];
  
  int d1=col[8*1],d3=col[8*3],d5=col[8*5],d7=col[8*7];
  
  a0  = W4 * col[8*0] + (1<<(COL_SHIFT-1));
  
  a1  = a0;
  
  a0 += d4;
  
  a1 -= d4;
  
  a3  = a0;
  
  a2  = a1;
  
  a0 += d2*W2 + d6*W6;
  
  a1 += d2*W6 - d6*W2;
  
  a2 +=-d2*W6 + d6*W2;
  
  a3 +=-d2*W2 - d6*W6;
  
  b0  = d1*W1 + d7*W7;
  
  b3  = d1*W7 - d7*W1;
  
  b2  = d1*W5 + d7*W3;
  
  b1  = d1*W3 - d7*W5;
  
  b0 += d3*W3 + d5*W5;
  
  b3 +=-d3*W5 + d5*W3;
  
  b1 +=-d3*W7 - d5*W1;
  
  b2 +=-d3*W1 + d5*W7;
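One detail worth checking in the first line of the rewrite: the original adds the rounding term as W4 * ((1<<(COL_SHIFT-1))/W4), and the integer division truncates, so it is not bit-identical to adding (1<<(COL_SHIFT-1)) directly. A small sketch to see the size of the difference (the function names are hypothetical, only W4 and COL_SHIFT come from the listing):

```c
#define W4        16383
#define COL_SHIFT 20

/* Rounding term as the original code computes it (integer division truncates). */
static int round_original(void)
{
    return W4 * ((1 << (COL_SHIFT - 1)) / W4);
}

/* Rounding term after the rewrite. */
static int round_rewritten(void)
{
    return 1 << (COL_SHIFT - 1);
}
```

With these constants the two terms differ by 32 before the final >> COL_SHIFT, so an output value can only change when the sum sits less than 32 below a multiple of 1<<COL_SHIFT. If bit-exact output against the old routine matters, keeping the original expression is the safer choice.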

So after a lot of wizardry, i'm left with some fairly symmetric multiplications. Now comes the clever part. Also from (1): a multiplication of the form:

Code:

  t0 = W0 * d0 + W1 * d1;
  
  t1 = W0 * d1 - W1 * d0;

can also be written as:

Code:

  int tmp = W0 * (d0 + d1);
  
  t0 = tmp + (W1 - W0) * d1;
  
  t1 = tmp - (W1 + W0) * d0;

which saves you one expensive multiplication per 2x2 matrix multiplication. This sort of 2x2 matrix multiplication BTW is supposed to be called a Butterfly, because of the butterfly shape of the diagrammatic form.
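A quick way to check the identity is to code both forms and compare them. This is only a sketch with made-up function names (rot4, rot3), using plain int arguments instead of the macro:

```c
/* Straightforward form: four multiplications. */
static void rot4(int w0, int w1, int d0, int d1, int *t0, int *t1)
{
    *t0 = w0 * d0 + w1 * d1;
    *t1 = w0 * d1 - w1 * d0;
}

/* Rewritten form: three multiplications, a few more adds. */
static void rot3(int w0, int w1, int d0, int d1, int *t0, int *t1)
{
    int tmp = w0 * (d0 + d1);
    *t0 = tmp + (w1 - w0) * d1;
    *t1 = tmp - (w1 + w0) * d0;
}
```

The saving only materialises when one pair of operands is constant, so that a sum or difference can be folded at compile time. In the optimised assembly listing further down, the constant 30274 is exactly W2 + W6 = 21407 + 8867.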

This is the reason why i had to get rid of those 'if' branches in the beginning. I couldn't have written out those butterflies with those pieces still inside an 'if'. In this case it is worthwhile, but one always has to test these things carefully.

Enter the multiply-add instructions of the MIPS IV instruction set.

Code:

  madd  ==>  a = a + b*c
  
  nmsub ==>  a = a - b*c

These instructions perform a multiplication and an addition/subtraction in one go! They are only found in the MIPS IV instruction set, so you gotta compile with -mips4 to get them. Also they are only for floats, both singles and doubles, not for integers! :(
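In C terms, these are the expression shapes the compiler has to recognise. A minimal sketch (madd_like and nmsub_like are hypothetical names; whether a given compiler really emits madd.s/nmsub.s for them depends on the compiler and its flags, so check the assembly output):

```c
/* a + b*c: the shape that madd.s / madd.d computes. */
static float madd_like(float a, float b, float c)
{
    return a + b * c;
}

/* a - b*c: the shape that nmsub.s / nmsub.d computes. */
static float nmsub_like(float a, float b, float c)
{
    return a - b * c;
}
```

The same expressions written with int operands compile to a separate multiply and add, which is the whole reason for moving part of the work to floats.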

BUT, the R10000 is a pretty nifty piece of machinery. It is blessed with both an integer ALU and a floating point ALU and can issue 2 integer ops and 1 floating point op in one clocktick. So what if we do a part of the calculation with floats instead of integers? Maybe the compiler can then weave these two "threads" together, with the added bonus of 'madd'/'nmsub' instructions for the floating point part. So i devised this:

Code:

  #define BUTTERFLY(t0,t1,W0,W1,d0,d1)    \
  do {                                    \
      int tmp = W0 * (d0 + d1);           \
      t0 = tmp + (W1 - W0) * d1;          \
      t1 = tmp - (W1 + W0) * d0;          \
  } while (0)

  #define BUTTERFLY0(t0,t1,W0,W1,d0,d1)   \
  do {                                    \
      t0 = W0 * d0 + W1 * d1;             \
      t1 = W0 * d1 - W1 * d0;             \
  } while (0)

  #define BUTTERFLYADD0(t0,t1,W0,W1,d0,d1)\
  do {                                    \
      t0 += W0 * d0 + W1 * d1;            \
      t1 += W0 * d1 - W1 * d0;            \
  } while (0)

  static inline void idctSparseColAdd (uint8_t *dest, int line_size,
                                       DCTELEM * col)
  {
      int a0, a1, a2, a3;
      float b0, b1, b2, b3;
      uint8_t *cm = cropTbl + MAX_NEG_CROP;
      int d0,d4=W4*col[8*4];
      float d1=col[8*1],d3=col[8*3],d5=col[8*5],d7=col[8*7];

      /* XXX: I did that only to give same values as previous code */
      a0 = W4 * col[8*0] + (1<<(COL_SHIFT-1));
      a1 = a0;

      BUTTERFLY0(b0,b3,d1,d7,W1,W7);
      a0 += d4;
      a1 -= d4;
      BUTTERFLY0(b2,b1,d1,d7,W5,W3);
      a3 = a0;
      a2 = a1;

      BUTTERFLYADD0(b3,b0,d3,d5,-W5,W3);
      BUTTERFLY(d0,d4,col[8*2],col[8*6],W2,W6);
      BUTTERFLYADD0(b1,b2,d3,d5,-W7,-W1);
      a0 += d0;
      a1 += d4;
      a2 -= d4;
      a3 -= d0;

      dest[0] = cm[dest[0] + ((a0 + (int)b0) >> COL_SHIFT)];
      dest += line_size;
      dest[0] = cm[dest[0] + ((a1 + (int)b1) >> COL_SHIFT)];
      dest += line_size;
      dest[0] = cm[dest[0] + ((a2 + (int)b2) >> COL_SHIFT)];
      dest += line_size;
      dest[0] = cm[dest[0] + ((a3 + (int)b3) >> COL_SHIFT)];
      dest += line_size;
      dest[0] = cm[dest[0] + ((a3 - (int)b3) >> COL_SHIFT)];
      dest += line_size;
      dest[0] = cm[dest[0] + ((a2 - (int)b2) >> COL_SHIFT)];
      dest += line_size;
      dest[0] = cm[dest[0] + ((a1 - (int)b1) >> COL_SHIFT)];
      dest += line_size;
      dest[0] = cm[dest[0] + ((a0 - (int)b0) >> COL_SHIFT)];
  }

All the calculations involving the 'b' coefficients are floats and all the 'a's are integers. I added some temp storage variables as well; the R10000 has a large number of registers, so this never hurts.
The BUTTERFLY0 macro is for the floats, because that gives the compiler the chance to issue the madd instructions, and the normal BUTTERFLY macro is for the integers, because it saves one multiplication. Oh, and the order of macros for the float series (b) is independent of the integer series (a), which makes it easy for the compiler/CPU to balance the load between the two ALUs.
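Since the 'b' terms are now computed in single precision, it is worth comparing them against the integer version on some test data. The sketch below does that for b0 only. The function names are made up, and W1 = 22725 is an assumption taken from the stock simple_idct.c (it is defined above the part of the listing shown here), so check it against your tree:

```c
#include <stdint.h>

#define W1 22725  /* assumed value, see lead-in */
#define W3 19266
#define W5 12873
#define W7 4520

/* b0 exactly as the integer routine computes it. */
static int b0_int(int16_t d1, int16_t d3, int16_t d5, int16_t d7)
{
    return W1 * d1 + W3 * d3 + W5 * d5 + W7 * d7;
}

/* b0 computed in single precision, then truncated like the (int) cast. */
static int b0_float(int16_t d1, int16_t d3, int16_t d5, int16_t d7)
{
    float f1 = d1, f3 = d3, f5 = d5, f7 = d7;
    float b0 = f1 * W1 + f7 * W7;   /* BUTTERFLY0 part */
    b0 += f3 * W3 + f5 * W5;        /* BUTTERFLYADD0 part */
    return (int)b0;
}
```

A float has a 24-bit significand, so the two agree exactly as long as every intermediate value stays below 16777216 in magnitude. For larger coefficients the low bits of the float result get rounded. Those bits are shifted out by COL_SHIFT afterwards, so the effect on the pixels should be small, but it is something to test rather than assume.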

If you look carefully at the assembly output ('c99 -S'), you can see the weaving of instructions. First the 'old' routine:
Normal

Code:

  # Program Unit: idctSparseColAdd
  
  .ent    idctSparseColAdd
  
  idctSparseColAdd:       # 0x340
  
  .dynsym idctSparseColAdd        sto_default
  
  .frame  $sp, 80, $31
  
  # lgra_spill_temp_9 = 0
  
  # lgra_spill_temp_10 = 8
  
  # lgra_spill_temp_11 = 16
  
  # lgra_spill_temp_12 = 24
  
  # lra_spill_temp_13 = 32
  
  # lra_spill_temp_14 = 40
  
  # lra_spill_temp_15 = 48
  
  # lra_spill_temp_16 = 56
  
  # lra_spill_temp_17 = 64
  
  # lra_spill_temp_18 = 72
  
  .loc    1 255 1
  
  # 251  }
  
  # 252
  
  # 253  static inline void idctSparseColAdd (uint8_t *dest, int line_size,
  
  # 254                                       DCTELEM * col)
  
  # 255  {
  
  .BB1.idctSparseColAdd:  # 0x340
  
  #<freq>
  
  #<freq> BB:1 frequency = 1.00000 (heuristic)
  
  #<freq>
  
  addiu $sp,$sp,-80               # [0]
  
  sd $16,0($sp)                   # [1]  lgra_spill_temp_9
  
  .loc    1 265 9
  
  # 261          a1 = a0;
  
  # 262          a2 = a0;
  
  # 263          a3 = a0;
  
  # 264
  
  # 265          a0 +=  + W2*col[8*2];
  
  lh $16,32($6)                   # [0]  id:201
  
  addiu $10,$0,21407              # [0]
  
  mult $16,$10                    # [2]
  
  .loc    1 266 9
  
  # 266          a1 +=  + W6*col[8*2];
  
  addiu $12,$0,8867               # [1]
  
  .loc    1 255 1
  
  sd $19,24($sp)                  # [2]  lgra_spill_temp_12
  
  .loc    1 265 9
  
  mflo $19                        # [8]
  
  nop                             # [2]
  
  nop                             # [2]
  
  .loc    1 266 9
  
  .
  
  .
  
  .
  
  .loc    1 322 9
  
  lbu $19,0($18)                  # [163]  id:229
  
  sra $24,$24,20                  # [160]
  
  .loc    1 323 1
  
  # 323  }
  
  ld $16,0($sp)                   # [164]  lgra_spill_temp_9
  
  .loc    1 322 9
  
  addu $19,$19,$24                # [165]
  
  addu $17,$17,$19                # [166]
  
  .loc    1 323 1
  
  ld $19,24($sp)                  # [165]  lgra_spill_temp_12
  
  .loc    1 322 9
  
  lbu $17,384($17)                # [168]  id:230 cropTbl+0x0
  
  sb $17,0($18)                   # [169]  id:231
  
  .loc    1 323 1
  
  ld $17,8($sp)                   # [170]  lgra_spill_temp_10
  
  ld $18,16($sp)                  # [171]  lgra_spill_temp_11
  
  jr $31                          # [162]
  
  addiu $sp,$sp,80                # [162]
  
  .end    idctSparseColAdd
  
  .section .text

171 clockticks! Because all the ops are integers, the integer ALU gets extremely busy, thereby stalling the throughput.

Optimised

Code:

  # Program Unit: idctSparseColAdd
  
  .ent    idctSparseColAdd
  
  idctSparseColAdd:       # 0x200
  
  .dynsym idctSparseColAdd        sto_default
  
  .frame  $sp, 0, $31
  
  .loc    1 204 1
  
  # 200  }
  
  # 201
  
  # 202  static inline void idctSparseColAdd (uint8_t *dest, int line_size,
  
  # 203                                       DCTELEM * col)
  
  # 204  {
  
  .BB1.idctSparseColAdd:  # 0x200
  
  #<freq>
  
  #<freq> BB:1 frequency = 1.00000 (heuristic)
  
  #<freq>
  
  .loc    1 209 32
  
  # 205          int a0, a1, a2, a3;
  
  # 206          float b0, b1, b2, b3;
  
  # 207          uint8_t *cm = cropTbl + MAX_NEG_CROP;
  
  # 208          int d0,d4=W4*col[8*4];
  
  # 209          float d1=col[8*1],d3=col[8*3],d5=col[8*5],d7=col[8*7];
  
  lh $8,80($6)                    # [0]  id:169
  
  .loc    1 209 20
  
  lh $10,48($6)                   # [1]  id:168
  
  .loc    1 209 32
  
  mtc1 $8,$f0                     # [2]
  
  .loc    1 209 8
  
  lh $12,16($6)                   # [2]  id:167
  
  .loc    1 209 20
  
  mtc1 $10,$f7                    # [3]
  
  .loc    1 224 9
  
  # 220          a3 = a0;
  
  # 221          a2 = a1;
  
  # 222
  
  # 223          BUTTERFLYADD0(b3,b0,d3,d5,-W5,W3);
  
  # 224          BUTTERFLY(d0,d4,col[8*2],col[8*6],W2,W6);
  
  lh $1,32($6)                    # [3]  id:172
  
  addiu $9,$0,30274               # [1]
  
  .loc    1 209 44
  
  lh $11,112($6)                  # [4]  id:170
  
  .loc    1 209 32
  
  cvt.s.w $f0,$f0                 # [5]
  
  .loc    1 224 9
  
  mult $1,$9                      # [5]
  
  .loc    1 209 8
  
  mtc1 $12,$f1                    # [4]
  
  .
  
  subu $15,$15,$25                # [13]
  
  .loc    1 234 9
  
  madd.s $f9,$f9,$f0,$f6          # [18]                             <== Float op [18], combined with integer op [18]
  
  .loc    1 232 9
  
  trunc.w.s $f13,$f13             # [21]
  
  .loc    1 217 9
  
  subu $2,$2,$3                   # [18]                             <== Integer op [18], combined with float op [18]
  
  .loc    1 236 9
  
  # 235          dest += line_size;
  
  # 236          dest[0] = cm[dest[0] + ((a2 + (int)b2) >> COL_SHIFT)];
  
  mul.s $f6,$f7,$f6               # [19]                             <== Float op [19], combined with integer op [19]
  
  .loc    1 217 9
  
  lui $3,8                        # [15]
  
  addu $2,$15,$2                  # [19]                             <== Integer op [19], combined with float op [19]
  
  .loc    1 224 9
  
  addu $9,$24,$9                  # [18]                             <== Integer op [18], combined with float op [18]
  
  .loc    1 232 9
  
  mfc1 $7,$f13                    # [23]
  
  .loc    1 217 9
  
  addu $2,$2,$3                   # [20]                             <== Integer op [20], combined with float op [20]
  
  .loc    1 236 9
  
  nmsub.s $f6,$f6,$f0,$f11        # [20]                             <== Float op [20], combined with integer op [20]
  
  .loc    1 227 9
  
  .
  
  lbu $8,384($8)                  # [48]  id:193 cropTbl+0x0
  
  .loc    1 245 9
  
  # 245          dest += line_size;
  
  addu $2,$5,$9                   # [45]
  
  .loc    1 244 9
  
  sb $8,0($9)                     # [49]  id:194
  
  .loc    1 246 9
  
  # 246          dest[0] = cm[dest[0] + ((a0 - (int)b0) >> COL_SHIFT)];
  
  subu $4,$6,$7                   # [46]
  
  lbu $3,0($2)                    # [50]  id:195
  
  sra $4,$4,20                    # [47]
  
  addu $3,$3,$4                   # [52]
  
  addu $1,$1,$3                   # [53]
  
  lbu $1,384($1)                  # [54]  id:196 cropTbl+0x0
  
  .loc    1 247 1
  
  # 247  }
  
  jr $31                          # [48]
  
  .loc    1 246 9
  
  sb $1,0($2)                     # [55]  id:197
  
  .end    idctSparseColAdd
  
  .section .text
  
  .align 6

Tadaaa, 55 ticks! Amazing what a little software pipelining can do for your code!

Running speedshop again proves it:

Code:

  
  Summary of statistical PC sampling data (fpcsampx)--
  
  49298: Total samples
  
  49.298: Accumulated time (secs.)
  
  1.0: Time per sample (msecs.)
  
  4: Sample bin width (bytes)
  
  -------------------------------------------------------------------------
  
  Function list, in descending order by time
  
  -------------------------------------------------------------------------
  
  [index]      secs    %    cum.%   samples  function (dso: file, line)
  
  [1]    20.239  41.1%  41.1%     20239  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 319)
  
  [2]     7.177  14.6%  55.6%      7177  idctSparseColAdd (mplayer: simple_idct.c, 209)
  
  [3]     4.454   9.0%  64.6%      4454  put_pixels8_c (mplayer: dsputil.c, 897)
  
  [4]     2.382   4.8%  69.5%      2382  msmpeg4_decode_block (mplayer: msmpeg4.c, 1676)
  
  [5]     1.948   4.0%  73.4%      1948  idctRowCondDC (mplayer: simple_idct.c, 104)
  
  [6]     1.551   3.1%  76.6%      1551  simple_idct_put (mplayer: simple_idct.c, 313)
  
  [7]     1.301   2.6%  79.2%      1301  put_pixels16_xy2_c (mplayer: dsputil.c, 897)

IDCT is a bit more scattered now, three routines instead of one, but added up (idctSparseColAdd + idctRowCondDC + simple_idct_put) they take 21.7%, which is about half the time of yuv2rgb_c_24_rgb (41.1%). Compare that with the starting speedshop run: this is a 50% reduction in time spent in IDCT! Looking at the total times, 64.0 versus 49.3 seconds is 23% less run time, or roughly a 1.3x speedup of the app. Whoa! Granted, it's only for this .avi file and this specific IDCT. Other codecs need other routines, but it's a start.
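For anyone who wants to redo the arithmetic on their own speedshop numbers, a tiny sketch (the helper names are hypothetical):

```c
/* Share of total run time, in percent. */
static double percent_of(double part, double total)
{
    return 100.0 * part / total;
}

/* Reduction in run time going from t_old to t_new, in percent. */
static double percent_saved(double t_old, double t_new)
{
    return 100.0 * (t_old - t_new) / t_old;
}
```

Feeding in the figures from the table, percent_of(7.177 + 1.948 + 1.551, 49.298) comes out at about 21.7 and percent_saved(64.0, 49.3) at about 23.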

Well, hope you have read through it and picked up some ideas. Next time i'll be looking at yuv2rgb.c software wise.

The patch for libavcodec/simple_idct.c is now living at http://www.mechanics.citg.tudelft.nl/~e ... pre4.patch get it and try it. Schleusel and i will try to pester the MPlayer CVS gurus to get us a libavcodec/mips subdirectory where we can store this.

%-)

dexter1
Moderator
Who joined Feb. 20, 2003, 6:57 a.m.
and authored 2062 notes

Wrote on the subject of SGI's first Linux-based Visualization System at Oct. 12, 2004, 5:17 a.m...

hamei wrote:

Brombear wrote:

hamei wrote:

From what I understand the Itanic is a lot like RISC in some ways - execution speed is very dependent on smart compilers. So what do they have on Linux ? Gcc ? hmmm.

I believe the intel compiler (icc) is used on these machines. Hard to guess its performance without real tests though

I just know what I've read about developers in the HP camp screaming bloody murder about "no tools ! no tools ! where the hell are all the optimized tools you promised us ?"

It's not icc and ifc/ifort, but ecc and efc/efort on Itanium systems. And there is one performance tool you can run on an itanium, because of its specific counter registers included in the CPU core.
The tool is called HistX and can be downloaded from the SGI site:

http://www.sgi.com/products/evaluation/altix_histx/

But i admit, i miss ssrun/cvd/perfex on Itanium systems. Also, there are a lot of funky performance issues: some code that runs like mad on the PIV Xeon will crawl on Itanium. And you're completely dependent on ecc for your optimisations, so no pragmas like on MIPS will help you. I have seen and tested an Itanium2 1.5 GHz with 3MB cache to be slightly slower than a PIV Xeon 3.05GHz with respect to floating point fortran code. And considering a dual Itanium2 machine costs triple the amount of a dual PIV machine, the choice is easily made...

dexter1
Moderator
From Zoetermeer, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of The Nekoware Tardist Build FAQ at Oct. 29, 2004, 5:11 p.m...

My first attempt at a FAQ for building tardists for nekoware. It's grossly incomplete, contains arrogant remarks, smells like dogpoo and is probably not very accurate. Please comment on this FAQ in the regular thread "Nekoware tardists build FAQ" started by O2ric and post suggestions! I'll be adding to this document at a hopefully steady rate. Hence my posting today, otherwise this will never get finished

The Nekoware Tardist Build FAQ

collected by Frank Everdij aka dexter1 29 oct 2004

Intro

This is supposed to be a guide to making nekoware tardist packages to be enjoyed on SGI IRIX machines, aka those funny colored machines with a cube logo up front which cost as much as a small car back in the days...

Rules

First of all we need to define some rules on what packages we intend to release as nekoware and on what platforms they should run. We aim at opensource software, GNU GPL or BSD license or other free/non-commercial license, usually developed on linux x86 machines. Because this opensource software was mainly developed on the x86 architecture, this poses some constraints on the target machine. It should have reasonably fast IO and fast integer performance, and for building code it should run a recent MIPSPro compiler which can squeeze as much performance from the code as possible.

As target SGI machines, we have chosen the following:

1) MIPS IV instruction set, meaning all machines with R5000 processors and up, which include: R5K Indy, R8K Indigo2, R10K Indigo2 Impact, R8K and R10K Challenge, R8K and R10K Onyx1, all O2, all Octane, all Onyx2, all Origin, Tezro, Fuel.
2) Minimum IRIX to run is 6.5.21m. On supportfolio you can get hold of 6.5.22m overlays, but take care to install Patch 5086 first if you plan to upgrade. People running less than 6.5.21 can use nekoware, but need some runtime modifications with RLD_LIST because of missing symbols, most notably strlcpy and strlcat. Ugly, but it works. Also take care to install the latest patches for the OS.
3) MIPSPro 7.4.x is the preferred compiler choice. MIPSPro 7.3.1.3m is also pretty good and for some apps it's even better than the 7.4.x series. We discourage the use of gcc 3.x, though sometimes it is the only choice. It is unclear whether C++ code compiled with g++ will play with MIPSPro compiled code, because of name mangling issues.
4) IRIX 6.5.22m is our preferred build platform OS. This IRIX is the end of the line for a lot of older machines and has mp3 support, UDF, IPV6 and NTP. In short a nice OS: fairly recent and, if reasonably patched, quite stable.

Port/Compile

Porting opensource to the IRIX platform is sometimes not an easy task. To name a few major problems:
1) Endianness. IRIX is big-endian, meaning that word/multiword storage starts with the most significant byte. x86 PCs are little-endian. So you can have situations where you need to swap bytes around before or after processing variables in routines.
2) ASM. Forget it. Optimised x86 assembly code will never run on MIPS. Ditch the software and find a C/C++ equivalent.
3) GCC-isms. A name bundling a variety of hacks (void pointer arithmetics), oddball coding (namespace clashes), wrong standards (C++ iostreambuf return pointer), hard-coded Makefiles (the worst) and botched/ignorant ./configure scripts (even worse than worst). Sometimes switching to c99 instead of cc fixes a lot of gcc-isms. In the code at least.
4) Different device handling, most notably parallel port programming and audio. No joysticks, no USB (except your mouse on the Tezro).
5) Performance/Speed. MIPS, being a floating point killer, has trouble with integer code. Most opensource code does not take this into consideration, so optimisations are needed to crank up the code to make it run at usable speeds on your IRIX box.
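As an illustration of the endianness point in 1), here is a minimal sketch of the two usual fixes: swapping a word that was read in the wrong byte order, and reading a little-endian value byte by byte so the host byte order doesn't matter. The function names are hypothetical, not from any particular package:

```c
#include <stdint.h>

/* Reverse the byte order of a 32-bit word. */
static uint32_t swap32(uint32_t v)
{
    return  (v >> 24)
         | ((v >>  8) & 0x0000ff00u)
         | ((v <<  8) & 0x00ff0000u)
         |  (v << 24);
}

/* Read a 32-bit little-endian value from a byte buffer.
   Gives the same result on big-endian and little-endian hosts. */
static uint32_t read_le32(const unsigned char *p)
{
    return  (uint32_t)p[0]
         | ((uint32_t)p[1] <<  8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

The byte-by-byte reader is usually the less error-prone of the two when parsing file formats, because it needs no #ifdef on the host byte order.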

Minor problems like misplaced or missing headers can usually be fixed with a bit of searching and #ifdef wrapping. __sgi is a nice symbol to use for that, though most ./configure scripts can determine that you're running MIPS and set defines accordingly. Defines which include BSD in the name are also sometimes a good idea, since IRIX, although System V based, carries a lot of BSD extensions.
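A sketch of what such #ifdef wrapping can look like, using the missing strlcpy on older IRIX as the example. compat_strlcpy is a made-up name, and HAVE_STRLCPY is assumed to be a define set by the build (for instance by ./configure) when the system already provides strlcpy:

```c
#include <stddef.h>
#include <string.h>

/* Fallback with strlcpy semantics: copy at most size-1 bytes,
   always NUL-terminate (if size > 0), return strlen(src). */
static size_t compat_strlcpy(char *dst, const char *src, size_t size)
{
    size_t len = strlen(src);

    if (size > 0) {
        size_t n = (len < size - 1) ? len : size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return len;
}

/* Only redirect strlcpy on IRIX systems that lack it. */
#if defined(__sgi) && !defined(HAVE_STRLCPY)
#define strlcpy compat_strlcpy
#endif
```

Keeping the workaround behind __sgi means the patch can go back upstream without disturbing other platforms.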

Selecting a good environment is necessary to get a proper build. A good starting point is:

Code:

  setenv CC cc
  setenv CXX CC
  setenv CFLAGS '-O3 -mips4 -n32'
  setenv CXXFLAGS '-O3 -mips4 -n32'
  setenv CPPFLAGS '-I/usr/nekoware/include'
  setenv LDFLAGS '-L/usr/nekoware/lib'

CC is the environment variable for the C compiler, which usually gets picked up by ./configure scripts. Sometimes one may have to substitute it with c99 (only for MIPSPro 7.4.x) or c89 to select a more gcc-like parsing and compiling behaviour:

Code:

  setenv CC c99
  setenv CC c89

CXX is the environment variable for the C++ compiler. It should be CC, or g++ if the code is too messy to compile with MIPSPro.

CFLAGS and CXXFLAGS should be small and neat. -O3 gives you the best optimisation for a first compile attempt. -mips4 selects MIPS IV instruction set optimisations. -n32 selects 32-bit addressing, so a program cannot address more than 2 GB, a limit opensource programs rarely hit.
/usr/nekoware/include and /usr/nekoware/lib should be good include and lib paths for porting an app to nekoware. If you need more libs to port an app, it may be an idea to first build it into /usr/local, so you can test the code before doing a proper nekoware build.

Other less used environment variables which can be picked up by ./configure are

Code:

  setenv PERL /usr/nekoware/bin/perl
  setenv GNUMAKE /usr/nekoware/bin/gmake
  setenv SED /usr/nekoware/bin/sed

Packaging

Try to start ports with the necessary prerequisites installed, preferably as nekoware packages if you plan to make a nekoware tardist. Although nekoware doesn't bite with freeware, some ./configure scripts can get confused by two similarly named libraries in different locations. Imagine a package including /usr/freeware/include/jpeg.h and linking with /usr/nekoware/lib/libjpeg.so...
So it's best to dedicate a machine to building.

I compile most opensource stuff for nekoware in /usr/local/src/<programname-version>
For building packages i have made a directory /usr/local/src/build, and in there i make a directory <programname> where i store the following. Let's take as an example a program called "program". In /usr/local/src/build/program :

1) neko_program.idb <- swpkg/gendist package list of files and actions
2) neko_program.spec <- specification file of subsystems, versions and dependencies
3) neko_program.txt <- text file with specifications on the build process, program version, used environment variables, ./configure options, dependencies needed, background info, sometimes test results or performance numbers.
4) program-1.2.3.tar.gz or program-1.2.3.tar.bz2 <- original program tarball. Always leave the tarball intact, because the source code and copyright notice should remain available.
5) neko_program-1.2.3_irix.patch <- patch file to patch the source. It should be a cat of diff -u output generated inside the source path /usr/local/src/program-1.2.3, so it can be applied there with "patch -p0 < neko_program-1.2.3_irix.patch"
6) clean.txt and program.txt <- These are ls -lR outputs of /usr/nekoware done as root, before and after a "make install", so i know what files have been added/deleted by doing a diff or xdiff of these two files.

5) and 6) are more or less my choices but 1) through 4) are mandatory files and should be included in the nekoware tardist build so people can recreate the tardist if needed.

More to come after my beauty sleep

dexter1
Moderator
From Voorburg, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of Alcohol Preferences at Nov. 21, 2004, 7:05 a.m...

Beer:
Grolsch (springcap bottle)
Duvel (8.5% Belgian beer, very bitter)

Spirits:
"ForestWalk" (cream liquor with Bananas)
Ouzo (vat12)
Absinth (Rare 55% Anise spirit with Terpene oil compounds, in small doses allowed in Europe)

dexter1
Moderator
From Voorburg, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of Alcohol Preferences at Nov. 22, 2004, 12:40 a.m...

Intel-OUTSIDE wrote:

dexter1 wrote: Spirits:
"ForestWalk" (cream liquor with Bananas)
Ouzo (vat12)
Absinth (Rare 55% Anise spirit with Terpene oil compounds, in small doses allowed in Europe)

you have got to be joking, especially the Ouzo, greek paint-stripper!!!

Hahaha, no really it's very doable

The anise hides the bitter taste of alcohol well. In fact, i hardly get any hangovers from Ouzo and it keeps you warm in cold Nerd-camps. Of course Absinth is something different altogether: http://www.eabsinthe.com/hills/serving.htm
it does give you a hangover, but oddly only my rightside frontal lobe felt sore the next morning. Must be the Thujone terpene binding to receptors for my abstract abilities...

dexter1
Moderator
From Voorburg, The Netherlands
Who joined Feb. 20, 2003, 6:57 a.m.

Wrote on the subject of Alcohol Preferences at Nov. 22, 2004, 7:10 a.m...

Hakimoto wrote: What I'm asking myself, dexter1, is if they're really still making Absinth with Artemisia absinthum or some other stuff like sage.

Yes, nowadays they make it with Artemisia absinthum or wormwood, which contains the Thujone. There is a legal European maximum dose of Thujone in absinth which is 10 mg per litre for drinks above 35% promillage. Formally, Absinth is forbidden in Holland, but a judge's ruling has declared this 1909 Absinth law invalid. So it is now legal to buy Absinth liquors like Tabu from Germany in stores in Holland. I have a bottle still open, care to join me?

For more information, visit http://www.groenefee.nl/ for some specific Dutch info.

And you can make me really happy with some genuine french Absinth like Francois Guy or Versinth La Blanche. Never tasted those, and you're closer to France than i am
