
Beta test: pigz 2.3

About this time last year, mia requested that some programs be packaged in this thread. Among them was something called pigz: a Parallel Implementation of GZip.

More recently I was looking for something simple to port and package, and this seemed to fit the bill - especially as testing it would be a good excuse for firing up the Origin. ;) I've done all that and uploaded neko_pigz-2.3.tardist to ftp.nekochan.net/incoming. I'd appreciate a careful eye from somebody familiar with the packaging process. The package includes release notes, the spec and idb files, and the original sources and patches, so it should be possible to see where I might have gone wrong.

I tested the program itself with a 1.1GB sendmail logfile. Stock /usr/sbin/gzip took an average of 244.4 seconds over three runs to compress at level 9 ("gzip -9") on an Origin 300 @ 600MHz. Limited to one thread, pigz averaged 243.1 seconds on the same system, and when using all 8 CPUs it averaged 36.9 seconds. Increasing the buffer/chunk size improved the average further: 256KB, 35.1 sec; 1MB, 34.4 sec.
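For anyone who wants to reproduce this, the runs looked roughly like the following - the filename is illustrative, and times came from the shell's time builtin:

Code:
# three timed runs of each variant; sendmail.log is the ~1.1GB logfile
time /usr/sbin/gzip -9 -c sendmail.log > sendmail.log.gz     # stock gzip
time pigz -p 1 -9 -c sendmail.log > sendmail.log.gz          # one thread
time pigz -p 8 -9 -c sendmail.log > sendmail.log.gz          # all 8 CPUs
time pigz -p 8 -9 -b 256 -c sendmail.log > sendmail.log.gz   # 256KB chunks
time pigz -p 8 -9 -b 1024 -c sendmail.log > sendmail.log.gz  # 1MB chunks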

Uncompressing the gzip output with pigz, using just 1 thread took 24.9 seconds (single run, but typical of others). But the same operation with 8 threads took 32.5 seconds - again a single run, but other runs were within a second of that. During the 8-way decompression runs pigz was typically using only 2/3 to 3/4 of two CPUs, though sometimes roughly that level of usage was spread across as many as five CPUs.

For more information about pigz, go to: http://zlib.net/pigz

_________________
Then? :IRIS3130: ... Now? :O3x02L: :A3504L: - :A3502L: :1600SW: +MLA :Fuel: :Octane2: :Octane: :Indigo2IMP: ... Other: DEC :BA213: :BA123: Sun , DG AViiON , NeXT :Cube:
Got a little ahead of myself - next time I'll PM somebody and get it moved into /beta before I post... :D PM sent, it'll be along sometime soonish-like.

_________________
Then? :IRIS3130: ... Now? :O3x02L: :A3504L: - :A3502L: :1600SW: +MLA :Fuel: :Octane2: :Octane: :Indigo2IMP: ... Other: DEC :BA213: :BA123: Sun , DG AViiON , NeXT :Cube:
The package is in /beta now. Share & Enjoy!

_________________
Then? :IRIS3130: ... Now? :O3x02L: :A3504L: - :A3502L: :1600SW: +MLA :Fuel: :Octane2: :Octane: :Indigo2IMP: ... Other: DEC :BA213: :BA123: Sun , DG AViiON , NeXT :Cube:
smj wrote:
... pigz: a Parallel Implementation of GZip. This package includes release notes, spec and idb file, and the original sources and patches so it should be possible to see where I might have gone wrong.

I ain't no authority but I installed it and noticed:

a) it works

b) has the standard nekoware organization

on the less-positive side

a) the name is kind of long so you've got to make swmgr really wiiiiiide to read the name. (8.3, anyone?) But it's not capitalized, which is great (that makes it fit in with the rest of the nekoware group all in one place).

b) your naming system is different from normal nekoware. In this case it's not different enough to cause a problem, but can you imagine if people started naming all the subsystems however they like? I'm already thinking up names ... this could get ugly :D

c) You should probably title the thread < beta test : pigz (inna poke) > or something, because neko doesn't have all the time in the world to keep track of whatever mischief we are up to, but I think he does watch for the < beta test : > threads.

On the functional side, it works, the man pages work, I tried several zip and unzip operations, there were no problems, and top said it was using 168% of my CPUs. I assume top's percentage is relative to a single processor.

I would go ahead and give 'er a +1 vote altho joerg might find something to say. He's got eyes like a middle-aged frizzy-haired Shanghai bureau lady.

Thank you. Another useful tool for the good guys, yay !

Quote:
Uncompressing the gzip output with pigz, using just 1 thread took 24.9 seconds (single run, but typical of others). But the same operation with 8 threads took 32.5 seconds

Is that correct or a misprint? Uncompressing with 8 threads took 7.5 seconds longer than a single thread?

_________________
waiting for flight 1203 ...
Thanks for the quick test - I take it the dependency on zlib worked alright? I used the package version from the zlib in nekoware/current, which might cause a complaint for somebody using an older version - but then, that seems like what it should be doing...

hamei wrote:
a) the name is kind of long so you've got to make swmgr really wiiiiiide to read the name. (8.3, anyone?)

I just pulled the package from /beta, opened it in swmgr, and the description is about the same length as neko_libidn's. Granted it's on the long side, but it's hardly the only package with a description of that length. Check out neko_fvwm or neko_wxGTK for some long names.

IMHO package descriptions that just repeat the name of the package aren't terribly helpful, so there's a trade-off to be made...

I did leave off the software version number (2.3), which I can fix with a respin. Is there something other than that and the length? The format seems to be "package X.Y.Z description".
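So once it's respun, the description line would presumably read something like:

Code:
neko_pigz 2.3 parallel implementation of gzip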

hamei wrote:
b) your naming system is different from normal nekoware. In this case it's not different enough to cause a problem but can you imagine if people started naming all the subsystems however they like ? I'm already thinking up names ... this could get ugly

You mean how I used neko_pigz.sw.eoe instead of neko_wget.sw.wget, and neko_pigz.man.manpages instead of neko_wget.man.wget? I was just following the wiki guide and what swpkg(1m) suggested...

I did notice that I put the relnotes in the "man" image instead of "opt" (i.e. pkg.man.relnotes instead of pkg.opt.relnotes), but that was because the discussion over here indicated the relnotes shouldn't be optional any more, and since "opt" is optional...
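For reference, the relevant part of my spec looks roughly like this - trimmed to the sw and man images and quoted from memory, so treat it as a sketch rather than gospel:

Code:
product neko_pigz
    id "neko_pigz parallel implementation of gzip"
    image sw
        id "Software"
        version 230
        subsys eoe default
            id "Base Software"
            exp neko_pigz.sw.eoe
        endsubsys
    endimage
    image man
        id "Man Pages"
        version 230
        subsys manpages default
            id "Man Pages"
            exp neko_pigz.man.manpages
        endsubsys
        subsys relnotes default
            id "Release notes and other docs"
            exp neko_pigz.man.relnotes
        endsubsys
    endimage
endproduct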

At any rate, if I can understand this better I'll happily correct it and add some guidance to the wiki page on packaging.


Quote:
I would go ahead and give 'er a +1 vote altho joerg might find something to say. He's got eyes like a middle-aged frizzy-haired Shanghai bureau lady.

If the package doesn't follow the standards, I'd rather respin the package. Just need to understand where I veered off.

hamei wrote:
smj wrote:
Uncompressing the gzip output with pigz, using just 1 thread took 24.9 seconds (single run, but typical of others). But the same operation with 8 threads took 32.5 seconds
Is that correct or a misprint? Uncompressing with 8 threads took 7.5 seconds longer than a single thread?

That's no typo, it took several seconds longer trying to use multiple threads -- in this specific example. Maybe a much larger compressed file would yield different results.

_________________
Then? :IRIS3130: ... Now? :O3x02L: :A3504L: - :A3502L: :1600SW: +MLA :Fuel: :Octane2: :Octane: :Indigo2IMP: ... Other: DEC :BA213: :BA123: Sun , DG AViiON , NeXT :Cube:
hamei wrote:
smj wrote:
Uncompressing the gzip output with pigz, using just 1 thread took 24.9 seconds (single run, but typical of others). But the same operation with 8 threads took 32.5 seconds
Is that correct or a misprint? Uncompressing with 8 threads took 7.5 seconds longer than a single thread?

That's no typo, it took several seconds longer trying to use multiple threads -- in this specific example. Maybe a much larger compressed file would yield different results.

If I'm not mistaken, the documentation for pigz explicitly states that it does not provide any benefit for archives generated with 'normal' gzip. I'm surprised to see that it's so much worse.
Thread synchronization overhead. Take a benchmarking program like, I dunno, pvm-povray or c-ray or whatever, and run it with thousands and thousands more threads than it can handle. It can spend more time switching and dealing with synchronization than doing actual work. That is one of the reasons why massively parallel programming is such an art or science.
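Easy enough to see for yourself with anything that takes a thread-count knob - the filename here is made up and the numbers will vary by machine, but the shape of the curve shouldn't:

Code:
# wall-clock time stops improving, then degrades, as the
# thread count climbs well past the number of CPUs
time pigz -p 8   -c big.log > /dev/null
time pigz -p 64  -c big.log > /dev/null
time pigz -p 512 -c big.log > /dev/null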


R.

_________________
The god of death eats only apples.

An opened bracket must always be closed. -- a programmer

:Tezro: :Tezro: :Onyx2R: :Onyx2RE: :Onyx2: :O3x04R: :O3x0: :O200: :Octane: :Octane2: :O2: :O2: :Indigo2IMP: :PI: :PI: :1600SW: :1600SW: :Indy: :Indy: :Indy: :Indy: :Indy:
:hpserv: J5600, 2 x Mac, 3 x SUN, Alpha DS20E, Alpha 800 5/550, 3 x RS/6000, Amiga 4000 VideoToaster, Amiga4000 -030, 733MHz Sam440 AmigaOS 4.1 update 1. Tandem Himalaya S-Series Nonstop S72000 ServerNet.

Sold: :Indy: :Indy: :Indy: :Indigo:

Cortex ---> http://www.facebook.com/pages/Cortex-th ... 11?sk=info
Minnie ---> http://www.facebook.com/pages/Minnie-th ... 02?sk=info
Book ----> http://pymblesoftware.com/book/
Github ---> https://github.com/pymblesoftware
Visit http://www.pymblesoftware.com
Search for "Pymble", "InstaElf", "CryWhy" or "Cricket Score Sheet" in the iPad App store or search for "Pymble" or "CryWhy" in the iPhone App store.
hi sm - didn't mean to sound critical, was just trying to be as picky as possible.
smj wrote:
You mean how I used neko_pigz.sw.eoe instead of neko_wget.sw.wget and neko_pigz.man.manpages instead of neko_wget.man.wget?

I was thinking of the subsystem names. Freeware and most of the nekoware I've looked at use

distribution files
execution only environment
info pages
man pages
original source code
release notes
headers
shared libraries
patches
and a few I've forgotten

you used

Base Software
Man Pages
Package Sources
Packaging Files
Patches for this package
Release notes and other docs

It's not a sin, definitely not worth redoing, but if you want to be like the rest of nekoware, that's been the de facto setup in the past.
Quote:
If the package doesn't follow the standards, I'd rather respin the package. Just need to understand where I veered off.

I would say no, it's fine, just go with the previously-used subsystem names in the future (if you want. If you don't want, I'm not god.)

Use your time to build another package instead :P

Quote:
That's no typo, it took several seconds longer trying to use multiple threads -- in this specific example. Maybe a much larger compressed file would yield different results.

It might be advantageous for the pigz people and the pbzip2 people to do some research on the most efficient use of parallel threads in an archiver - and then use that.

In OS/2 it would be stupid to create the same number of threads as there are processors: OS/2 uses a round-robin scheduling approach, in which the next ready-to-go thread gets the next available processor. So if the most efficient way to execute the program is with sixteen threads, give the thing sixteen threads and let the operating system's scheduler figure out how to apportion them. The guys who write kernels are a hell of a lot more knowledgeable about operating system internals than the guys writing userland applications.

I can't imagine Irix is a lot worse about scheduling threads than OS/2. Maybe, but it seems unlikely.

_________________
waiting for flight 1203 ...
I'm running tests with an 11GB logfile, but in the meantime I thought the following passage from the pbzip2 man page was interesting:
Quote:
Files that were compressed using bzip2 will not see speedup
since bzip2 packages the data into a single chunk that
cannot be split between processors.

Maybe there's a similar structure to the gzip output stream. Last time I only decompressed each file with the opposite tool from the one that compressed it, since I was using that as verification of the pigz build. After I've finished the compression timing runs, I'll see if decompressing the pigz-compressed files with multiple threads/CPUs offers any speedup.
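The plan, roughly (filenames illustrative):

Code:
# compress once with all 8 CPUs...
pigz -p 8 -c maillog > maillog.gz
# ...then compare decompression with 1 thread vs. 8
time pigz -d -p 1 -c maillog.gz > /dev/null
time pigz -d -p 8 -c maillog.gz > /dev/null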

Right now all these tests have been run from a single 10k SCSI drive, so we're going to find a limit there. If I didn't have work to do, I could drag out a couple different drive arrays...

_________________
Then? :IRIS3130: ... Now? :O3x02L: :A3504L: - :A3502L: :1600SW: +MLA :Fuel: :Octane2: :Octane: :Indigo2IMP: ... Other: DEC :BA213: :BA123: Sun , DG AViiON , NeXT :Cube:
smj wrote:
I'm running tests with an 11GB logfile, but in the meantime I thought the following passage from the pbzip2 man page was interesting:
Quote:
Files that were compressed using bzip2 will not see speedup
since bzip2 packages the data into a single chunk that
cannot be split between processors.

Sheer conjecture here, but with their approach (just split the file into pieces, zip each piece, then join them together?) it would seem that different file sizes would benefit from different numbers of threads, e.g.

file < 500KB = 1 thread
file < 2MB = 2 threads
file < 6MB = 4 threads

therefore, rather than jumping to "your computer has 8 CPUs, therefore we will default to 8 threads", they should be looking at the file size and determining how many threads to use from that - something like the sketch below. Then don't change the number of threads according to what's in the box. Let the damned scheduler take care of that. You'd lose a little in some instances but in the big picture it would work better. IMO :P
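Something like this, say - the cutoffs are pulled out of thin air, just to show the shape of the idea:

Code:
#!/bin/sh
# pick the thread count from the file size, not from the CPU count
f="$1"
size=`wc -c < "$f" | tr -d ' '`
if [ $size -lt 512000 ]; then
    p=1
elif [ $size -lt 2097152 ]; then
    p=2
elif [ $size -lt 6291456 ]; then
    p=4
else
    p=8
fi
exec pigz -p $p "$f"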

After years of OS/2-ing, this is one of the things I used to hate about Windows programmers - they assume that they are the only thing running. "8p? Okay, we'll default to 8 threads!" Guess what, you stupid shits? There are other processes running! Figure out what is optimum for the application, then let the os do the scheduling, and don't try to take over the entire computer all for yourselves!

They never listened though. They just moved over to Linux and brought all their Windows baggage with them :(

_________________
waiting for flight 1203 ...
Ran the timing tests again using an 11GB sendmail logfile. Times are the average of three runs compressing that logfile:
Code:
gzip:              37'00" (2,220.7 seconds)
pigz -p 8:         6'15"  (375 seconds)            [defaults to 128KB block/chunk size]
pigz -p 8 -b 256:  5'00"  (300.3 seconds)
pigz -p 8 -b 1024: 4'13"  (253.3 seconds)
pigz -p 8 -b 2048: 4'25"  (265 seconds)
Glancing at gr_osview from time to time during runs, none of them kept the CPUs running flat-out - probably hitting an I/O limit with that single drive. The 1MB block size appeared to do the best.


On the decompression side, looks like the gzip scheme doesn't lend itself to the scenario suggested in the pbzip2 man page. Just did a single run of each to see if it was worth pursuing.

Uncompressing gzip-compressed logfile:
pigz -p 1: 5'48"
pigz -p 8: 6'16"

Uncompressing logfile compressed with "pigz -p 8":
pigz -p 8 -d: 6'22"

So that's that for this little informal benchmarking exercise. At least for tonight. ;)

_________________
Then? :IRIS3130: ... Now? :O3x02L: :A3504L: - :A3502L: :1600SW: +MLA :Fuel: :Octane2: :Octane: :Indigo2IMP: ... Other: DEC :BA213: :BA123: Sun , DG AViiON , NeXT :Cube:
smj wrote:
Code:
gzip:              37'00" (2,220.7 seconds)
pigz -p 8:         6'15"  (375 seconds)            [defaults to 128KB block/chunk size]
pigz -p 8 -b 256:  5'00"  (300.3 seconds)
pigz -p 8 -b 1024: 4'13"  (253.3 seconds)
pigz -p 8 -b 2048: 4'25"  (265 seconds)
Glancing at gr_osview from time to time during runs, none of them kept the CPUs running flat-out - probably hitting an I/O limit with that single drive. The 1MB block size appeared to do the best.

SM - just for fun, next time you're bored try running the same tests with 4p?

_________________
waiting for flight 1203 ...
hamei wrote:
Sheer conjecture here, but with their approach (just split the file into pieces, zip each piece, then join them together?) it would seem that different file sizes would benefit from different numbers of threads, e.g.

file < 500KB = 1 thread
file < 2MB = 2 threads
file < 6MB = 4 threads

therefore, rather than jumping to "your computer has 8 CPUs, therefore we will default to 8 threads", they should be looking at the file size and determining how many threads to use from that.

Given that this is a dedicated parallel compression utility, I don't think taking the simple approach is unreasonable. You might wish it tried to use all four CPUs from the beginning on your old reliable Pentium III Xeon machine, rather than second-guessing the user. Should it try to infer something about I/O bandwidth and adjust to that?

hamei wrote:
Then don't change the number of threads according to what's in the box. Let the damned scheduler take care of that. You'd lose a little in some instances but in the big picture it would work better. IMO :P

I'm sure it could be built to adjust the number of threads dynamically during a run, but I can't think of an OS scheduler - kernel code, in most cases - that's going to tell the application it should be spawning more threads. Maybe that's just because I should've been asleep four hours ago...

_________________
Then? :IRIS3130: ... Now? :O3x02L: :A3504L: - :A3502L: :1600SW: +MLA :Fuel: :Octane2: :Octane: :Indigo2IMP: ... Other: DEC :BA213: :BA123: Sun , DG AViiON , NeXT :Cube:
smj wrote:
Given that this is a dedicated parallel compression utility, I don't think taking the simple approach is unreasonable.

Ah, but my approach is also reasonable :)

I am guessing, but would bet fifty cents that the efficiency of their approach - take the file, split it into x pieces, give one piece to each thread - is dependent on the size of the file. Obviously using 24 threads to zip a 120k file would be silly, right? I am going to go out on a limb and imagine that you can determine pretty well how many threads will work best just from the size of the file to be compressed. Obviously there would be outliers, but in general.

From your example, block size could be involved as well.

Hence, the zipper checks the size of the file, decides how many threads and what blocksize to use, then steps into the batter's box.

Quote:
You might wish it tried to use all four CPUs from the beginning on your old reliable Pentium III Xeon machine, rather than second-guessing the user. Should it try to infer something about I/O bandwidth and adjust to that?

But the guys who wrote the kernel are smarter than the user! The scheduler does take into account I/O, bandwidth, CPU load, which process currently has the highest priority, etc. etc. In OS/2 there are 32 different dynamically-adjusted priority levels in the scheduler. (Hate to keep bringing up OS/2, but that's the system I've read a little about.) Remember - this isn't the only thing running on the computer! Just because the zip application might run 0.02% quicker, if the rest of the computer gets all bogged down because of that, it's a crappy application.

I believe that deciding to use 8 threads just because the box has 8 cores is not the correct approach. First off, the zipper does not get the entire computer all to itself anyhow. Second, you don't want the damned thing to take over the computer. And, I doubt that using the number of p in a box to determine the most efficient number of threads to generate is optimum. It could even be pessimum :)

Quote:
I'm sure it could be built to adjust the number of threads dynamically during a run, but I can't think of an OS scheduler - kernel code, in most cases - that's going to tell the application it should be spawning more threads.

I don't think it would need to. It seems very likely to me that the number of threads for most efficient zipping could be determined by some fairly simple attributes. So check those attributes, determine the proper number of threads, then let the scheduler (programmed by some damned smart people and taking into account a lot more than just "how fast can we zip this file") take over feeding the processors.

Anyway, that's my story and I'm stickin' to it :D

_________________
waiting for flight 1203 ...