Parallel BZIP2 (PBZIP2)

Data Compression Software

by Jeff Gilchrist



PBZIP2 is a parallel implementation of the bzip2 block-sorting file compressor that uses pthreads and achieves near-linear speedup on SMP machines. The output of this version is fully compatible with bzip2 v1.0.2 or newer (i.e., anything compressed with pbzip2 can be decompressed with bzip2). PBZIP2 should work on any system that has a pthreads-compatible C++ compiler (such as gcc). It has been tested on Linux, Windows (cygwin & MinGW), Solaris, Tru64/OSF1, HP-UX, and Irix.

NOTE: If you are looking for a parallel BZIP2 that works on cluster machines, you should check out MPIBZIP2 which was designed for a distributed-memory message-passing architecture.

Screen Shot

PBZIP2 v1.0 Screen Shot


License/Disclaimer

This software is distributed under a BSD-style license. For details, see the file COPYING. Use at your own risk. I take no responsibility for anything that happens to your data or equipment. Always test (bzip2 -tv) a compressed file containing important data before deleting the original to verify the compression was successful.
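
The check recommended above can also be scripted. The following is a minimal Python sketch, not part of pbzip2, that assumes Python's standard bz2 module is available. The helper names file_digest and verify_bz2 are hypothetical. It decompresses the .bz2 file in chunks and compares its SHA-256 digest with that of the original, so it confirms the contents match rather than only checking archive integrity as bzip2 -tv does.

```python
import bz2
import hashlib


def file_digest(path, opener=open, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with opener(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bz2(original_path, compressed_path):
    """Return True if decompressing compressed_path reproduces original_path."""
    try:
        original = file_digest(original_path)
        # bz2.open decompresses transparently, including multi-stream files.
        restored = file_digest(compressed_path, opener=bz2.open)
    except (OSError, EOFError, ValueError):
        # Unreadable, corrupt, or truncated files end up here.
        return False
    return original == restored
```

Only delete the original once verify_bz2("myfile.tar", "myfile.tar.bz2") returns True.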

If you find this software useful or you are using it in a government/business/commercial environment, please consider making a donation to help support future improvements.


Download

Click to download the latest version:
Source Code: PBZIP2 v1.0.2 (23 KB) [SHA-1: 8ae0ebcd08761332ade6baa4b1172a3f97f71169]
[MD5: 7c959f0554695bc484865b938e791aaf]
SRPM: PBZIP2 v1.0.2 (28 KB) [SHA-1: 15378931ffa89d4b050d07c6a44c902e185d68cb]
[MD5: 300563c4ae7f61b18322cc4fb84d4fa6]
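
To check a download against the SHA-1 and MD5 values listed above, something like the following Python sketch could be used. It relies only on the standard hashlib module. The function names and the file name in the usage comment are assumptions made for illustration, since the download file names are not given here.

```python
import hashlib


def file_checksums(path, chunk_size=1 << 16):
    """Return (sha1_hex, md5_hex) for the file at path, read in chunks."""
    sha1 = hashlib.sha1()
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha1.update(chunk)
            md5.update(chunk)
    return sha1.hexdigest(), md5.hexdigest()


def matches_published(path, expected_sha1, expected_md5):
    """Compare a file's checksums with the published hex strings."""
    sha1, md5 = file_checksums(path)
    return sha1 == expected_sha1.lower() and md5 == expected_md5.lower()


# Usage (hypothetical file name; use the name the download was saved under):
# matches_published("pbzip2-1.0.2.tar.gz",
#                   "8ae0ebcd08761332ade6baa4b1172a3f97f71169",
#                   "7c959f0554695bc484865b938e791aaf")
```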
Pre-built Packages
On a Debian/Ubuntu system:  'apt-get update; apt-get install pbzip2' or get the Deb package
On a FreeBSD system:  'pkg_add -r pbzip2' or get the package
On a Gentoo system:  get the Ebuild package
On a Mandriva system:  'urpmi pbzip2'
On a NetBSD system:  get the package
On an OSX system:  'fink install pbzip2' or get the package
On a RedHat system:  'yum install pbzip2'
On a Slackware system:  get the package
On a Solaris system:  get the package from blastwave or from sunfreeware
 
Previous Version
Source Code: PBZIP2 v1.0.1 (22 KB)
SRPM: PBZIP2 v1.0.1 (26 KB)

Recent History

v1.0.2 (Jul. 25, 2007)
  • Added support to directly compress files without using threads when files are smaller than the specified block size or the system only has 1 CPU.  This will speed things up considerably if you are compressing many small files.  You can also force this behaviour by using -p1.
  • Added support for pbunzip2 symlink to automatically specify decompression mode
  • Changed pbzip2 exit code behaviour to match bzip2 for all error states (e.g., trying to compress a file that already has a .bz2 extension)
v1.0.1 (Mar. 20, 2007)
  • Added #ifdef PBZIP_NO_LOADAVG to remove load average code for UNIX systems that do not support it such as HP-UX and OSF1
v1.0 (Mar. 14, 2007)
  • Official non-beta release!
  • Fixed minor memory leak in queueDelete()
  • Added support for UNIX systems to modify max number of CPUs used based on load average
v0.9.6 (Feb. 5, 2006)
  • Fixed bug that caused blocks to be missed by decompression routine under certain conditions
v0.9.5 (Dec. 31, 2005)
  • Changed default output to silent like bzip2 and added -v switch to make verbose
  • Added support to autodetect number of CPUs on OSX
  • Added support to compile on Borland and other Windows compilers using pthreads-win32 open source library
  • Added decompression throttling in case of too much backlog in the file writer
  • Fixed bug from patch in 0.9.4 that limited file block size to 900k
  • Fixed bug that caused file output to fail with some large files
  • Fixed pthreads race condition that could cause random segfaults
  • Fixed pthreads resource issue that prevented pbzip2 from compressing a large number of files at once


Contributions

- Bryan Stillwell <bryan [at] bokeoa {dot} com> - code cleanup, RPM spec, and prep work for inclusion in Fedora Extras
- Dru Lemley [http://lemley.net/smp.html] - help with large file support
- Kir Kolyshkin <kir [at] sacred {dot} ru> - autodetection for # of CPUs
- Joergen Ramskov <joergen [at] ramskov {dot} org> - initial version of man page
- Peter Cordes <peter [at] cordes {dot} ca> - code cleanup
- Kurt Fitzner <kfitzner [at] excelcia {dot} org> - port to Windows compilers and decompression throttling
- Oliver Falk <oliver [at] linux-kernel {dot} at> - RPM spec update
- Jindrich Novy <jnovy [at] redhat {dot} com> - code cleanup and bug fixes
- Benjamin Reed <ranger [at] befunk {dot} com> - autodetection for # of CPUs in OSX and maintains OSX packages
- Chris Dearman <chris [at] mips {dot} com> - fixed pthreads race condition that led to pthread resources issues when processing large numbers of files and random segfaults
- Richard Russon <ntfs [at] flatcap {dot} org> - help fix decompression bug
- Paul Pluzhnikov <paul [at] parasoft {dot} com> - fixed minor memory leak
- Anibal Monsalve Salazar <anibal [at] debian {dot} org> - creates and maintains Debian packages
- Steve Christensen - creates and maintains Solaris packages (sunfreeware.com)
- Alessio Cervellin - creates and maintains Solaris packages (blastwave.org)
- Andre Przywara - creates and maintains Slackware packages (linuxpackages.net)
- Ying-Chieh Liao - created the FreeBSD port
- Andrew Pantyukhin <sat [at] FreeBSD {dot} org> - maintains the FreeBSD port and willing to resolve any FreeBSD-related problems
- Roland Illig - creates and maintains the NetBSD packages

Special Thanks for suggestions and testing to: Phillippe Welsh, Cassens Transport Co., James Terhune, Dru Lemley, Bryan Stillwell, George Chalissery, Kir Kolyshkin, Madhu Kangara, Mike Furr, Joergen Ramskov, Kurt Fitzner, Peter Cordes, Oliver Falk, Jindrich Novy, Benjamin Reed, Chris Dearman, Richard Russon, Anibal Monsalve Salazar, Jim Leonard, Paul Pluzhnikov, Robert Archard, Coran Fisher, Ken Takusagawa, David Pyke.


ToDo

- Add support for input from stdin & pipes


Benchmark Results

The following benchmark was performed using an SGI Altix 3700 Bx2 system with 128 1.6GHz Itanium2 Processors, 6MB cache, 256GB system memory running Linux Kernel 2.4.21-sgi306rp31 on the SHARCNET computing network.

Benchmark results for compressing 1.83GB of data on an Itanium2 1.6 GHz system.

The following benchmark was performed with various systems using a 900k block size.  The pbzip2 software was benchmarked with the Opteron and Pentium4 processors using a Linux 2.6 kernel while the Athlon used Windows XP.

Benchmark results for compressing 159MB of data with 900k block size on various machines.

For more benchmark information click here.


Usage

Run pbzip2 for the help listing.

===================================================================

Usage: pbzip2 [-1 .. -9] [-b#cdfklp#rtvV] <filename> <filename2> <filenameN>

-b#: where # is the file block size in 100k (default 9 = 900k)
-c : output to standard out (stdout)
-d : decompress file
-f : force, overwrite existing output file
-k : keep input file, don't delete
-l : load average determines max number processors to use
-p#: where # is the number of processors (default: autodetect)
-r : read entire input file into RAM and split between processors
-t : test compressed file integrity
-v : verbose mode
-V : display version info for pbzip2 then exit
-1 .. -9 : set BWT block size to 100k .. 900k (default 900k)

Example: pbzip2 -b15vk myfile.tar
Example: pbzip2 -p4 -r -5 myfile.tar second*.txt
Example: pbzip2 -d myfile.tar.bz2

===================================================================

The pbzip2 program is a parallel version of bzip2 for use on shared memory machines. It provides near-linear speedup when used on true multi-processor machines and 5-10% speedup on Hyperthreaded machines. The output is fully compatible with the regular bzip2 data so any files created with pbzip2 can be uncompressed by bzip2 and vice-versa.
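
The compatibility described above can be illustrated with a short Python sketch. This is not pbzip2's code. It assumes the approach amounts to compressing each file block as its own bzip2 stream and concatenating the streams in order, and it uses Python's standard bz2 module (which wraps libbzip2). The function name and parameters are hypothetical: file_block_size plays the role of -b and bwt_level the role of -1 .. -9.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def parallel_bz2_compress(data, file_block_size=900_000, bwt_level=9, workers=2):
    """Compress data as a series of independent, concatenated bzip2 streams."""
    # Cut the input into file blocks; an empty input still yields one block.
    blocks = [data[i:i + file_block_size]
              for i in range(0, len(data), file_block_size)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the streams stay in sequence.
        streams = pool.map(lambda block: bz2.compress(block, bwt_level), blocks)
        return b"".join(streams)


if __name__ == "__main__":
    payload = b"some repetitive payload " * 200_000
    packed = parallel_bz2_compress(payload, workers=4)
    # A plain single-threaded decompressor reads the concatenated streams.
    print(bz2.decompress(packed) == payload)
```

Because each block is a complete stream, an ordinary bzip2 decompressor simply reads one stream after another. Whether the threads actually run in parallel depends on the Python implementation releasing its interpreter lock during compression.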

The default settings for pbzip2 will work well in most cases. The only switches you will likely need are -d to decompress files and -p to set the # of processors for pbzip2 to use if autodetect is not supported on your system, or you want to use a specific # of CPUs.
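
The processor selection described above can be sketched as follows. This is an illustration in Python, not pbzip2's implementation. The function name and its parameters are hypothetical, and the way the load average reduces the processor count (subtracting the rounded 1-minute load, with a floor of 1) is an assumption about what -l does, not something stated here.

```python
import os


def choose_cpu_count(requested=None, use_load_average=False,
                     detect=os.cpu_count, load_average=None):
    """Pick how many processors to use, loosely mirroring -p and -l."""
    if requested is not None:
        total = requested          # explicit count, like -p#
    else:
        total = detect() or 2      # fall back to 2 if autodetect fails
    if use_load_average:
        if load_average is None:
            load_average = os.getloadavg  # not available on every platform
        one_minute = load_average()[0]
        # Assumption: subtract the rounded 1-minute load, never going below 1.
        total = max(1, total - int(one_minute + 0.5))
    return total
```

For instance, choose_cpu_count(requested=4) returns 4, while choose_cpu_count() returns the detected count, or 2 when detection is unavailable.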

Example 1:
pbzip2 -v myfile.tar

This example will compress the file "myfile.tar" into the compressed file "myfile.tar.bz2". It will use the autodetected # of processors (or 2 processors if autodetect is not supported) with the default file block size of 900k and default BWT block size of 900k.

The program would report something like:
===================================================================

Parallel BZIP2 v1.0.2 - by: Jeff Gilchrist [http://compression.ca]
[July 25, 2007] (uses libbzip2 by Julian Seward)

# CPUs: 2
BWT Block Size: 900k
File Block Size: 900k
-------------------------------------------
File #: 1 of 1
Input Name: myfile.tar
Output Name: myfile.tar.bz2

Input Size: 7428687 bytes
Compressing data...
Output Size: 3236549 bytes
-------------------------------------------

Wall Clock: 2.809000 seconds

===================================================================

Example 2:
pbzip2 -b15vk myfile.tar

This example will compress the file "myfile.tar" into the compressed file "myfile.tar.bz2". It will use the autodetected # of processors (or 2 processors if autodetect is not supported) with a file block size of 1500k and a BWT block size of 900k. The file "myfile.tar" will not be deleted after compression is finished.

The program would report something like:
===================================================================

Parallel BZIP2 v1.0.2 - by: Jeff Gilchrist [http://compression.ca]
[July 25, 2007] (uses libbzip2 by Julian Seward)

# CPUs: 2
BWT Block Size: 900k
File Block Size: 1500k
-------------------------------------------
File #: 1 of 1
Input Name: myfile.tar
Output Name: myfile.tar.bz2

Input Size: 7428687 bytes
Compressing data...
Output Size: 3236394 bytes
-------------------------------------------

Wall Clock: 3.059000 seconds

===================================================================
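
To get a feel for how the -b switch changes the amount of work available to the processors, the number of file blocks in Examples 1 and 2 can be estimated with the Python sketch below. It assumes that "100k" means 100,000 bytes, which is consistent with the 1857k figure reported in Example 3. The function name is hypothetical.

```python
import math


def file_block_count(input_size, b_switch=9):
    """Number of file blocks for an input, given the -b value (units of 100k)."""
    block_bytes = b_switch * 100_000  # assumption: "100k" means 100,000 bytes
    return max(1, math.ceil(input_size / block_bytes))


print(file_block_count(7428687, 9))   # Example 1 settings: prints 9
print(file_block_count(7428687, 15))  # Example 2 settings: prints 5
```

With fewer, larger blocks there are fewer pieces to share out, so a very large -b value on a small file can leave some processors idle.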

Example 3:
pbzip2 -p4 -r -5 -v myfile.tar second*.txt

This example will compress the file "myfile.tar" into the compressed file "myfile.tar.bz2". It will use 4 processors with a BWT block size of 500k. The file block size will be the size of "myfile.tar" divided by 4 (# of processors) so that the data will be split evenly among each processor. This requires you have enough RAM for pbzip2 to read the entire file into memory for compression. Pbzip2 will then use the same options to compress all other files that match the wildcard "second*.txt" in that directory.

The program would report something like:
===================================================================

Parallel BZIP2 v1.0.2 - by: Jeff Gilchrist [http://compression.ca]
[July 25, 2007] (uses libbzip2 by Julian Seward)

# CPUs: 4
BWT Block Size: 500k
File Block Size: 1857k
-------------------------------------------
File #: 1 of 3
Input Name: myfile.tar
Output Name: myfile.tar.bz2

Input Size: 7428687 bytes
Compressing data...
Output Size: 3237105 bytes
-------------------------------------------
File #: 2 of 3
Input Name: secondfile.txt
Output Name: secondfile.txt.bz2

Input Size: 5897 bytes
Compressing data...
Output Size: 3192 bytes
-------------------------------------------
File #: 3 of 3
Input Name: secondbreakfast.txt
Output Name: secondbreakfast.txt.bz2

Input Size: 83531 bytes
Compressing data...
Output Size: 11832 bytes
-------------------------------------------

Wall Clock: 5.127381 seconds

===================================================================
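
The 1857k file block size reported above for myfile.tar follows from dividing the input size by the number of processors. A small Python sketch of that arithmetic is given below. The function name is hypothetical, and rounding up (so that 4 blocks always cover the whole file) is an assumption, since the exact rounding is not documented here.

```python
def r_switch_block_size(input_size, processors):
    """File block size in bytes when the input is split evenly (-r)."""
    # Ceiling division: an assumption so the blocks always cover the file.
    return -(-input_size // processors)


size = r_switch_block_size(7428687, 4)
print(size, f"{size // 1000}k")  # prints: 1857172 1857k
```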

Example 4:
pbzip2 -dv myfile.tar.bz2

This example will decompress the file "myfile.tar.bz2" into the decompressed file "myfile.tar". It will use the autodetected # of processors (or 2 processors if autodetect is not supported). The switches -b, -r, and -1..-9 are not valid for decompression.

The program would report something like:
===================================================================

Parallel BZIP2 v1.0.2 - by: Jeff Gilchrist [http://compression.ca]
[July 25, 2007] (uses libbzip2 by Julian Seward)

# CPUs: 2
-------------------------------------------
File #: 1 of 1
Input Name: myfile.tar.bz2
Output Name: myfile.tar

BWT Block Size: 900k
Input Size: 3236549 bytes
Decompressing data...
Output Size: 7428687 bytes
-------------------------------------------

Wall Clock: 1.154000 seconds

===================================================================
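
One way parallel decompression could work is sketched below in Python. This is an assumption about the general technique, not a description of pbzip2's internals: the compressed data is split wherever a byte-aligned bzip2 stream header appears, and each piece is decompressed on its own thread. The names split_streams and parallel_bz2_decompress are hypothetical. A file written as a single stream yields a single piece, so nothing is gained in that case.

```python
import bz2
import re
from concurrent.futures import ThreadPoolExecutor

# "BZh" plus a level digit starts a stream; "1AY&SY" starts its first block.
STREAM_HEADER = re.compile(rb"BZh[1-9]1AY&SY")


def split_streams(compressed):
    """Split concatenated bzip2 streams at byte-aligned stream headers."""
    starts = [match.start() for match in STREAM_HEADER.finditer(compressed)]
    if not starts or starts[0] != 0:
        raise ValueError("input does not begin with a non-empty bzip2 stream")
    ends = starts[1:] + [len(compressed)]
    return [compressed[start:end] for start, end in zip(starts, ends)]


def parallel_bz2_decompress(compressed, workers=2):
    """Decompress each detected stream on a worker thread, keeping order."""
    try:
        pieces = split_streams(compressed)
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return b"".join(pool.map(bz2.decompress, pieces))
    except (OSError, ValueError, EOFError):
        # A false header match or unusual input: fall back to the serial path.
        return bz2.decompress(compressed)
```

The header pattern could in principle occur by chance inside compressed data, which is why the sketch falls back to ordinary serial decompression when any piece fails to decode.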

Bugs/Contact

If you would like to report any bugs or contact me related to the software you can reach me via e-mail at: PBZIP2 Contact Address


  • This web page is maintained by Jeff Gilchrist, Copyright © 2003-2008.