diff --git a/Makefile b/Makefile index f8a1772..b0fef95 100644 --- a/Makefile +++ b/Makefile @@ -135,7 +135,7 @@ bzip2recover.o: bzip2recover.c distclean: clean - rm -f manual.ps manual.html manual.pdf + rm -f manual.ps manual.html manual.pdf bzip2.txt bzip2.1.preformatted DISTNAME=bzip2-1.0.8 dist: check manual @@ -205,7 +205,13 @@ dist: check manual MANUAL_SRCS= bz-common.xsl bz-fo.xsl bz-html.xsl bzip.css \ entities.xml manual.xml -manual: manual.html manual.ps manual.pdf +bzip2.txt: bzip2.1 + MANWIDTH=67 man --ascii ./$^ > $@ + +bzip2.1.preformatted: bzip2.1 + MAN_KEEP_FORMATTING=1 MANWIDTH=67 man -E UTF-8 ./$^ > $@ + +manual: manual.html manual.ps manual.pdf bzip2.txt bzip2.1.preformatted manual.ps: $(MANUAL_SRCS) ./xmlproc.sh -ps manual.xml diff --git a/bzip2.1.preformatted b/bzip2.1.preformatted deleted file mode 100644 index 787f1c6..0000000 --- a/bzip2.1.preformatted +++ /dev/null @@ -1,399 +0,0 @@ -bzip2(1) bzip2(1) - - - -NNAAMMEE - bzip2, bunzip2 − a block‐sorting file compressor, v1.0.8 - bzcat − decompresses files to stdout - bzip2recover − recovers data from damaged bzip2 files - - -SSYYNNOOPPSSIISS - bbzziipp22 [ −−ccddffkkqqssttvvzzVVLL112233445566778899 ] [ _f_i_l_e_n_a_m_e_s _._._. ] - bbuunnzziipp22 [ −−ffkkvvssVVLL ] [ _f_i_l_e_n_a_m_e_s _._._. ] - bbzzccaatt [ −−ss ] [ _f_i_l_e_n_a_m_e_s _._._. ] - bbzziipp22rreeccoovveerr _f_i_l_e_n_a_m_e - - -DDEESSCCRRIIPPTTIIOONN - _b_z_i_p_2 compresses files using the Burrows‐Wheeler block - sorting text compression algorithm, and Huffman coding. - Compression is generally considerably better than that - achieved by more conventional LZ77/LZ78‐based compressors, - and approaches the performance of the PPM family of sta­ - tistical compressors. - - The command‐line options are deliberately very similar to - those of _G_N_U _g_z_i_p_, but they are not identical. - - _b_z_i_p_2 expects a list of file names to accompany the com­ - mand‐line flags. Each file is replaced by a compressed - version of itself, with the name "original_name.bz2". - Each compressed file has the same modification date, per­ - missions, and, when possible, ownership as the correspond­ - ing original, so that these properties can be correctly - restored at decompression time. File name handling is - naive in the sense that there is no mechanism for preserv­ - ing original file names, permissions, ownerships or dates - in filesystems which lack these concepts, or have serious - file name length restrictions, such as MS‐DOS. - - _b_z_i_p_2 and _b_u_n_z_i_p_2 will by default not overwrite existing - files. If you want this to happen, specify the −f flag. - - If no file names are specified, _b_z_i_p_2 compresses from - standard input to standard output. In this case, _b_z_i_p_2 - will decline to write compressed output to a terminal, as - this would be entirely incomprehensible and therefore - pointless. - - _b_u_n_z_i_p_2 (or _b_z_i_p_2 _−_d_) decompresses all specified files. - Files which were not created by _b_z_i_p_2 will be detected and - ignored, and a warning issued. _b_z_i_p_2 attempts to guess - the filename for the decompressed file from that of the - compressed file as follows: - - filename.bz2 becomes filename - filename.bz becomes filename - filename.tbz2 becomes filename.tar - filename.tbz becomes filename.tar - anyothername becomes anyothername.out - - If the file does not end in one of the recognised endings, - _._b_z_2_, _._b_z_, _._t_b_z_2 or _._t_b_z_, _b_z_i_p_2 complains that it cannot - guess the name of the original file, and uses the original - name with _._o_u_t appended. - - As with compression, supplying no filenames causes decom­ - pression from standard input to standard output. - - _b_u_n_z_i_p_2 will correctly decompress a file which is the con­ - catenation of two or more compressed files. The result is - the concatenation of the corresponding uncompressed files. - Integrity testing (−t) of concatenated compressed files is - also supported. - - You can also compress or decompress files to the standard - output by giving the −c flag. Multiple files may be com­ - pressed and decompressed like this. The resulting outputs - are fed sequentially to stdout. Compression of multiple - files in this manner generates a stream containing multi­ - ple compressed file representations. Such a stream can be - decompressed correctly only by _b_z_i_p_2 version 0.9.0 or - later. Earlier versions of _b_z_i_p_2 will stop after decom­ - pressing the first file in the stream. - - _b_z_c_a_t (or _b_z_i_p_2 _‐_d_c_) decompresses all specified files to - the standard output. - - _b_z_i_p_2 will read arguments from the environment variables - _B_Z_I_P_2 and _B_Z_I_P_, in that order, and will process them - before any arguments read from the command line. This - gives a convenient way to supply default arguments. - - Compression is always performed, even if the compressed - file is slightly larger than the original. Files of less - than about one hundred bytes tend to get larger, since the - compression mechanism has a constant overhead in the - region of 50 bytes. Random data (including the output of - most file compressors) is coded at about 8.05 bits per - byte, giving an expansion of around 0.5%. - - As a self‐check for your protection, _b_z_i_p_2 uses 32‐bit - CRCs to make sure that the decompressed version of a file - is identical to the original. This guards against corrup­ - tion of the compressed data, and against undetected bugs - in _b_z_i_p_2 (hopefully very unlikely). The chances of data - corruption going undetected is microscopic, about one - chance in four billion for each file processed. Be aware, - though, that the check occurs upon decompression, so it - can only tell you that something is wrong. It can’t help - you recover the original uncompressed data. You can use - _b_z_i_p_2_r_e_c_o_v_e_r to try to recover data from damaged files. - - Return values: 0 for a normal exit, 1 for environmental - problems (file not found, invalid flags, I/O errors, &c), - 2 to indicate a corrupt compressed file, 3 for an internal - consistency error (eg, bug) which caused _b_z_i_p_2 to panic. - - -OOPPTTIIOONNSS - −−cc ‐‐‐‐ssttddoouutt - Compress or decompress to standard output. - - −−dd ‐‐‐‐ddeeccoommpprreessss - Force decompression. _b_z_i_p_2_, _b_u_n_z_i_p_2 and _b_z_c_a_t are - really the same program, and the decision about - what actions to take is done on the basis of which - name is used. This flag overrides that mechanism, - and forces _b_z_i_p_2 to decompress. - - −−zz ‐‐‐‐ccoommpprreessss - The complement to −d: forces compression, - regardless of the invocation name. - - −−tt ‐‐‐‐tteesstt - Check integrity of the specified file(s), but don’t - decompress them. This really performs a trial - decompression and throws away the result. - - −−ff ‐‐‐‐ffoorrccee - Force overwrite of output files. Normally, _b_z_i_p_2 - will not overwrite existing output files. Also - forces _b_z_i_p_2 to break hard links to files, which it - otherwise wouldn’t do. - - bzip2 normally declines to decompress files which - don’t have the correct magic header bytes. If - forced (‐f), however, it will pass such files - through unmodified. This is how GNU gzip behaves. - - −−kk ‐‐‐‐kkeeeepp - Keep (don’t delete) input files during compression - or decompression. - - −−ss ‐‐‐‐ssmmaallll - Reduce memory usage, for compression, decompression - and testing. Files are decompressed and tested - using a modified algorithm which only requires 2.5 - bytes per block byte. This means any file can be - decompressed in 2300k of memory, albeit at about - half the normal speed. - - During compression, −s selects a block size of - 200k, which limits memory use to around the same - figure, at the expense of your compression ratio. - In short, if your machine is low on memory (8 - megabytes or less), use −s for everything. See - MEMORY MANAGEMENT below. - - −−qq ‐‐‐‐qquuiieett - Suppress non‐essential warning messages. Messages - pertaining to I/O errors and other critical events - will not be suppressed. - - −−vv ‐‐‐‐vveerrbboossee - Verbose mode ‐‐ show the compression ratio for each - file processed. Further −v’s increase the ver­ - bosity level, spewing out lots of information which - is primarily of interest for diagnostic purposes. - - −−LL ‐‐‐‐lliicceennssee ‐‐VV ‐‐‐‐vveerrssiioonn - Display the software version, license terms and - conditions. - - −−11 ((oorr −−−−ffaasstt)) ttoo −−99 ((oorr −−−−bbeesstt)) - Set the block size to 100 k, 200 k .. 900 k when - compressing. Has no effect when decompressing. - See MEMORY MANAGEMENT below. The −−fast and −−best - aliases are primarily for GNU gzip compatibility. - In particular, −−fast doesn’t make things signifi­ - cantly faster. And −−best merely selects the - default behaviour. - - −−‐‐ Treats all subsequent arguments as file names, even - if they start with a dash. This is so you can han­ - dle files with names beginning with a dash, for - example: bzip2 −‐ −myfilename. - - −−‐‐rreeppeettiittiivvee‐‐ffaasstt ‐‐‐‐rreeppeettiittiivvee‐‐bbeesstt - These flags are redundant in versions 0.9.5 and - above. They provided some coarse control over the - behaviour of the sorting algorithm in earlier ver­ - sions, which was sometimes useful. 0.9.5 and above - have an improved algorithm which renders these - flags irrelevant. - - -MMEEMMOORRYY MMAANNAAGGEEMMEENNTT - _b_z_i_p_2 compresses large files in blocks. The block size - affects both the compression ratio achieved, and the - amount of memory needed for compression and decompression. - The flags −1 through −9 specify the block size to be - 100,000 bytes through 900,000 bytes (the default) respec­ - tively. At decompression time, the block size used for - compression is read from the header of the compressed - file, and _b_u_n_z_i_p_2 then allocates itself just enough memory - to decompress the file. Since block sizes are stored in - compressed files, it follows that the flags −1 to −9 are - irrelevant to and so ignored during decompression. - - Compression and decompression requirements, in bytes, can - be estimated as: - - Compression: 400k + ( 8 x block size ) - - Decompression: 100k + ( 4 x block size ), or - 100k + ( 2.5 x block size ) - - Larger block sizes give rapidly diminishing marginal - returns. Most of the compression comes from the first two - or three hundred k of block size, a fact worth bearing in - mind when using _b_z_i_p_2 on small machines. It is also - important to appreciate that the decompression memory - requirement is set at compression time by the choice of - block size. - - For files compressed with the default 900k block size, - _b_u_n_z_i_p_2 will require about 3700 kbytes to decompress. To - support decompression of any file on a 4 megabyte machine, - _b_u_n_z_i_p_2 has an option to decompress using approximately - half this amount of memory, about 2300 kbytes. Decompres­ - sion speed is also halved, so you should use this option - only where necessary. The relevant flag is ‐s. - - In general, try and use the largest block size memory con­ - straints allow, since that maximises the compression - achieved. Compression and decompression speed are virtu­ - ally unaffected by block size. - - Another significant point applies to files which fit in a - single block ‐‐ that means most files you’d encounter - using a large block size. The amount of real memory - touched is proportional to the size of the file, since the - file is smaller than a block. For example, compressing a - file 20,000 bytes long with the flag ‐9 will cause the - compressor to allocate around 7600k of memory, but only - touch 400k + 20000 * 8 = 560 kbytes of it. Similarly, the - decompressor will allocate 3700k but only touch 100k + - 20000 * 4 = 180 kbytes. - - Here is a table which summarises the maximum memory usage - for different block sizes. Also recorded is the total - compressed size for 14 files of the Calgary Text Compres­ - sion Corpus totalling 3,141,622 bytes. This column gives - some feel for how compression varies with block size. - These figures tend to understate the advantage of larger - block sizes for larger files, since the Corpus is domi­ - nated by smaller files. - - Compress Decompress Decompress Corpus - Flag usage usage ‐s usage Size - - ‐1 1200k 500k 350k 914704 - ‐2 2000k 900k 600k 877703 - ‐3 2800k 1300k 850k 860338 - ‐4 3600k 1700k 1100k 846899 - ‐5 4400k 2100k 1350k 845160 - ‐6 5200k 2500k 1600k 838626 - ‐7 6100k 2900k 1850k 834096 - ‐8 6800k 3300k 2100k 828642 - ‐9 7600k 3700k 2350k 828642 - - -RREECCOOVVEERRIINNGG DDAATTAA FFRROOMM DDAAMMAAGGEEDD FFIILLEESS - _b_z_i_p_2 compresses files in blocks, usually 900kbytes long. - Each block is handled independently. If a media or trans­ - mission error causes a multi‐block .bz2 file to become - damaged, it may be possible to recover data from the - undamaged blocks in the file. - - The compressed representation of each block is delimited - by a 48‐bit pattern, which makes it possible to find the - block boundaries with reasonable certainty. Each block - also carries its own 32‐bit CRC, so damaged blocks can be - distinguished from undamaged ones. - - _b_z_i_p_2_r_e_c_o_v_e_r is a simple program whose purpose is to - search for blocks in .bz2 files, and write each block out - into its own .bz2 file. You can then use _b_z_i_p_2 −t to test - the integrity of the resulting files, and decompress those - which are undamaged. - - _b_z_i_p_2_r_e_c_o_v_e_r takes a single argument, the name of the dam­ - aged file, and writes a number of files - "rec00001file.bz2", "rec00002file.bz2", etc, containing - the extracted blocks. The output filenames are - designed so that the use of wildcards in subsequent pro­ - cessing ‐‐ for example, "bzip2 ‐dc rec*file.bz2 > recov­ - ered_data" ‐‐ processes the files in the correct order. - - _b_z_i_p_2_r_e_c_o_v_e_r should be of most use dealing with large .bz2 - files, as these will contain many blocks. It is clearly - futile to use it on damaged single‐block files, since a - damaged block cannot be recovered. If you wish to min­ - imise any potential data loss through media or transmis­ - sion errors, you might consider compressing with a smaller - block size. - - -PPEERRFFOORRMMAANNCCEE NNOOTTEESS - The sorting phase of compression gathers together similar - strings in the file. Because of this, files containing - very long runs of repeated symbols, like "aabaabaabaab - ..." (repeated several hundred times) may compress more - slowly than normal. Versions 0.9.5 and above fare much - better than previous versions in this respect. The ratio - between worst‐case and average‐case compression time is in - the region of 10:1. For previous versions, this figure - was more like 100:1. You can use the −vvvv option to mon­ - itor progress in great detail, if you want. - - Decompression speed is unaffected by these phenomena. - - _b_z_i_p_2 usually allocates several megabytes of memory to - operate in, and then charges all over it in a fairly ran­ - dom fashion. This means that performance, both for com­ - pressing and decompressing, is largely determined by the - speed at which your machine can service cache misses. - Because of this, small changes to the code to reduce the - miss rate have been observed to give disproportionately - large performance improvements. I imagine _b_z_i_p_2 will per­ - form best on machines with very large caches. - - -CCAAVVEEAATTSS - I/O error messages are not as helpful as they could be. - _b_z_i_p_2 tries hard to detect I/O errors and exit cleanly, - but the details of what the problem is sometimes seem - rather misleading. - - This manual page pertains to version 1.0.8 of _b_z_i_p_2_. Com­ - pressed data created by this version is entirely forwards - and backwards compatible with the previous public - releases, versions 0.1pl2, 0.9.0, 0.9.5, 1.0.0, 1.0.1, - 1.0.2 and above, but with the following exception: 0.9.0 - and above can correctly decompress multiple concatenated - compressed files. 0.1pl2 cannot do this; it will stop - after decompressing just the first file in the stream. - - _b_z_i_p_2_r_e_c_o_v_e_r versions prior to 1.0.2 used 32‐bit integers - to represent bit positions in compressed files, so they - could not handle compressed files more than 512 megabytes - long. Versions 1.0.2 and above use 64‐bit ints on some - platforms which support them (GNU supported targets, and - Windows). To establish whether or not bzip2recover was - built with such a limitation, run it without arguments. - In any event you can build yourself an unlimited version - if you can recompile it with MaybeUInt64 set to be an - unsigned 64‐bit integer. - - - - -AAUUTTHHOORR - Julian Seward, jseward@acm.org. - - https://sourceware.org/bzip2/ - - The ideas embodied in _b_z_i_p_2 are due to (at least) the fol­ - lowing people: Michael Burrows and David Wheeler (for the - block sorting transformation), David Wheeler (again, for - the Huffman coder), Peter Fenwick (for the structured cod­ - ing model in the original _b_z_i_p_, and many refinements), and - Alistair Moffat, Radford Neal and Ian Witten (for the - arithmetic coder in the original _b_z_i_p_)_. I am much - indebted for their help, support and advice. See the man­ - ual in the source distribution for pointers to sources of - documentation. Christian von Roques encouraged me to look - for faster sorting algorithms, so as to speed up compres­ - sion. Bela Lubkin encouraged me to improve the worst‐case - compression performance. Donna Robinson XMLised the docu­ - mentation. The bz* scripts are derived from those of GNU - gzip. Many people sent patches, helped with portability - problems, lent machines, gave advice and were generally - helpful. - - - - bzip2(1) diff --git a/bzip2.txt b/bzip2.txt deleted file mode 100644 index a50570b..0000000 --- a/bzip2.txt +++ /dev/null @@ -1,391 +0,0 @@ - -NAME - bzip2, bunzip2 - a block-sorting file compressor, v1.0.8 - bzcat - decompresses files to stdout - bzip2recover - recovers data from damaged bzip2 files - - -SYNOPSIS - bzip2 [ -cdfkqstvzVL123456789 ] [ filenames ... ] - bunzip2 [ -fkvsVL ] [ filenames ... ] - bzcat [ -s ] [ filenames ... ] - bzip2recover filename - - -DESCRIPTION - bzip2 compresses files using the Burrows-Wheeler block - sorting text compression algorithm, and Huffman coding. - Compression is generally considerably better than that - achieved by more conventional LZ77/LZ78-based compressors, - and approaches the performance of the PPM family of sta- - tistical compressors. - - The command-line options are deliberately very similar to - those of GNU gzip, but they are not identical. - - bzip2 expects a list of file names to accompany the com- - mand-line flags. Each file is replaced by a compressed - version of itself, with the name "original_name.bz2". - Each compressed file has the same modification date, per- - missions, and, when possible, ownership as the correspond- - ing original, so that these properties can be correctly - restored at decompression time. File name handling is - naive in the sense that there is no mechanism for preserv- - ing original file names, permissions, ownerships or dates - in filesystems which lack these concepts, or have serious - file name length restrictions, such as MS-DOS. - - bzip2 and bunzip2 will by default not overwrite existing - files. If you want this to happen, specify the -f flag. - - If no file names are specified, bzip2 compresses from - standard input to standard output. In this case, bzip2 - will decline to write compressed output to a terminal, as - this would be entirely incomprehensible and therefore - pointless. - - bunzip2 (or bzip2 -d) decompresses all specified files. - Files which were not created by bzip2 will be detected and - ignored, and a warning issued. bzip2 attempts to guess - the filename for the decompressed file from that of the - compressed file as follows: - - filename.bz2 becomes filename - filename.bz becomes filename - filename.tbz2 becomes filename.tar - filename.tbz becomes filename.tar - anyothername becomes anyothername.out - - If the file does not end in one of the recognised endings, - .bz2, .bz, .tbz2 or .tbz, bzip2 complains that it cannot - guess the name of the original file, and uses the original - name with .out appended. - - As with compression, supplying no filenames causes decom- - pression from standard input to standard output. - - bunzip2 will correctly decompress a file which is the con- - catenation of two or more compressed files. The result is - the concatenation of the corresponding uncompressed files. - Integrity testing (-t) of concatenated compressed files is - also supported. - - You can also compress or decompress files to the standard - output by giving the -c flag. Multiple files may be com- - pressed and decompressed like this. The resulting outputs - are fed sequentially to stdout. Compression of multiple - files in this manner generates a stream containing multi- - ple compressed file representations. Such a stream can be - decompressed correctly only by bzip2 version 0.9.0 or - later. Earlier versions of bzip2 will stop after decom- - pressing the first file in the stream. - - bzcat (or bzip2 -dc) decompresses all specified files to - the standard output. - - bzip2 will read arguments from the environment variables - BZIP2 and BZIP, in that order, and will process them - before any arguments read from the command line. This - gives a convenient way to supply default arguments. - - Compression is always performed, even if the compressed - file is slightly larger than the original. Files of less - than about one hundred bytes tend to get larger, since the - compression mechanism has a constant overhead in the - region of 50 bytes. Random data (including the output of - most file compressors) is coded at about 8.05 bits per - byte, giving an expansion of around 0.5%. - - As a self-check for your protection, bzip2 uses 32-bit - CRCs to make sure that the decompressed version of a file - is identical to the original. This guards against corrup- - tion of the compressed data, and against undetected bugs - in bzip2 (hopefully very unlikely). The chances of data - corruption going undetected is microscopic, about one - chance in four billion for each file processed. Be aware, - though, that the check occurs upon decompression, so it - can only tell you that something is wrong. It can't help - you recover the original uncompressed data. You can use - bzip2recover to try to recover data from damaged files. - - Return values: 0 for a normal exit, 1 for environmental - problems (file not found, invalid flags, I/O errors, &c), - 2 to indicate a corrupt compressed file, 3 for an internal - consistency error (eg, bug) which caused bzip2 to panic. - - -OPTIONS - -c --stdout - Compress or decompress to standard output. - - -d --decompress - Force decompression. bzip2, bunzip2 and bzcat are - really the same program, and the decision about - what actions to take is done on the basis of which - name is used. This flag overrides that mechanism, - and forces bzip2 to decompress. - - -z --compress - The complement to -d: forces compression, - regardless of the invocation name. - - -t --test - Check integrity of the specified file(s), but don't - decompress them. This really performs a trial - decompression and throws away the result. - - -f --force - Force overwrite of output files. Normally, bzip2 - will not overwrite existing output files. Also - forces bzip2 to break hard links to files, which it - otherwise wouldn't do. - - bzip2 normally declines to decompress files which - don't have the correct magic header bytes. If - forced (-f), however, it will pass such files - through unmodified. This is how GNU gzip behaves. - - -k --keep - Keep (don't delete) input files during compression - or decompression. - - -s --small - Reduce memory usage, for compression, decompression - and testing. Files are decompressed and tested - using a modified algorithm which only requires 2.5 - bytes per block byte. This means any file can be - decompressed in 2300k of memory, albeit at about - half the normal speed. - - During compression, -s selects a block size of - 200k, which limits memory use to around the same - figure, at the expense of your compression ratio. - In short, if your machine is low on memory (8 - megabytes or less), use -s for everything. See - MEMORY MANAGEMENT below. - - -q --quiet - Suppress non-essential warning messages. Messages - pertaining to I/O errors and other critical events - will not be suppressed. - - -v --verbose - Verbose mode -- show the compression ratio for each - file processed. Further -v's increase the ver- - bosity level, spewing out lots of information which - is primarily of interest for diagnostic purposes. - - -L --license -V --version - Display the software version, license terms and - conditions. - - -1 (or --fast) to -9 (or --best) - Set the block size to 100 k, 200 k .. 900 k when - compressing. Has no effect when decompressing. - See MEMORY MANAGEMENT below. The --fast and --best - aliases are primarily for GNU gzip compatibility. - In particular, --fast doesn't make things signifi- - cantly faster. And --best merely selects the - default behaviour. - - -- Treats all subsequent arguments as file names, even - if they start with a dash. This is so you can han- - dle files with names beginning with a dash, for - example: bzip2 -- -myfilename. - - --repetitive-fast --repetitive-best - These flags are redundant in versions 0.9.5 and - above. They provided some coarse control over the - behaviour of the sorting algorithm in earlier ver- - sions, which was sometimes useful. 0.9.5 and above - have an improved algorithm which renders these - flags irrelevant. - - -MEMORY MANAGEMENT - bzip2 compresses large files in blocks. The block size - affects both the compression ratio achieved, and the - amount of memory needed for compression and decompression. - The flags -1 through -9 specify the block size to be - 100,000 bytes through 900,000 bytes (the default) respec- - tively. At decompression time, the block size used for - compression is read from the header of the compressed - file, and bunzip2 then allocates itself just enough memory - to decompress the file. Since block sizes are stored in - compressed files, it follows that the flags -1 to -9 are - irrelevant to and so ignored during decompression. - - Compression and decompression requirements, in bytes, can - be estimated as: - - Compression: 400k + ( 8 x block size ) - - Decompression: 100k + ( 4 x block size ), or - 100k + ( 2.5 x block size ) - - Larger block sizes give rapidly diminishing marginal - returns. Most of the compression comes from the first two - or three hundred k of block size, a fact worth bearing in - mind when using bzip2 on small machines. It is also - important to appreciate that the decompression memory - requirement is set at compression time by the choice of - block size. - - For files compressed with the default 900k block size, - bunzip2 will require about 3700 kbytes to decompress. To - support decompression of any file on a 4 megabyte machine, - bunzip2 has an option to decompress using approximately - half this amount of memory, about 2300 kbytes. Decompres- - sion speed is also halved, so you should use this option - only where necessary. The relevant flag is -s. - - In general, try and use the largest block size memory con- - straints allow, since that maximises the compression - achieved. Compression and decompression speed are virtu- - ally unaffected by block size. - - Another significant point applies to files which fit in a - single block -- that means most files you'd encounter - using a large block size. The amount of real memory - touched is proportional to the size of the file, since the - file is smaller than a block. For example, compressing a - file 20,000 bytes long with the flag -9 will cause the - compressor to allocate around 7600k of memory, but only - touch 400k + 20000 * 8 = 560 kbytes of it. Similarly, the - decompressor will allocate 3700k but only touch 100k + - 20000 * 4 = 180 kbytes. - - Here is a table which summarises the maximum memory usage - for different block sizes. Also recorded is the total - compressed size for 14 files of the Calgary Text Compres- - sion Corpus totalling 3,141,622 bytes. This column gives - some feel for how compression varies with block size. - These figures tend to understate the advantage of larger - block sizes for larger files, since the Corpus is domi- - nated by smaller files. - - Compress Decompress Decompress Corpus - Flag usage usage -s usage Size - - -1 1200k 500k 350k 914704 - -2 2000k 900k 600k 877703 - -3 2800k 1300k 850k 860338 - -4 3600k 1700k 1100k 846899 - -5 4400k 2100k 1350k 845160 - -6 5200k 2500k 1600k 838626 - -7 6100k 2900k 1850k 834096 - -8 6800k 3300k 2100k 828642 - -9 7600k 3700k 2350k 828642 - - -RECOVERING DATA FROM DAMAGED FILES - bzip2 compresses files in blocks, usually 900kbytes long. - Each block is handled independently. If a media or trans- - mission error causes a multi-block .bz2 file to become - damaged, it may be possible to recover data from the - undamaged blocks in the file. - - The compressed representation of each block is delimited - by a 48-bit pattern, which makes it possible to find the - block boundaries with reasonable certainty. Each block - also carries its own 32-bit CRC, so damaged blocks can be - distinguished from undamaged ones. - - bzip2recover is a simple program whose purpose is to - search for blocks in .bz2 files, and write each block out - into its own .bz2 file. You can then use bzip2 -t to test - the integrity of the resulting files, and decompress those - which are undamaged. - - bzip2recover takes a single argument, the name of the dam- - aged file, and writes a number of files - "rec00001file.bz2", "rec00002file.bz2", etc, containing - the extracted blocks. The output filenames are - designed so that the use of wildcards in subsequent pro- - cessing -- for example, "bzip2 -dc rec*file.bz2 > recov- - ered_data" -- processes the files in the correct order. - - bzip2recover should be of most use dealing with large .bz2 - files, as these will contain many blocks. It is clearly - futile to use it on damaged single-block files, since a - damaged block cannot be recovered. If you wish to min- - imise any potential data loss through media or transmis- - sion errors, you might consider compressing with a smaller - block size. - - -PERFORMANCE NOTES - The sorting phase of compression gathers together similar - strings in the file. Because of this, files containing - very long runs of repeated symbols, like "aabaabaabaab - ..." (repeated several hundred times) may compress more - slowly than normal. Versions 0.9.5 and above fare much - better than previous versions in this respect. The ratio - between worst-case and average-case compression time is in - the region of 10:1. For previous versions, this figure - was more like 100:1. You can use the -vvvv option to mon- - itor progress in great detail, if you want. - - Decompression speed is unaffected by these phenomena. - - bzip2 usually allocates several megabytes of memory to - operate in, and then charges all over it in a fairly ran- - dom fashion. This means that performance, both for com- - pressing and decompressing, is largely determined by the - speed at which your machine can service cache misses. - Because of this, small changes to the code to reduce the - miss rate have been observed to give disproportionately - large performance improvements. I imagine bzip2 will per- - form best on machines with very large caches. - - -CAVEATS - I/O error messages are not as helpful as they could be. - bzip2 tries hard to detect I/O errors and exit cleanly, - but the details of what the problem is sometimes seem - rather misleading. - - This manual page pertains to version 1.0.8 of bzip2. Com- - pressed data created by this version is entirely forwards - and backwards compatible with the previous public - releases, versions 0.1pl2, 0.9.0, 0.9.5, 1.0.0, 1.0.1, - 1.0.2 and above, but with the following exception: 0.9.0 - and above can correctly decompress multiple concatenated - compressed files. 0.1pl2 cannot do this; it will stop - after decompressing just the first file in the stream. - - bzip2recover versions prior to 1.0.2 used 32-bit integers - to represent bit positions in compressed files, so they - could not handle compressed files more than 512 megabytes - long. Versions 1.0.2 and above use 64-bit ints on some - platforms which support them (GNU supported targets, and - Windows). To establish whether or not bzip2recover was - built with such a limitation, run it without arguments. - In any event you can build yourself an unlimited version - if you can recompile it with MaybeUInt64 set to be an - unsigned 64-bit integer. - - -AUTHOR - Julian Seward, jseward@acm.org - - https://sourceware.org/bzip2/ - - The ideas embodied in bzip2 are due to (at least) the fol- - lowing people: Michael Burrows and David Wheeler (for the - block sorting transformation), David Wheeler (again, for - the Huffman coder), Peter Fenwick (for the structured cod- - ing model in the original bzip, and many refinements), and - Alistair Moffat, Radford Neal and Ian Witten (for the - arithmetic coder in the original bzip). I am much - indebted for their help, support and advice. See the man- - ual in the source distribution for pointers to sources of - documentation. Christian von Roques encouraged me to look - for faster sorting algorithms, so as to speed up compres- - sion. Bela Lubkin encouraged me to improve the worst-case - compression performance. Donna Robinson XMLised the docu- - mentation. The bz* scripts are derived from those of GNU - gzip. Many people sent patches, helped with portability - problems, lent machines, gave advice and were generally - helpful. - diff --git a/prepare-release.sh b/prepare-release.sh index 12c29f7..1bc8474 100755 --- a/prepare-release.sh +++ b/prepare-release.sh @@ -45,7 +45,7 @@ sed -i -e "s@ENTITY bz-version \".*\"@ENTITY bz-version \"$VERSION\"@" \ # isn't, so explicitly change it here too. sed -i -e "s@This manual page pertains to version .* of@This manual page pertains to version $VERSION of@" \ -e "s@sorting file compressor, v.*@sorting file compressor, v$VERSION@" \ - bzip2.1* bzip2.txt + bzip2.1 # Update sources. All sources, use bzlib_private. # Except bzip2recover, which embeds a version string...