bzip2-0.1

2024-11-23 03:33:27 +08:00 · 1997-08-07 22:13:13 +02:00 · 1997-08-07 22:13:13 +02:00 · 33d1340302
commit 33d1340302
22 changed files with 6550 additions and 0 deletions
--- a/47
+++ b/47
@ -0,0 +1,47 @@
+
+Bzip2 is not research work, in the sense that it doesn't present any
+new ideas.  Rather, it's an engineering exercise based on existing
+ideas.
+
+Four documents describe essentially all the ideas behind bzip2:
+ 
+   Michael Burrows and D. J. Wheeler:
+     "A block-sorting lossless data compression algorithm"
+      10th May 1994. 
+      Digital SRC Research Report 124.
+      ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz
+
+   Daniel S. Hirschberg and Debra A. Lelewer
+     "Efficient Decoding of Prefix Codes"
+      Communications of the ACM, April 1990, Vol 33, Number 4.
+      You might be able to get an electronic copy of this
+         from the ACM Digital Library.
+
+   David J. Wheeler
+      Program bred3.c and accompanying document bred3.ps.
+      This contains the idea behind the multi-table Huffman
+      coding scheme.
+      ftp://ftp.cl.cam.ac.uk/pub/user/djw3/
+
+   Jon L. Bentley and Robert Sedgewick
+     "Fast Algorithms for Sorting and Searching Strings"
+      Available from Sedgewick's web page,
+      www.cs.princeton.edu/~rs
+
+The following paper gives valuable additional insights into the
+algorithm, but is not immediately the basis of any code
+used in bzip2.
+
+   Peter Fenwick:
+      Block Sorting Text Compression
+      Proceedings of the 19th Australasian Computer Science Conference,
+        Melbourne, Australia.  Jan 31 - Feb 2, 1996.
+      ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps
+      
+All of these are well written, and make fascinating reading.  If you want
+to modify bzip2 in any non-trivial way, I strongly suggest you obtain,
+read and understand these papers.
+
+I am much indebted to the various authors for their help, support and
+advice.
+
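For readers who have not met block sorting before, the forward transform described in the Burrows and Wheeler report can be sketched in a few lines of C. This is only an illustration under simplifying assumptions: it sorts all rotations naively with qsort, which is far too slow for real blocks and makes no attempt to match the sorting used in bzip2.c. The names bwt_forward and cmp_rotations are hypothetical and do not appear in the bzip2 sources.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Shared state for the comparator (qsort takes no context argument). */
static const unsigned char *bwt_text;
static size_t bwt_len;

/* Compare two rotations of bwt_text, identified by their start offsets. */
static int cmp_rotations(const void *pa, const void *pb)
{
    size_t a = *(const size_t *)pa, b = *(const size_t *)pb, k;
    for (k = 0; k < bwt_len; k++) {
        unsigned char ca = bwt_text[(a + k) % bwt_len];
        unsigned char cb = bwt_text[(b + k) % bwt_len];
        if (ca != cb) return ca < cb ? -1 : 1;
    }
    return 0;
}

/* Forward transform (illustrative, not bzip2's code): writes the last
   column of the sorted rotation matrix into out (n bytes) and returns
   the row at which the original string appears, or (size_t)-1 if the
   work array cannot be allocated. */
size_t bwt_forward(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t *rot, i, primary = 0;
    assert(n > 0);
    rot = malloc(n * sizeof *rot);
    if (rot == NULL) return (size_t)-1;
    for (i = 0; i < n; i++) rot[i] = i;
    bwt_text = in;
    bwt_len = n;
    qsort(rot, n, sizeof *rot, cmp_rotations);
    for (i = 0; i < n; i++) {
        if (rot[i] == 0) primary = i;        /* row holding the input */
        out[i] = in[(rot[i] + n - 1) % n];   /* byte preceding the rotation */
    }
    free(rot);
    return primary;
}
```

Called on the six bytes of "banana", this produces "nnbaaa" with row index 3. Equal bytes end up clustered together, which is what makes the later coding stages effective.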
--- a/339
+++ b/339
@ -0,0 +1,339 @@
+		    GNU GENERAL PUBLIC LICENSE
+		       Version 2, June 1991
+
+ Copyright (C) 1989, 1991 Free Software Foundation, Inc.
+                          675 Mass Ave, Cambridge, MA 02139, USA
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+			    Preamble
+
+  The licenses for most software are designed to take away your
+freedom to share and change it.  By contrast, the GNU General Public
+License is intended to guarantee your freedom to share and change free
+software--to make sure the software is free for all its users.  This
+General Public License applies to most of the Free Software
+Foundation's software and to any other program whose authors commit to
+using it.  (Some other Free Software Foundation software is covered by
+the GNU Library General Public License instead.)  You can apply it to
+your programs, too.
+
+  When we speak of free software, we are referring to freedom, not
+price.  Our General Public Licenses are designed to make sure that you
+have the freedom to distribute copies of free software (and charge for
+this service if you wish), that you receive source code or can get it
+if you want it, that you can change the software or use pieces of it
+in new free programs; and that you know you can do these things.
+
+  To protect your rights, we need to make restrictions that forbid
+anyone to deny you these rights or to ask you to surrender the rights.
+These restrictions translate to certain responsibilities for you if you
+distribute copies of the software, or if you modify it.
+
+  For example, if you distribute copies of such a program, whether
+gratis or for a fee, you must give the recipients all the rights that
+you have.  You must make sure that they, too, receive or can get the
+source code.  And you must show them these terms so they know their
+rights.
+
+  We protect your rights with two steps: (1) copyright the software, and
+(2) offer you this license which gives you legal permission to copy,
+distribute and/or modify the software.
+
+  Also, for each author's protection and ours, we want to make certain
+that everyone understands that there is no warranty for this free
+software.  If the software is modified by someone else and passed on, we
+want its recipients to know that what they have is not the original, so
+that any problems introduced by others will not reflect on the original
+authors' reputations.
+
+  Finally, any free program is threatened constantly by software
+patents.  We wish to avoid the danger that redistributors of a free
+program will individually obtain patent licenses, in effect making the
+program proprietary.  To prevent this, we have made it clear that any
+patent must be licensed for everyone's free use or not licensed at all.
+
+  The precise terms and conditions for copying, distribution and
+modification follow.
+
+		    GNU GENERAL PUBLIC LICENSE
+   TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
+
+  0. This License applies to any program or other work which contains
+a notice placed by the copyright holder saying it may be distributed
+under the terms of this General Public License.  The "Program", below,
+refers to any such program or work, and a "work based on the Program"
+means either the Program or any derivative work under copyright law:
+that is to say, a work containing the Program or a portion of it,
+either verbatim or with modifications and/or translated into another
+language.  (Hereinafter, translation is included without limitation in
+the term "modification".)  Each licensee is addressed as "you".
+
+Activities other than copying, distribution and modification are not
+covered by this License; they are outside its scope.  The act of
+running the Program is not restricted, and the output from the Program
+is covered only if its contents constitute a work based on the
+Program (independent of having been made by running the Program).
+Whether that is true depends on what the Program does.
+
+  1. You may copy and distribute verbatim copies of the Program's
+source code as you receive it, in any medium, provided that you
+conspicuously and appropriately publish on each copy an appropriate
+copyright notice and disclaimer of warranty; keep intact all the
+notices that refer to this License and to the absence of any warranty;
+and give any other recipients of the Program a copy of this License
+along with the Program.
+
+You may charge a fee for the physical act of transferring a copy, and
+you may at your option offer warranty protection in exchange for a fee.
+
+  2. You may modify your copy or copies of the Program or any portion
+of it, thus forming a work based on the Program, and copy and
+distribute such modifications or work under the terms of Section 1
+above, provided that you also meet all of these conditions:
+
+    a) You must cause the modified files to carry prominent notices
+    stating that you changed the files and the date of any change.
+
+    b) You must cause any work that you distribute or publish, that in
+    whole or in part contains or is derived from the Program or any
+    part thereof, to be licensed as a whole at no charge to all third
+    parties under the terms of this License.
+
+    c) If the modified program normally reads commands interactively
+    when run, you must cause it, when started running for such
+    interactive use in the most ordinary way, to print or display an
+    announcement including an appropriate copyright notice and a
+    notice that there is no warranty (or else, saying that you provide
+    a warranty) and that users may redistribute the program under
+    these conditions, and telling the user how to view a copy of this
+    License.  (Exception: if the Program itself is interactive but
+    does not normally print such an announcement, your work based on
+    the Program is not required to print an announcement.)
+
+These requirements apply to the modified work as a whole.  If
+identifiable sections of that work are not derived from the Program,
+and can be reasonably considered independent and separate works in
+themselves, then this License, and its terms, do not apply to those
+sections when you distribute them as separate works.  But when you
+distribute the same sections as part of a whole which is a work based
+on the Program, the distribution of the whole must be on the terms of
+this License, whose permissions for other licensees extend to the
+entire whole, and thus to each and every part regardless of who wrote it.
+
+Thus, it is not the intent of this section to claim rights or contest
+your rights to work written entirely by you; rather, the intent is to
+exercise the right to control the distribution of derivative or
+collective works based on the Program.
+
+In addition, mere aggregation of another work not based on the Program
+with the Program (or with a work based on the Program) on a volume of
+a storage or distribution medium does not bring the other work under
+the scope of this License.
+
+  3. You may copy and distribute the Program (or a work based on it,
+under Section 2) in object code or executable form under the terms of
+Sections 1 and 2 above provided that you also do one of the following:
+
+    a) Accompany it with the complete corresponding machine-readable
+    source code, which must be distributed under the terms of Sections
+    1 and 2 above on a medium customarily used for software interchange; or,
+
+    b) Accompany it with a written offer, valid for at least three
+    years, to give any third party, for a charge no more than your
+    cost of physically performing source distribution, a complete
+    machine-readable copy of the corresponding source code, to be
+    distributed under the terms of Sections 1 and 2 above on a medium
+    customarily used for software interchange; or,
+
+    c) Accompany it with the information you received as to the offer
+    to distribute corresponding source code.  (This alternative is
+    allowed only for noncommercial distribution and only if you
+    received the program in object code or executable form with such
+    an offer, in accord with Subsection b above.)
+
+The source code for a work means the preferred form of the work for
+making modifications to it.  For an executable work, complete source
+code means all the source code for all modules it contains, plus any
+associated interface definition files, plus the scripts used to
+control compilation and installation of the executable.  However, as a
+special exception, the source code distributed need not include
+anything that is normally distributed (in either source or binary
+form) with the major components (compiler, kernel, and so on) of the
+operating system on which the executable runs, unless that component
+itself accompanies the executable.
+
+If distribution of executable or object code is made by offering
+access to copy from a designated place, then offering equivalent
+access to copy the source code from the same place counts as
+distribution of the source code, even though third parties are not
+compelled to copy the source along with the object code.
+
+  4. You may not copy, modify, sublicense, or distribute the Program
+except as expressly provided under this License.  Any attempt
+otherwise to copy, modify, sublicense or distribute the Program is
+void, and will automatically terminate your rights under this License.
+However, parties who have received copies, or rights, from you under
+this License will not have their licenses terminated so long as such
+parties remain in full compliance.
+
+  5. You are not required to accept this License, since you have not
+signed it.  However, nothing else grants you permission to modify or
+distribute the Program or its derivative works.  These actions are
+prohibited by law if you do not accept this License.  Therefore, by
+modifying or distributing the Program (or any work based on the
+Program), you indicate your acceptance of this License to do so, and
+all its terms and conditions for copying, distributing or modifying
+the Program or works based on it.
+
+  6. Each time you redistribute the Program (or any work based on the
+Program), the recipient automatically receives a license from the
+original licensor to copy, distribute or modify the Program subject to
+these terms and conditions.  You may not impose any further
+restrictions on the recipients' exercise of the rights granted herein.
+You are not responsible for enforcing compliance by third parties to
+this License.
+
+  7. If, as a consequence of a court judgment or allegation of patent
+infringement or for any other reason (not limited to patent issues),
+conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License.  If you cannot
+distribute so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you
+may not distribute the Program at all.  For example, if a patent
+license would not permit royalty-free redistribution of the Program by
+all those who receive copies directly or indirectly through you, then
+the only way you could satisfy both it and this License would be to
+refrain entirely from distribution of the Program.
+
+If any portion of this section is held invalid or unenforceable under
+any particular circumstance, the balance of the section is intended to
+apply and the section as a whole is intended to apply in other
+circumstances.
+
+It is not the purpose of this section to induce you to infringe any
+patents or other property right claims or to contest validity of any
+such claims; this section has the sole purpose of protecting the
+integrity of the free software distribution system, which is
+implemented by public license practices.  Many people have made
+generous contributions to the wide range of software distributed
+through that system in reliance on consistent application of that
+system; it is up to the author/donor to decide if he or she is willing
+to distribute software through any other system and a licensee cannot
+impose that choice.
+
+This section is intended to make thoroughly clear what is believed to
+be a consequence of the rest of this License.
+
+  8. If the distribution and/or use of the Program is restricted in
+certain countries either by patents or by copyrighted interfaces, the
+original copyright holder who places the Program under this License
+may add an explicit geographical distribution limitation excluding
+those countries, so that distribution is permitted only in or among
+countries not thus excluded.  In such case, this License incorporates
+the limitation as if written in the body of this License.
+
+  9. The Free Software Foundation may publish revised and/or new versions
+of the General Public License from time to time.  Such new versions will
+be similar in spirit to the present version, but may differ in detail to
+address new problems or concerns.
+
+Each version is given a distinguishing version number.  If the Program
+specifies a version number of this License which applies to it and "any
+later version", you have the option of following the terms and conditions
+either of that version or of any later version published by the Free
+Software Foundation.  If the Program does not specify a version number of
+this License, you may choose any version ever published by the Free Software
+Foundation.
+
+  10. If you wish to incorporate parts of the Program into other free
+programs whose distribution conditions are different, write to the author
+to ask for permission.  For software which is copyrighted by the Free
+Software Foundation, write to the Free Software Foundation; we sometimes
+make exceptions for this.  Our decision will be guided by the two goals
+of preserving the free status of all derivatives of our free software and
+of promoting the sharing and reuse of software generally.
+
+			    NO WARRANTY
+
+  11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
+FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW.  EXCEPT WHEN
+OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
+PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
+OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.  THE ENTIRE RISK AS
+TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU.  SHOULD THE
+PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
+REPAIR OR CORRECTION.
+
+  12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
+REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
+INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
+OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
+TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
+YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
+PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGES.
+
+		     END OF TERMS AND CONDITIONS
+
+	Appendix: How to Apply These Terms to Your New Programs
+
+  If you develop a new program, and you want it to be of the greatest
+possible use to the public, the best way to achieve this is to make it
+free software which everyone can redistribute and change under these terms.
+
+  To do so, attach the following notices to the program.  It is safest
+to attach them to the start of each source file to most effectively
+convey the exclusion of warranty; and each file should have at least
+the "copyright" line and a pointer to where the full notice is found.
+
+    <one line to give the program's name and a brief idea of what it does.>
+    Copyright (C) 19yy  <name of author>
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License
+    along with this program; if not, write to the Free Software
+    Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+Also add information on how to contact you by electronic and paper mail.
+
+If the program is interactive, make it output a short notice like this
+when it starts in an interactive mode:
+
+    Gnomovision version 69, Copyright (C) 19yy name of author
+    Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
+    This is free software, and you are welcome to redistribute it
+    under certain conditions; type `show c' for details.
+
+The hypothetical commands `show w' and `show c' should show the appropriate
+parts of the General Public License.  Of course, the commands you use may
+be called something other than `show w' and `show c'; they could even be
+mouse-clicks or menu items--whatever suits your program.
+
+You should also get your employer (if you work as a programmer) or your
+school, if any, to sign a "copyright disclaimer" for the program, if
+necessary.  Here is a sample; alter the names:
+
+  Yoyodyne, Inc., hereby disclaims all copyright interest in the program
+  `Gnomovision' (which makes passes at compilers) written by James Hacker.
+
+  <signature of Ty Coon>, 1 April 1989
+  Ty Coon, President of Vice
+
+This General Public License does not permit incorporating your program into
+proprietary programs.  If your program is a subroutine library, you may
+consider it more useful to permit linking proprietary applications with the
+library.  If this is what you want to do, use the GNU Library General
+Public License instead of this License.
--- a/30
+++ b/30
@ -0,0 +1,30 @@
+
+CC = gcc
+SH = /bin/sh
+
+CFLAGS = -O3 -fomit-frame-pointer -funroll-loops -Wall -Winline -W
+
+
+
+all:
+	cat words0
+	$(CC) $(CFLAGS) -o bzip2 bzip2.c
+	$(CC) $(CFLAGS) -o bzip2recover bzip2recover.c
+	rm -f bunzip2
+	ln -s ./bzip2 ./bunzip2
+	cat words1
+	./bzip2 -1 < sample1.ref > sample1.rb2
+	./bzip2 -2 < sample2.ref > sample2.rb2
+	./bunzip2 < sample1.bz2 > sample1.tst
+	./bunzip2 < sample2.bz2 > sample2.tst
+	cat words2
+	cmp sample1.bz2 sample1.rb2 
+	cmp sample2.bz2 sample2.rb2
+	cmp sample1.tst sample1.ref
+	cmp sample2.tst sample2.ref
+	cat words3
+
+
+clean:
+	rm -f bzip2 bunzip2 bzip2recover sample*.tst sample*.rb2
+
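The self-test in the Makefile above relies on cmp to confirm that each regenerated file matches its reference byte for byte. On a system without cmp, the same check could be sketched in C roughly as follows. The function name files_identical is an assumption made for this example; it is not part of the distribution.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 if both files have identical contents, 0 if they differ,
   and -1 if either file cannot be opened.  A read error is treated
   like end-of-file, which is good enough for a sketch. */
int files_identical(const char *path_a, const char *path_b)
{
    FILE *a, *b;
    int ca, cb;
    assert(path_a != NULL && path_b != NULL);
    a = fopen(path_a, "rb");
    b = fopen(path_b, "rb");
    if (a == NULL || b == NULL) {
        if (a != NULL) fclose(a);
        if (b != NULL) fclose(b);
        return -1;
    }
    /* Read in lockstep until the bytes differ or both files end. */
    do {
        ca = getc(a);
        cb = getc(b);
    } while (ca == cb && ca != EOF);
    fclose(a);
    fclose(b);
    return ca == cb;
}
```

A call such as files_identical("sample1.tst", "sample1.ref") would then stand in for the corresponding cmp line.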
--- a/243
+++ b/243
@ -0,0 +1,243 @@
+
+GREETINGS!
+
+   This is the README for bzip2, my block-sorting file compressor,
+   version 0.1.  
+
+   bzip2 is distributed under the GNU General Public License version 2;
+   for details, see the file LICENSE.  Pointers to the algorithms used
+   are in ALGORITHMS.  Instructions for use are in bzip2.1.preformatted.
+
+   Please read this file carefully.
+
+
+
+HOW TO BUILD
+
+   -- for UNIX:
+
+        Type `make'.     (tough, huh? :-)
+
+        This creates the binaries "bzip2" and "bzip2recover", and
+        "bunzip2", which is a symbolic link to "bzip2".
+
+        It also runs four compress-decompress tests to make sure
+        things are working properly.  If all goes well, you should be up &
+        running.  Please be sure to read the output from `make'
+        just to be sure that the tests went ok.
+
+        To install bzip2 properly:
+
+           -- Copy the binary "bzip2" to a publicly visible place,
+              possibly /usr/bin, /usr/common/bin or /usr/local/bin.
+
+           -- In that directory, make "bunzip2" be a symbolic link
+              to "bzip2".
+
+           -- Copy the manual page, bzip2.1, to the relevant place.
+              Probably the right place is /usr/man/man1/.
+   
+   -- for Windows 95 and NT: 
+
+        For a start, do you *really* want to recompile bzip2?  
+        The standard distribution includes a pre-compiled version
+        for Windows 95 and NT, `bzip2.exe'.
+
+        This executable was created with Jacob Navia's excellent
+        port to Win32 of Chris Fraser & David Hanson's excellent
+        ANSI C compiler, "lcc".  You can get to it at the pages
+        of the CS department of Princeton University, 
+        www.cs.princeton.edu.  
+        I have not tried to compile this version of bzip2 with
+        a commercial C compiler such as MS Visual C, as I don't
+        have one available.
+
+        Note that lcc is designed primarily to be portable and
+        fast.  Code quality is a secondary aim, so bzip2.exe
+        runs perhaps 40% slower than it could if compiled with
+        a good optimising compiler.
+
+        I compiled a previous version of bzip (0.21) with Borland
+        C 5.0, which worked fine, and with MS VC++ 2.0, which
+        didn't.  Here is a comment from the README for bzip-0.21.
+
+           MS VC++ 2.0's optimising compiler has a bug which, at 
+           maximum optimisation, gives an executable which produces 
+           garbage compressed files.  Proceed with caution. 
+           I do not know whether or not this happens with later 
+           versions of VC++.
+
+           Edit the defines starting at line 86 of bzip.c to 
+           select your platform/compiler combination, and then compile.
+           Then check that the resulting executable (assumed to be 
+           called bzip.exe) works correctly, using the SELFTEST.BAT file.  
+           Bearing in mind the previous paragraph, the self-test is
+           important.
+
+        Note that the defines which bzip-0.21 had, to support 
+        compilation with VC 2.0 and BC 5.0, are gone.  Windows
+        is not my preferred operating system, and I am, for the
+        moment, content with the modestly fast executable created
+        by lcc-win32.
+
+   A manual page is supplied, unformatted (bzip2.1),
+   preformatted (bzip2.1.preformatted), and preformatted
+   and sanitised for MS-DOS (bzip2.txt).
+
+   
+
+COMPILATION NOTES
+
+   bzip2 should work on any 32 or 64-bit machine.  It is known to work
+   [meaning: it has compiled and passed self-tests] on the 
+   following platform-os combinations:
+
+      Intel i386/i486        running Linux 2.0.21
+      Sun Sparcs (various)   running SunOS 4.1.4 and Solaris 2.5
+      Intel i386/i486        running Windows 95 and NT
+      DEC Alpha              running Digital Unix 4.0
+
+   Following the release of bzip-0.21, many people mailed me
+   from around the world to say they had made it work on all sorts
+   of weird and wonderful machines.  Chances are, if you have
+   a reasonable ANSI C compiler and a 32-bit machine, you can
+   get it to work.
+
+   The #defines starting at around line 82 of bzip2.c supply some
+   degree of platform-independence.  If you configure bzip2 for some
+   new far-out platform which is not covered by the existing definitions,
+   please send me the relevant definitions.
+
+   I recommend GNU C for compilation.  The code is standard ANSI C,
+   except for the Unix-specific file handling, so any ANSI C compiler
+   should work.  Note however that the many routines marked INLINE
+   should be inlined by your compiler, else performance will be very
+   poor.  Asking your compiler to unroll loops gives some
+   small improvement too; for gcc, the relevant flag is
+   -funroll-loops.
+
+   On 386/486 machines, I'd recommend giving gcc the
+   -fomit-frame-pointer flag; this liberates another register for
+   allocation, which measurably improves performance.
+
+   I used the above-mentioned lcc compiler to develop bzip2.
+   I would highly recommend this compiler for day-to-day development;
+   it is fast, reliable, lightweight, has an excellent profiler,
+   and is generally excellent.  And it's fun to retarget, if you're
+   into that kind of thing.
+
+   If you compile bzip2 on a new platform or with a new compiler,
+   please be sure to run the four compress-decompress tests, either
+   using the Makefile, or with the test.bat (MSDOS) or test.cmd (OS/2)
+   files.  Some compilers have been seen to introduce subtle bugs
+   when optimising, so this check is important.  Ideally you should
+   then go on to test bzip2 on a file several megabytes or even
+   tens of megabytes long, just to be 110% sure.  ``Professional
+   programmers are paranoid programmers.'' (anon).
+
+
+
+VALIDATION
+
+   Correct operation, in the sense that a compressed file can always be
+   decompressed to reproduce the original, is obviously of paramount
+   importance.  To validate bzip2, I used a modified version of 
+   Mark Nelson's churn program.  Churn is an automated test driver
+   which recursively traverses a directory structure, using bzip2 to
+   compress and then decompress each file it encounters, and checking
+   that the decompressed data is the same as the original.  As test 
+   material, I used several runs over several filesystems of differing
+   sizes.
+
+   One set of tests was done on my base Linux filesystem,
+   410 megabytes in 23,000 files.  There were several runs over
+   this filesystem, in various configurations designed to break bzip2.
+   That filesystem also contained some specially constructed test
+   files designed to exercise boundary cases in the code.
+   This included files of zero length, various long, highly repetitive 
+   files, and some files which generate blocks with all values the same.
+
+   The other set of tests was done just with the "normal" configuration,
+   but on a much larger quantity of data.
+
+      Tests are:
+
+         Linux FS, 410M, 23000 files
+
+         As above, with --repetitive-fast
+
+         As above, with -1
+
+         Low level disk image of a disk containing
+            Windows NT4.0; 420M in a single huge file
+
+         Linux distribution, incl Slackware, 
+            all GNU sources.   1900M in 2300 files.
+
+         Approx 100M compiler sources and related
+            programming tools, running under Purify.
+
+         About 500M of data in 120 files of around
+            4 M each.  This is raw data from a 
+            biomagnetometer (SQUID-based thing).
+
+      Overall, total volume of test data is about
+         3300 megabytes in 25000 files.
+
+   The distribution does four tests after building bzip2.  These tests
+   include test decompressions of pre-supplied compressed files, so
+   they test not only that bzip2 works correctly on the machine it was
+   built on, but also that it can decompress files compressed on a different
+   machine.  This guards against unforeseen interoperability problems.
+
+
+Please read and be aware of the following:
+
+WARNING:
+
+   This program (attempts to) compress data by performing several
+   non-trivial transformations on it.  Unless you are 100% familiar
+   with *all* the algorithms contained herein, and with the
+   consequences of modifying them, you should NOT meddle with the
+   compression or decompression machinery.  Incorrect changes can and
+   very likely *will* lead to disastrous loss of data.
+
+
+DISCLAIMER:
+
+   I TAKE NO RESPONSIBILITY FOR ANY LOSS OF DATA ARISING FROM THE
+   USE OF THIS PROGRAM, HOWSOEVER CAUSED.
+
+   Every compression of a file implies an assumption that the
+   compressed file can be decompressed to reproduce the original.
+   Great efforts in design, coding and testing have been made to
+   ensure that this program works correctly.  However, the complexity
+   of the algorithms, and, in particular, the presence of various
+   special cases in the code which occur with very low but non-zero
+   probability make it impossible to rule out the possibility of bugs
+   remaining in the program.  DO NOT COMPRESS ANY DATA WITH THIS
+   PROGRAM UNLESS YOU ARE PREPARED TO ACCEPT THE POSSIBILITY, HOWEVER
+   SMALL, THAT THE DATA WILL NOT BE RECOVERABLE.
+
+   That is not to say this program is inherently unreliable.  Indeed,
+   I very much hope the opposite is true.  bzip2 has been carefully
+   constructed and extensively tested.
+
+End of nasty legalities.
+
+
+I hope you find bzip2 useful.  Feel free to contact me at
+   jseward@acm.org
+if you have any suggestions or queries.  Many people mailed me with
+comments, suggestions and patches after the releases of 0.15 and 0.21, 
+and the changes in bzip2 are largely a result of this feedback.
+I thank you for your comments.
+
+Julian Seward
+
+Manchester, UK
+18 July 1996 (version 0.15)
+25 August 1996 (version 0.21)
+
+Guildford, Surrey, UK
+7 August 1997 (bzip2, version 0.1)
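The property checked under VALIDATION in the README above is that compressing and then decompressing always reproduces the original. A minimal sketch of such a check follows. It assumes a buffer-to-buffer codec interface (codec_fn) that bzip2 itself does not expose, and it uses a toy run-length coder purely as a stand-in so that the example is self-contained. None of these names come from bzip2 or from the churn program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical codec interface: consumes n bytes from in, writes at most
   cap bytes to out, returns the output length or (size_t)-1 on failure. */
typedef size_t (*codec_fn)(const unsigned char *in, size_t n,
                           unsigned char *out, size_t cap);

/* Toy stand-in coder: emits (count, byte) pairs with count in 1..255. */
size_t toy_compress(const unsigned char *in, size_t n,
                    unsigned char *out, size_t cap)
{
    size_t i = 0, o = 0;
    while (i < n) {
        size_t run = 1;
        while (i + run < n && in[i + run] == in[i] && run < 255) run++;
        if (o + 2 > cap) return (size_t)-1;
        out[o++] = (unsigned char)run;
        out[o++] = in[i];
        i += run;
    }
    return o;
}

/* Inverse of toy_compress; rejects malformed input. */
size_t toy_decompress(const unsigned char *in, size_t n,
                      unsigned char *out, size_t cap)
{
    size_t i, o = 0;
    if (n % 2 != 0) return (size_t)-1;
    for (i = 0; i < n; i += 2) {
        size_t run = in[i];
        if (run == 0 || o + run > cap) return (size_t)-1;
        memset(out + o, in[i + 1], run);
        o += run;
    }
    return o;
}

/* Returns 1 if decomp(comp(data)) reproduces data exactly, else 0. */
int roundtrip_ok(codec_fn comp, codec_fn decomp,
                 const unsigned char *data, size_t n)
{
    size_t cap = 2 * n + 2;              /* worst case for the toy coder */
    unsigned char *packed = malloc(cap);
    unsigned char *unpacked = malloc(n + 1);
    int ok = 0;
    assert(data != NULL);
    if (packed != NULL && unpacked != NULL) {
        size_t pn = comp(data, n, packed, cap);
        if (pn != (size_t)-1) {
            size_t un = decomp(packed, pn, unpacked, n + 1);
            ok = (un == n) && memcmp(unpacked, data, n) == 0;
        }
    }
    free(packed);
    free(unpacked);
    return ok;
}
```

A driver in the spirit of churn would call roundtrip_ok once per file, including the boundary cases the README mentions: zero-length input and long runs of a single value.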
--- a/README.DOS
+++ b/README.DOS
@ -0,0 +1,20 @@
+
+Windows 95 & Windows NT users:
+
+1.  There's a pre-built executable, bzip2.exe, which
+    should work.  You don't need to compile anything.
+    You can run the `test.bat' batch file to check
+    the executable is working ok, if you want.
+
+2.  The control-C signal catcher seems pretty dodgy
+    under Windows, at least for the executable supplied.
+    When it catches a control-C, bzip2 tries to delete
+    its output file, so you don't get left with a half-
+    baked file.  But this sometimes seems to fail
+    under Windows.  Caveat Emptor!  I think I am doing
+    something not-quite-right in the signal catching.
+    Windows-&-C gurus got any suggestions?
+
+    Control-C handling all seems to work fine under Unix.
+
+7 Aug 97
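One conservative way to arrange the control-C cleanup described in README.DOS above is to have the handler do nothing except set a flag, and to delete the partial output from ordinary code. The sketch below illustrates that pattern only. It is not the handler in bzip2.c, and the names on_sigint, install_sigint_handler and cleanup_if_interrupted are invented for the example.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>

/* Set by the handler, polled by the main loop. */
static volatile sig_atomic_t interrupted = 0;

/* The handler only records that control-C arrived. */
static void on_sigint(int signum)
{
    (void)signum;
    interrupted = 1;
}

/* Installs the handler; returns 0 on success, -1 on failure. */
int install_sigint_handler(void)
{
    return signal(SIGINT, on_sigint) == SIG_ERR ? -1 : 0;
}

/* Intended to be called between units of work.  If control-C was seen,
   closes the output stream, deletes the half-written file and returns 1;
   otherwise returns 0 and leaves everything untouched. */
int cleanup_if_interrupted(FILE *out, const char *out_path)
{
    assert(out_path != NULL);
    if (!interrupted) return 0;
    if (out != NULL) fclose(out);
    remove(out_path);
    return 1;
}
```

The file is closed before remove() is called, since some systems will not delete a file that is still open. Whether this has any bearing on the failures seen with the supplied Windows executable is not something this sketch can settle.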
--- a/bzip2.1
+++ b/bzip2.1
@ -0,0 +1,441 @@
+.PU
+.TH bzip2 1
+.SH NAME
+bzip2, bunzip2 \- a block-sorting file compressor, v0.1
+.br
+bzip2recover \- recovers data from damaged bzip2 files
+
+.SH SYNOPSIS
+.ll +8
+.B bzip2
+.RB [ " \-cdfkstvVL123456789 " ]
+[
+.I "filenames \&..."
+]
+.ll -8
+.br
+.B bunzip2
+.RB [ " \-kvsVL " ]
+[
+.I "filenames \&..."
+]
+.br
+.B bzip2recover
+.I "filename"
+
+.SH DESCRIPTION
+.I Bzip2
+compresses files using the Burrows-Wheeler block-sorting 
+text compression algorithm, and Huffman coding.
+Compression is generally considerably
+better than that 
+achieved by more conventional LZ77/LZ78-based compressors,
+and approaches the performance of the PPM family of statistical
+compressors.
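
As a rough illustration of the block-sorting idea named above, here is a naive Python sketch of the Burrows-Wheeler transform and its inverse. It sorts whole rotations, which is far too slow for real blocks and is not claimed to be how bzip2 sorts; the names `bwt` and `inverse_bwt` are invented for this example.

```python
def bwt(data: bytes):
    """Naive Burrows-Wheeler transform.

    Sorts every rotation of `data` and returns the last column of the
    sorted rotation matrix, plus the row where the original string landed.
    """
    n = len(data)
    rotations = sorted(range(n), key=lambda i: data[i:] + data[:i])
    last_column = bytes(data[(i - 1) % n] for i in rotations)
    return last_column, rotations.index(0)


def inverse_bwt(last_column: bytes, primary: int) -> bytes:
    """Undo bwt() given the last column and the original row index."""
    n = len(last_column)
    # A stable sort of positions by byte value links each entry of the
    # (implicit) first column to the matching entry of the last column.
    order = sorted(range(n), key=lambda i: last_column[i])
    out = bytearray()
    row = primary
    for _ in range(n):
        row = order[row]
        out.append(last_column[row])
    return bytes(out)


# bwt(b"banana") gives (b"nnbaaa", 3): equal symbols end up adjacent.
```

The transformed block tends to contain runs of identical symbols, which is what makes the later coding stages effective.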
+
+The command-line options are deliberately very similar to 
+those of 
+.I GNU Gzip,
+but they are not identical.
+
+.I Bzip2 
+expects a list of file names to accompany the command-line flags.  
+Each file is replaced by a compressed version of itself,
+with the name "original_name.bz2".
+Each compressed file has the same modification date and permissions
+as the corresponding original, so that these properties can be 
+correctly restored at decompression time.  File name handling is
+naive in the sense that there is no mechanism for preserving
+original file names, permissions and dates in filesystems 
+which lack these concepts, or have serious file name length
+restrictions, such as MS-DOS.
+
+.I Bzip2
+and
+.I bunzip2
+will not overwrite existing files; if you want this to happen,
+you should delete them first.
+
+If no file names are specified,
+.I bzip2
+compresses from standard input to standard output.
+In this case,
+.I bzip2
+will decline to write compressed output to a terminal, as
+this would be entirely incomprehensible and therefore pointless.
+
+.I Bunzip2
+(or
+.I bzip2 \-d
+) decompresses and restores all specified files whose names
+end in ".bz2".
+Files without this suffix are ignored.  
+Again, supplying no filenames
+causes decompression from standard input to standard output.
+
+You can also compress or decompress files to
+the standard output by giving the \-c flag.
+You can decompress multiple files like this, but you may
+only compress a single file this way, since it would otherwise
+be difficult to separate out the compressed representations of
+the original files.
+
+Compression is always performed, even if the compressed file is
+slightly larger than the original.  Files of less than about
+one hundred bytes tend to get larger, since the compression 
+mechanism has a constant overhead in the region of 50 bytes.
+Random data (including the output of most file compressors)
+is coded at about 8.05 bits per byte, giving an expansion of 
+around 0.5%.
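
The arithmetic behind these figures can be sketched as follows. This is only an estimate built from the two numbers quoted above (about 50 bytes of fixed overhead, about 8.05 bits per byte for random data); the function names are hypothetical, and taking 8.05 at face value gives roughly 0.6%, in the same region as the figure quoted.

```python
OVERHEAD_BYTES = 50      # "in the region of 50 bytes", per the text above
RANDOM_BITS_PER_BYTE = 8.05  # coding cost quoted for random data


def expansion_percent(bits_per_byte: float) -> float:
    """Percentage growth when each input byte costs `bits_per_byte` bits."""
    return (bits_per_byte / 8.0 - 1.0) * 100.0


def estimated_incompressible_size(input_bytes: int) -> int:
    """Rough output size for incompressible input (an estimate only)."""
    return round(OVERHEAD_BYTES + input_bytes * RANDOM_BITS_PER_BYTE / 8.0)


growth = expansion_percent(RANDOM_BITS_PER_BYTE)      # about 0.625
one_megabyte = estimated_incompressible_size(1_000_000)
```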
+
+As a self-check for your protection,
+.I bzip2
+uses 32-bit CRCs to make sure that the decompressed
+version of a file is identical to the original.  
+This guards against corruption of the compressed data,
+and against undetected bugs in
+.I bzip2
+(hopefully very unlikely).
+The chance of data corruption going undetected is 
+microscopic, about one in four billion
+for each file processed.  Be aware, though, that the check
+occurs upon decompression, so it can only tell you that
+something is wrong.  It can't help you recover the
+original uncompressed data.
+You can use
+.I bzip2recover
+to try to recover data from damaged files.
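
To make the self-check concrete, here is a small Python sketch of the general idea: store a 32-bit CRC of the original data and compare it after decompression. It uses `zlib.crc32` from the Python standard library purely for illustration; the exact CRC variant used inside bzip2 may well differ, and `crc_matches` is a made-up helper.

```python
import zlib


def crc_matches(original: bytes, recovered: bytes) -> bool:
    """True if both byte strings have the same 32-bit CRC."""
    return zlib.crc32(original) == zlib.crc32(recovered)


data = b"example payload " * 64

# Simulate damage by flipping a single bit somewhere in the data.
damaged = bytearray(data)
damaged[100] ^= 0x01

intact_ok = crc_matches(data, data)               # True
damage_seen = not crc_matches(data, bytes(damaged))  # True: a CRC catches any single-bit error

# A 32-bit check value has 2**32 possible values, hence the
# "one in four billion" odds for random corruption slipping through.
UNDETECTED_ODDS = 2 ** 32
```

Note that the comparison only reports a mismatch; as the text says, it gives no means of repairing the data.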
+
+Return values: 
+0 for a normal exit, 
+1 for environmental
+problems (file not found, invalid flags, I/O errors, &c),
+2 to indicate a corrupt compressed file,
+3 for an internal consistency error (eg, bug) which caused
+.I bzip2 
+to panic.
+
+.SH MEMORY MANAGEMENT
+.I Bzip2
+compresses large files in blocks.  The block size affects both the 
+compression ratio achieved, and the amount of memory needed both for
+compression and decompression.  The flags \-1 through \-9
+specify the block size to be 100,000 bytes through 900,000 bytes
+(the default) respectively.  At decompression-time, the block size used for
+compression is read from the header of the compressed file, and
+.I bunzip2
+then allocates itself just enough memory to decompress the file.
+Since block sizes are stored in compressed files, it follows that the flags
+\-1 to \-9
+are irrelevant to and so ignored during decompression.
+Compression and decompression requirements, in bytes, can be estimated as:
+
+      Compression:   400k + ( 7 x block size )
+
+      Decompression: 100k + ( 5 x block size ), or
+.br
+                     100k + ( 2.5 x block size )
+
+Larger block sizes give rapidly diminishing marginal returns; most
+of the 
+compression comes from the first two or three hundred k of block size,
+a fact worth bearing in mind when using 
+.I bzip2
+on small machines.  It is also important to appreciate that the
+decompression memory requirement is set at compression-time by the
+choice of block size.
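
The estimates above can be written out as a short Python sketch. It assumes "k" means 1000 bytes and that flag \-N selects a block of N x 100,000 bytes, as stated earlier; the function names are hypothetical.

```python
K = 1000  # assumption: the "k" in the formulas means 1000 bytes


def block_size(level: int) -> int:
    """Block size in bytes for flags -1 .. -9."""
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    return level * 100 * K


def compress_memory(level: int) -> int:
    """Estimated bytes needed to compress: 400k + 7 x block size."""
    return 400 * K + 7 * block_size(level)


def decompress_memory(level: int, small: bool = False) -> int:
    """Estimated bytes needed to decompress.

    Normal mode:  100k + 5 x block size
    With -s:      100k + 2.5 x block size
    """
    if small:
        return 100 * K + block_size(level) * 5 // 2
    return 100 * K + 5 * block_size(level)
```

For the default \-9 this gives 6700k to compress, 4600k to decompress and 2350k to decompress with \-s.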
+
+For files compressed with the default 900k block size, 
+.I bunzip2
+will require about 4600 kbytes to decompress.
+To support decompression of any file on a 4 megabyte machine,
+.I bunzip2
+has an option to decompress using approximately half this
+amount of memory, about 2300 kbytes.  Decompression speed is
+also halved, so you should use this option only where necessary.
+The relevant flag is \-s.
+
+In general, try to use the largest block size
+memory constraints allow, since that maximises the compression
+achieved.  Compression and decompression
+speed are virtually unaffected by block size.
+
+Another significant point applies to files which fit in a single
+block -- that means most files you'd encounter using a large 
+block size.  The amount of real memory touched is proportional
+to the size of the file, since the file is smaller than a block.
+For example, compressing a file 20,000 bytes long with the flag
+\-9
+will cause the compressor to allocate around
+6700k of memory, but only touch 400k + 20000 * 7 = 540
+kbytes of it.  Similarly, the decompressor will allocate 4600k but
+only touch 100k + 20000 * 5 = 200 kbytes.
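
The worked example above generalises as follows. This Python sketch assumes the touched memory follows the same formulas as the allocation estimates, with the block size replaced by the file size whenever the file fits in one block; `touched_memory` is a name invented for the illustration.

```python
def touched_memory(file_size: int, level: int = 9) -> dict:
    """Estimate bytes of memory actually touched for a single file.

    The working set is governed by the smaller of the file size and
    the block size selected by `level` (flags -1 .. -9).
    """
    block = level * 100_000
    n = min(file_size, block)
    return {
        "compress": 400_000 + 7 * n,
        "decompress": 100_000 + 5 * n,
    }


small_file = touched_memory(20_000)
# small_file == {"compress": 540000, "decompress": 200000}
```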
+
+Here is a table which summarises the maximum memory usage for 
+different block sizes.  Also recorded is the total compressed
+size for 14 files of the Calgary Text Compression Corpus
+totalling 3,141,622 bytes.  This column gives some feel for how
+compression varies with block size.  These figures tend to understate
+the advantage of larger block sizes for larger files, since the
+Corpus is dominated by smaller files.
+
+           Compress   Decompress   Decompress   Corpus
+    Flag     usage      usage       -s usage     Size
+
+     -1      1100k       600k         350k      914704
+     -2      1800k      1100k         600k      877703
+     -3      2500k      1600k         850k      860338
+     -4      3200k      2100k        1100k      846899
+     -5      3900k      2600k        1350k      845160
+     -6      4600k      3100k        1600k      838626
+     -7      5400k      3600k        1850k      834096
+     -8      6000k      4100k        2100k      828642
+     -9      6700k      4600k        2350k      828642
+
+.SH OPTIONS
+.TP
+.B \-c  --stdout
+Compress or decompress to standard output.  \-c will decompress
+multiple files to stdout, but will only compress a single file to
+stdout.
+.TP
+.B \-d --decompress
+Force decompression.
+.I Bzip2
+and
+.I bunzip2
+are really the same program, and the decision about whether to
+compress or decompress is done on the basis of which name is
+used.  This flag overrides that mechanism, and forces
+.I bzip2
+to decompress.
+.TP 
+.B \-f --compress
+The complement to \-d: forces compression, regardless of the invocation
+name.
+.TP
+.B \-t --test
+Check integrity of the specified file(s), but don't decompress them.
+This really performs a trial decompression and throws away the result,
+using the low-memory decompression algorithm (see \-s).
+.TP
+.B \-k --keep
+Keep (don't delete) input files during compression or decompression.
+.TP
+.B \-s --small
+Reduce memory usage, both for compression and decompression.
+Files are decompressed using a modified algorithm which only
+requires 2.5 bytes per block byte.  This means any file can be
+decompressed in 2300k of memory, albeit somewhat more slowly than
+usual.
+
+During compression, -s selects a block size of 200k, which limits
+memory use to around the same figure, at the expense of your
+compression ratio.  In short, if your machine is low on memory
+(8 megabytes or less), use -s for everything.  See
+MEMORY MANAGEMENT above.
+
+.TP
+.B \-v --verbose
+Verbose mode -- show the compression ratio for each file processed.
+Further \-v's increase the verbosity level, spewing out lots of
+information which is primarily of interest for diagnostic purposes.
+.TP
+.B \-L --license
+Display the software version, license terms and conditions.
+.TP
+.B \-V --version
+Same as \-L.
+.TP
+.B \-1 to \-9 
+Set the block size to 100 k, 200 k .. 900 k when
+compressing.  Has no effect when decompressing.
+See MEMORY MANAGEMENT above.
+.TP
+.B \--repetitive-fast
+.I bzip2
+injects some small pseudo-random variations
+into very repetitive blocks to limit
+worst-case performance during compression.
+If sorting runs into difficulties, the block
+is randomised, and sorting is restarted.  
+Very roughly, 
+.I bzip2
+persists for three times as long as a well-behaved input
+would take before resorting to randomisation.
+This flag makes it give up much sooner.
+
+.TP
+.B \--repetitive-best
+Opposite of \--repetitive-fast; try a lot harder before 
+resorting to randomisation.
+
+.SH RECOVERING DATA FROM DAMAGED FILES
+.I bzip2
+compresses files in blocks, usually 900kbytes long.
+Each block is handled independently.  If a media or
+transmission error causes a multi-block .bz2 
+file to become damaged,
+it may be possible to recover data from the undamaged blocks
+in the file.  
+
+The compressed representation of each block is delimited by
+a 48-bit pattern, which makes it possible to find the block
+boundaries with reasonable certainty.  Each block also carries
+its own 32-bit CRC, so damaged blocks can be
+distinguished from undamaged ones.
+
+.I bzip2recover
+is a simple program whose purpose is to search for 
+blocks in .bz2 files, and write each block out into
+its own .bz2 file.  You can then use
+.I bzip2 -t
+to test the integrity of the resulting files, 
+and decompress those which are undamaged.
+
+.I bzip2recover
+takes a single argument, the name of the damaged file,
+and writes a number of files "rec0001file.bz2", "rec0002file.bz2",
+etc, containing the extracted blocks.  The output filenames
+are designed so that the use of wildcards in subsequent processing
+-- for example, "bzip2 -dc rec*file.bz2 > recovered_data" --
+lists the files in the "right" order.
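
Why the zero-padded numbering matters can be shown with a few lines of Python. The naming pattern here is inferred only from the two example names above (a four-digit block number between "rec" and the original file name); `recovered_name` is a hypothetical helper, not part of bzip2recover.

```python
def recovered_name(block_number: int, damaged_file: str) -> str:
    """Build an output name in the style of the examples above."""
    return f"rec{block_number:04d}{damaged_file}"


padded = [recovered_name(n, "file.bz2") for n in range(1, 13)]
unpadded = [f"rec{n}file.bz2" for n in range(1, 13)]

# Wildcards expand in sorted (lexicographic) order.  With padding,
# that matches block order; without it, block 10 sorts ahead of block 2.
padded_in_order = sorted(padded) == padded
unpadded_in_order = sorted(unpadded) == unpadded
```

Here `padded_in_order` is true and `unpadded_in_order` is false, which is the property the "right" order remark relies on.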
+
+.I bzip2recover
+should be of most use dealing with large .bz2 files, as
+these will contain many blocks.  It is clearly futile to
+use it on damaged single-block files, since a damaged
+block cannot be recovered.  If you wish to minimise 
+any potential data loss through media or transmission
+errors, you might consider compressing with a smaller
+block size.
+
+.SH PERFORMANCE NOTES
+The sorting phase of compression gathers together similar strings
+in the file.  Because of this, files containing very long 
+runs of repeated symbols, like "aabaabaabaab ..." (repeated
+several hundred times) may compress extraordinarily slowly.
+You can use the
+\-vvvvv 
+option to monitor progress in great detail, if you want.
+Decompression speed is unaffected.
+
+Such pathological cases
+seem rare in practice, appearing mostly in artificially-constructed
+test files, and in low-level disk images.  It may be inadvisable to
+use 
+.I bzip2
+to compress the latter.  
+If you do get a file which causes severe slowness in compression,
+try making the block size as small as possible, with flag \-1.
+
+Incompressible or virtually-incompressible data may decompress
+rather more slowly than one would hope.  This is due to 
+a naive implementation of the move-to-front coder.
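
For readers unfamiliar with the technique, a textbook move-to-front coder looks roughly like the Python sketch below. It is a generic illustration, not bzip2's code. With incompressible data the symbols are spread evenly, so each lookup tends to walk deep into the table, which is one plausible reading of why a naive version is slow on such input.

```python
def mtf_encode(data: bytes) -> list:
    """Replace each byte by its current position in a recency list."""
    table = list(range(256))
    out = []
    for b in data:
        i = table.index(b)      # linear search: the "naive" part
        out.append(i)
        table.pop(i)
        table.insert(0, b)      # most recently seen byte moves to the front
    return out


def mtf_decode(indices) -> bytes:
    """Invert mtf_encode()."""
    table = list(range(256))
    out = bytearray()
    for i in indices:
        b = table.pop(i)
        out.append(b)
        table.insert(0, b)
    return bytes(out)


# mtf_encode(b"aaab") gives [97, 0, 0, 98]: repeats become zeros.
```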
+
+.I bzip2
+usually allocates several megabytes of memory to operate in,
+and then charges all over it in a fairly random fashion.  This
+means that performance, both for compressing and decompressing,
+is largely determined by the speed
+at which your machine can service cache misses.  
+Because of this, small changes
+to the code to reduce the miss rate have been observed to give
+disproportionately large performance improvements.
+I imagine 
+.I bzip2
+will perform best on machines with very large caches.
+
+Test mode (\-t) uses the low-memory decompression algorithm
+(\-s).  This means test mode does not run as fast as it could;
+it could run as fast as the normal decompression machinery.
+This could easily be fixed at the cost of some code bloat.
+
+.SH CAVEATS
+I/O error messages are not as helpful as they could be.
+.I Bzip2
+tries hard to detect I/O errors and exit cleanly, but the
+details of what the problem is sometimes seem rather misleading.
+
+This manual page pertains to version 0.1 of 
+.I bzip2.  
+It may well happen that some future version will
+use a different compressed file format.  If you try to 
+decompress, using 0.1, a .bz2 file created with some
+future version which uses a different compressed file format,
+0.1 will complain that your file "is not a bzip2 file".
+If that happens, you should obtain a more recent version
+of 
+.I bzip2
+and use that to decompress the file.
+
+Wildcard expansion for Windows 95 and NT 
+is flaky.
+
+.I bzip2recover
+uses 32-bit integers to represent bit positions in
+compressed files, so it cannot handle compressed files
+more than 512 megabytes long.  This could easily be fixed.
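
The 512 megabyte figure follows directly from the integer width. A minimal Python check, assuming the bit positions are held as unsigned 32-bit values and that a megabyte is 2**20 bytes (`max_file_bytes` is a hypothetical helper):

```python
def max_file_bytes(index_bits: int) -> int:
    """Largest file size, in bytes, whose bit positions fit in `index_bits` bits."""
    addressable_bits = 2 ** index_bits
    return addressable_bits // 8   # eight bit positions per byte


limit_bytes = max_file_bytes(32)            # 536870912
limit_megabytes = limit_bytes // 2 ** 20    # 512
```

Widening the counter to 64 bits would lift the limit far beyond any practical file size.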
+
+.I bzip2recover
+sometimes reports a very small, incomplete final block.
+This is spurious and can be safely ignored.
+
+.SH RELATIONSHIP TO bzip-0.21
+This program is a descendant of the 
+.I bzip
+program, version 0.21, which I released in August 1996.  
+The primary difference of
+.I bzip2
+is its avoidance of the possibly patented algorithms
+which were used in 0.21.  
+.I bzip2
+also brings various useful refinements (\-s, \-t),
+uses less memory, decompresses significantly faster, and
+has support for recovering data from damaged files.
+
+Because
+.I bzip2
+uses Huffman coding to construct the compressed bitstream,
+rather than the arithmetic coding used in 0.21,
+the compressed representations generated by the two programs
+are incompatible, and they will not interoperate.  The change
+in suffix from .bz to .bz2 reflects this.  It would have been
+helpful to at least allow
+.I bzip2
+to decompress files created by 0.21, but this would
+defeat the primary aim of having a patent-free compressor.
+
+Huffman coding necessarily involves some coding inefficiency
+compared to arithmetic coding.  This means that
+.I bzip2
+compresses about 1% worse than 0.21, an unfortunate but
+unavoidable fact-of-life.  On the other hand, decompression
+is approximately 50% faster for the same reason, and the
+change in file format gave an opportunity to add data-recovery
+features.  So it is not all bad.
+
+.SH AUTHOR
+Julian Seward, jseward@acm.org.
+
+The ideas embodied in 
+.I bzip
+and
+.I bzip2
+are due to (at least) the following people:
+Michael Burrows and David Wheeler (for the block sorting
+transformation), David Wheeler (again, for the Huffman coder),
+Peter Fenwick (for the structured coding model in 0.21, 
+and many refinements),
+and
+Alistair Moffat, Radford Neal and Ian Witten (for the arithmetic
+coder in 0.21).  I am much indebted for their help, support and advice.
+See the file ALGORITHMS in the source distribution for pointers to
+sources of documentation.
+Christian von Roques encouraged me to look for faster
+sorting algorithms, so as to speed up compression.
+Bela Lubkin encouraged me to improve the worst-case
+compression performance.
+Many people sent patches, helped with portability problems,
+lent machines, gave advice and were generally helpful.
+
--- a/bzip2.1.preformatted
+++ b/bzip2.1.preformatted
@ -0,0 +1,462 @@
+
+
+
+bzip2(1)                                                 bzip2(1)
+
+
+NAME
+       bzip2, bunzip2 - a block-sorting file compressor, v0.1
+       bzip2recover - recovers data from damaged bzip2 files
+
+
+SYNOPSIS
+       bzip2 [ -cdfkstvVL123456789 ] [ filenames ...  ]
+       bunzip2 [ -kvsVL ] [ filenames ...  ]
+       bzip2recover filename
+
+
+DESCRIPTION
+       Bzip2  compresses  files  using the Burrows-Wheeler block-
+       sorting text compression algorithm,  and  Huffman  coding.
+       Compression  is  generally  considerably  better than that
+       achieved by more conventional LZ77/LZ78-based compressors,
+       and  approaches  the performance of the PPM family of sta-
+       tistical compressors.
+
+       The command-line options are deliberately very similar  to
+       those of GNU Gzip, but they are not identical.
+
+       Bzip2  expects  a list of file names to accompany the com-
+       mand-line flags.  Each file is replaced  by  a  compressed
+       version  of  itself,  with  the  name "original_name.bz2".
+       Each compressed file has the same  modification  date  and
+       permissions  as  the corresponding original, so that these
+       properties can  be  correctly  restored  at  decompression
+       time.  File name handling is naive in the sense that there
+       is no mechanism for preserving original file  names,  per-
+       missions  and  dates  in filesystems which lack these con-
+       cepts, or have serious file name length restrictions, such
+       as MS-DOS.
+
+       Bzip2  and  bunzip2  will not overwrite existing files; if
+       you want this to happen, you should delete them first.
+
+       If no file names  are  specified,  bzip2  compresses  from
+       standard  input  to  standard output.  In this case, bzip2
+       will decline to write compressed output to a terminal,  as
+       this  would  be  entirely  incomprehensible  and therefore
+       pointless.
+
+       Bunzip2 (or bzip2 -d ) decompresses and restores all spec-
+       ified files whose names end in ".bz2".  Files without this
+       suffix are ignored.  Again, supplying no filenames  causes
+       decompression from standard input to standard output.
+
+       You  can also compress or decompress files to the standard
+       output by giving the -c flag.  You can decompress multiple
+       files  like  this, but you may only compress a single file
+       this way, since it would otherwise be difficult  to  sepa-
+       rate  out  the  compressed representations of the original
+       files.
+
+       Compression is always performed, even  if  the  compressed
+       file  is slightly larger than the original.  Files of less
+       than about one hundred bytes tend to get larger, since the
+       compression  mechanism  has  a  constant  overhead  in the
+       region of 50 bytes.  Random data (including the output  of
+       most  file  compressors)  is  coded at about 8.05 bits per
+       byte, giving an expansion of around 0.5%.
+
+       As a self-check for your  protection,  bzip2  uses  32-bit
+       CRCs  to make sure that the decompressed version of a file
+       is identical to the original.  This guards against corrup-
+       tion  of  the compressed data, and against undetected bugs
+       in bzip2 (hopefully very unlikely).  The  chance  of  data
+       corruption  going  undetected  is  microscopic,  about one
+       in  four  billion  for  each  file  processed.   Be aware,
+       though,  that  the  check occurs upon decompression, so it
+       can only tell  you  that  something  is  wrong.  It  can't
+       help  you recover the original uncompressed data.  You can
+       use bzip2recover to  try  to  recover  data  from  damaged
+       files.
+
+       Return  values:  0  for a normal exit, 1 for environmental
+       problems (file not found, invalid flags, I/O errors,  &c),
+       2 to indicate a corrupt compressed file, 3 for an internal
+       consistency error (eg, bug) which caused bzip2 to panic.
+
+
+MEMORY MANAGEMENT
+       Bzip2 compresses large files in blocks.   The  block  size
+       affects  both  the  compression  ratio  achieved,  and the
+       amount of memory needed both for  compression  and  decom-
+       pression.   The flags -1 through -9 specify the block size
+       to be 100,000 bytes through 900,000  bytes  (the  default)
+       respectively.   At decompression-time, the block size used
+       for compression is read from the header of the  compressed
+       file, and bunzip2 then allocates itself just enough memory
+       to decompress the file.  Since block sizes are  stored  in
+       compressed  files,  it follows that the flags -1 to -9 are
+       irrelevant to and so ignored during  decompression.   Com-
+       pression  and decompression requirements, in bytes, can be
+       estimated as:
+
+             Compression:   400k + ( 7 x block size )
+
+             Decompression: 100k + ( 5 x block size ), or
+                            100k + ( 2.5 x block size )
+
+       Larger  block  sizes  give  rapidly  diminishing  marginal
+       returns;  most of the compression comes from the first two
+       or three hundred k of block size, a fact worth bearing  in
+       mind  when  using  bzip2  on  small  machines.  It is also
+       important to  appreciate  that  the  decompression  memory
+       requirement  is  set  at compression-time by the choice of
+       block size.
+
+       For files compressed with the  default  900k  block  size,
+       bunzip2  will require about 4600 kbytes to decompress.  To
+       support decompression of any file on a 4 megabyte machine,
+       bunzip2  has  an  option to decompress using approximately
+       half this amount of memory, about 2300 kbytes.  Decompres-
+       sion  speed  is also halved, so you should use this option
+       only where necessary.  The relevant flag is -s.
+
+       In general, try to use the largest block size memory  con-
+       straints  allow,  since  that  maximises  the  compression
+       achieved.  Compression and decompression speed are  virtu-
+       ally unaffected by block size.
+
+       Another  significant point applies to files which fit in a
+       single block -- that  means  most  files  you'd  encounter
+       using  a  large  block  size.   The  amount of real memory
+       touched is proportional to the size of the file, since the
+       file  is smaller than a block.  For example, compressing a
+       file 20,000 bytes long with the flag  -9  will  cause  the
+       compressor  to  allocate  around 6700k of memory, but only
+       touch 400k + 20000 * 7 = 540 kbytes of it.  Similarly, the
+       decompressor  will  allocate  4600k  but only touch 100k +
+       20000 * 5 = 200 kbytes.
+
+       Here is a table which summarises the maximum memory  usage
+       for  different  block  sizes.   Also recorded is the total
+       compressed size for 14 files of the Calgary Text  Compres-
+       sion  Corpus totalling 3,141,622 bytes.  This column gives
+       some feel for how  compression  varies  with  block  size.
+       These  figures  tend to understate the advantage of larger
+       block sizes for larger files, since the  Corpus  is  domi-
+       nated by smaller files.
+
+                  Compress   Decompress   Decompress   Corpus
+           Flag     usage      usage       -s usage     Size
+
+            -1      1100k       600k         350k      914704
+            -2      1800k      1100k         600k      877703
+            -3      2500k      1600k         850k      860338
+            -4      3200k      2100k        1100k      846899
+            -5      3900k      2600k        1350k      845160
+            -6      4600k      3100k        1600k      838626
+            -7      5400k      3600k        1850k      834096
+            -8      6000k      4100k        2100k      828642
+            -9      6700k      4600k        2350k      828642
+
+
+OPTIONS
+       -c --stdout
+              Compress or decompress to standard output.  -c will
+              decompress multiple files to stdout, but will  only
+              compress a single file to stdout.
+
+       -d --decompress
+              Force  decompression.  Bzip2 and bunzip2 are really
+              the same program, and the decision about whether to
+              compress  or  decompress  is  done  on the basis of
+              which name is used.  This flag overrides that mech-
+              anism, and forces bzip2 to decompress.
+
+       -f --compress
+              The  complement  to -d: forces compression, regard-
+              less of the invocation name.
+
+       -t --test
+              Check integrity of the specified file(s), but don't
+              decompress  them.   This  really  performs  a trial
+              decompression and throws away the result, using the
+              low-memory decompression algorithm (see -s).
+
+       -k --keep
+              Keep  (don't delete) input files during compression
+              or decompression.
+
+       -s --small
+              Reduce  memory  usage,  both  for  compression  and
+              decompression.  Files are decompressed using a mod-
+              ified algorithm which only requires 2.5  bytes  per
+              block  byte.   This  means  any  file can be decom-
+              pressed in 2300k of memory,  albeit  somewhat  more
+              slowly than usual.
+
+              During  compression,  -s  selects  a  block size of
+              200k, which limits memory use to  around  the  same
+              figure,  at  the expense of your compression ratio.
+              In short, if your  machine  is  low  on  memory  (8
+              megabytes  or  less),  use  -s for everything.  See
+              MEMORY MANAGEMENT above.
+
+
+       -v --verbose
+              Verbose mode -- show the compression ratio for each
+              file  processed.   Further  -v's  increase the ver-
+              bosity level, spewing out lots of information which
+              is primarily of interest for diagnostic purposes.
+
+       -L --license
+              Display  the  software  version,  license terms and
+              conditions.
+
+       -V --version
+              Same as -L.
+
+       -1 to -9
+              Set the block size to 100 k, 200 k ..  900  k  when
+              compressing.   Has  no  effect  when decompressing.
+              See MEMORY MANAGEMENT above.
+
+       --repetitive-fast
+              bzip2 injects some small  pseudo-random  variations
+              into  very  repetitive  blocks  to limit worst-case
+              performance during compression.   If  sorting  runs
+              into  difficulties,  the  block  is randomised, and
+              sorting is restarted.  Very roughly, bzip2 persists
+              for  three  times  as  long as a well-behaved input
+              would take before resorting to randomisation.  This
+              flag makes it give up much sooner.
+
+
+       --repetitive-best
+              Opposite  of  --repetitive-fast;  try  a lot harder
+              before resorting to randomisation.
+
+
+RECOVERING DATA FROM DAMAGED FILES
+       bzip2 compresses files in blocks, usually 900kbytes  long.
+       Each block is handled independently.  If a media or trans-
+       mission error causes a multi-block  .bz2  file  to  become
+       damaged,  it  may  be  possible  to  recover data from the
+       undamaged blocks in the file.
+
+       The compressed representation of each block  is  delimited
+       by  a  48-bit pattern, which makes it possible to find the
+       block boundaries with reasonable  certainty.   Each  block
+       also  carries its own 32-bit CRC, so damaged blocks can be
+       distinguished from undamaged ones.
+
+       bzip2recover is a  simple  program  whose  purpose  is  to
+       search  for blocks in .bz2 files, and write each block out
+       into its own .bz2 file.  You can then use bzip2 -t to test
+       the integrity of the resulting files, and decompress those
+       which are undamaged.
+
+       bzip2recover takes a single argument, the name of the dam-
+       aged file, and writes a number of files "rec0001file.bz2",
+       "rec0002file.bz2", etc, containing the  extracted  blocks.
+       The output filenames are designed so that the use of wild-
+       cards in subsequent processing -- for example, "bzip2  -dc
+       rec*file.bz2  >  recovered_data" -- lists the files in the
+       "right" order.
+
+       bzip2recover should be of most use dealing with large .bz2
+       files,  as  these will contain many blocks.  It is clearly
+       futile to use it on damaged single-block  files,  since  a
+       damaged  block  cannot  be recovered.  If you wish to min-
+       imise any potential data loss through media  or  transmis-
+       sion errors, you might consider compressing with a smaller
+       block size.
+
+
+PERFORMANCE NOTES
+       The sorting phase of compression gathers together  similar
+       strings  in  the  file.  Because of this, files containing
+       very long runs of  repeated  symbols,  like  "aabaabaabaab
+       ..."   (repeated   several  hundred  times)  may  compress
+       extraordinarily slowly.  You can use the -vvvvv option  to
+       monitor progress in great detail, if you want.  Decompres-
+       sion speed is unaffected.
+
+       Such pathological cases seem rare in  practice,  appearing
+       mostly in artificially-constructed test files, and in low-
+       level disk images.  It may be inadvisable to use bzip2  to
+       compress  the  latter.   If you do get a file which causes
+       severe slowness in compression, try making the block  size
+       as small as possible, with flag -1.
+
+       Incompressible or virtually-incompressible data may decom-
+       press rather more slowly than one would hope.  This is due
+       to a naive implementation of the move-to-front coder.
+
+       bzip2  usually  allocates  several  megabytes of memory to
+       operate in, and then charges all over it in a fairly  ran-
+       dom  fashion.   This means that performance, both for com-
+       pressing and decompressing, is largely determined  by  the
+       speed  at  which  your  machine  can service cache misses.
+       Because of this, small changes to the code to  reduce  the
+       miss  rate  have  been observed to give disproportionately
+       large performance improvements.  I imagine bzip2 will per-
+       form best on machines with very large caches.
+
+       Test mode (-t) uses the low-memory decompression algorithm
+       (-s).  This means test mode does not run  as  fast  as  it
+       could;  it  could  run as fast as the normal decompression
+       machinery.  This could easily be fixed at the cost of some
+       code bloat.
+
+
+CAVEATS
+       I/O  error  messages  are not as helpful as they could be.
+       Bzip2 tries hard to detect I/O errors  and  exit  cleanly,
+       but  the  details  of  what  the problem is sometimes seem
+       rather misleading.
+
+       This manual page pertains to version 0.1 of bzip2.  It may
+       well  happen that some future version will use a different
+       compressed file format.  If you try to  decompress,  using
+       0.1,  a  .bz2  file created with some future version which
+       uses a different compressed file format, 0.1 will complain
+       that  your  file  "is not a bzip2 file".  If that happens,
+       you should obtain a more recent version of bzip2  and  use
+       that to decompress the file.
+
+       Wildcard expansion for Windows 95 and NT is flaky.
+
+       bzip2recover  uses  32-bit integers to represent bit positions
+       in compressed files, so it cannot handle  compressed
+       files  more than 512 megabytes long.  This could easily be
+       fixed.
+
+       bzip2recover sometimes reports a  very  small,  incomplete
+       final  block.  This is spurious and can be safely ignored.
+
+
+RELATIONSHIP TO bzip-0.21
+       This program is a descendant of the bzip program,  version
+       0.21,  which  I released in August 1996.  The primary difference
+       of bzip2 is its avoidance of the possibly patented
+       algorithms  which  were  used  in 0.21.  bzip2 also brings
+       various useful refinements (-s,  -t),  uses  less  memory,
+       decompresses  significantly  faster,  and  has support for
+       recovering data from damaged files.
+
+       Because bzip2 uses Huffman coding to  construct  the  compressed
+       bitstream, rather than the arithmetic coding used
+       in 0.21, the compressed representations generated  by  the
+       two  programs are incompatible, and they will not interop-
+       erate.  The change in suffix from  .bz  to  .bz2  reflects
+       this.   It would have been helpful to at least allow bzip2
+       to decompress files created by 0.21, but this would defeat
+       the primary aim of having a patent-free compressor.
+
+       Huffman  coding  necessarily  involves some coding ineffi-
+       ciency compared to arithmetic  coding.   This  means  that
+       bzip2  compresses about 1% worse than 0.21, an unfortunate
+       but unavoidable fact-of-life.  On the other  hand,  decom-
+       pression  is approximately 50% faster for the same reason,
+       and the change in file format gave an opportunity  to  add
+       data-recovery features.  So it is not all bad.
+
+
+AUTHOR
+       Julian Seward, jseward@acm.org.
+
+       The ideas embodied in bzip and bzip2 are due to (at least)
+       the following people: Michael Burrows  and  David  Wheeler
+       (for  the  block  sorting  transformation),  David Wheeler
+       (again, for the Huffman coder),  Peter  Fenwick  (for  the
+       structured  coding  model  in 0.21, and many refinements),
+       and Alistair Moffat, Radford Neal and Ian Witten (for  the
+       arithmetic  coder  in 0.21).  I am much indebted for their
+       help, support and advice.  See the file ALGORITHMS in  the
+       source  distribution for pointers to sources of documenta-
+       tion.  Christian von Roques  encouraged  me  to  look  for
+       faster  sorting algorithms, so as to speed up compression.
+       Bela Lubkin encouraged me to improve the  worst-case  com-
+       pression  performance.   Many  people sent patches, helped
+       with portability problems, lent machines, gave advice  and
+       were generally helpful.
+
--- a/bzip2.c
+++ b/bzip2.c
--- a/bzip2.exe
+++ b/bzip2.exe
--- a/bzip2.txt
+++ b/bzip2.txt
@ -0,0 +1,462 @@
+
+
+
+bzip2(1)                                                 bzip2(1)
+
+
+NAME
+       bzip2, bunzip2 - a block-sorting file compressor, v0.1
+       bzip2recover - recovers data from damaged bzip2 files
+
+
+SYNOPSIS
+       bzip2 [ -cdfkstvVL123456789 ] [ filenames ...  ]
+       bunzip2 [ -kvsVL ] [ filenames ...  ]
+       bzip2recover filename
+
+
+DESCRIPTION
+       Bzip2  compresses  files  using the Burrows-Wheeler block-
+       sorting text compression algorithm,  and  Huffman  coding.
+       Compression  is  generally  considerably  better than that
+       achieved by more conventional LZ77/LZ78-based compressors,
+       and  approaches  the performance of the PPM family of sta-
+       tistical compressors.
+
+       The command-line options are deliberately very similar  to
+       those of GNU Gzip, but they are not identical.
+
+       Bzip2  expects  a list of file names to accompany the com-
+       mand-line flags.  Each file is replaced  by  a  compressed
+       version  of  itself,  with  the  name "original_name.bz2".
+       Each compressed file has the same  modification  date  and
+       permissions  as  the corresponding original, so that these
+       properties can  be  correctly  restored  at  decompression
+       time.  File name handling is naive in the sense that there
+       is no mechanism for preserving original file  names,  per-
+       missions  and  dates  in filesystems which lack these con-
+       cepts, or have serious file name length restrictions, such
+       as MS-DOS.
+
+       Bzip2  and  bunzip2  will not overwrite existing files; if
+       you want this to happen, you should delete them first.
+
+       If no file names  are  specified,  bzip2  compresses  from
+       standard  input  to  standard output.  In this case, bzip2
+       will decline to write compressed output to a terminal,  as
+       this  would  be  entirely  incomprehensible  and therefore
+       pointless.
+
+       Bunzip2 (or bzip2 -d ) decompresses and restores all spec-
+       ified files whose names end in ".bz2".  Files without this
+       suffix are ignored.  Again, supplying no filenames  causes
+       decompression from standard input to standard output.
+
+       You  can also compress or decompress files to the standard
+       output by giving the -c flag.  You can decompress multiple
+       files  like  this, but you may only compress a single file
+       this way, since it would otherwise be difficult  to  sepa-
+       rate  out  the  compressed representations of the original
+       files.
+
+       Compression is always performed, even  if  the  compressed
+       file  is slightly larger than the original.  Files of less
+       than about one hundred bytes tend to get larger, since the
+       compression  mechanism  has  a  constant  overhead  in the
+       region of 50 bytes.  Random data (including the output  of
+       most  file  compressors)  is  coded at about 8.05 bits per
+       byte, giving an expansion of around 0.5%.
+
+       As a self-check for your  protection,  bzip2  uses  32-bit
+       CRCs  to make sure that the decompressed version of a file
+       is identical to the original.  This guards against corrup-
+       tion  of  the compressed data, and against undetected bugs
+       in bzip2 (hopefully very unlikely).  The chance  of  data
+       corruption  going  undetected  is  microscopic,  about one
+       chance in four billion for each file processed.  Be aware,
+       though,  that  the  check occurs upon decompression, so it
+       can only tell you that something is wrong.  It  can't
+       help  you recover the original uncompressed data.  You can
+       use bzip2recover to  try  to  recover  data  from  damaged
+       files.
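
The self-check described above can be illustrated with a bitwise
32-bit CRC.  This is a minimal sketch, not code taken from bzip2.
It assumes the polynomial 0x04C11DB7, processed most significant bit
first, with an all-ones initial value and a final inversion; none of
these parameters is stated in this manual page.  The function name
crc32_msb_first is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bitwise CRC-32, most significant bit first.
   Assumed parameters (not confirmed by the manual page):
   polynomial 0x04C11DB7, initial value 0xFFFFFFFF, result inverted. */
uint32_t crc32_msb_first(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    size_t i;
    int bit;

    for (i = 0; i < len; i++) {
        /* Feed the next byte into the top eight bits of the register. */
        crc ^= (uint32_t)buf[i] << 24;
        for (bit = 0; bit < 8; bit++) {
            if (crc & 0x80000000u)
                crc = (crc << 1) ^ 0x04C11DB7u;
            else
                crc <<= 1;
        }
    }
    return ~crc;
}
```

A decompressor would recompute this value over its output and compare
it with the stored one.  A mismatch only reports that something is
wrong; it does not help locate or repair the damage.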
+
+       Return  values:  0  for a normal exit, 1 for environmental
+       problems (file not found, invalid flags, I/O errors,  &c),
+       2 to indicate a corrupt compressed file, 3 for an internal
+       consistency error (eg, bug) which caused bzip2 to panic.
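
A script or wrapper program can branch on the return values listed
above.  The sketch below maps them to short descriptions; the
enumerator and function names are invented for illustration and do
not come from the bzip2 source.

```c
/* Exit statuses as listed in the text.  The names are hypothetical. */
enum bzip2_exit_status {
    BZ_EXIT_OK          = 0,   /* normal exit */
    BZ_EXIT_ENVIRONMENT = 1,   /* file not found, invalid flags, I/O errors */
    BZ_EXIT_CORRUPT     = 2,   /* corrupt compressed file */
    BZ_EXIT_INTERNAL    = 3    /* internal consistency error */
};

/* Return a short description for a status value. */
const char *describe_bzip2_exit(int status)
{
    switch (status) {
    case BZ_EXIT_OK:          return "normal exit";
    case BZ_EXIT_ENVIRONMENT: return "environmental problem";
    case BZ_EXIT_CORRUPT:     return "corrupt compressed file";
    case BZ_EXIT_INTERNAL:    return "internal consistency error";
    default:                  return "unknown status";
    }
}
```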
+
+
+MEMORY MANAGEMENT
+       Bzip2 compresses large files in blocks.   The  block  size
+       affects  both  the  compression  ratio  achieved,  and the
+       amount of memory needed both for  compression  and  decom-
+       pression.   The flags -1 through -9 specify the block size
+       to be 100,000 bytes through 900,000  bytes  (the  default)
+       respectively.   At decompression-time, the block size used
+       for compression is read from the header of the  compressed
+       file, and bunzip2 then allocates itself just enough memory
+       to decompress the file.  Since block sizes are  stored  in
+       compressed  files,  it follows that the flags -1 to -9 are
+       irrelevant to and so ignored during  decompression.   Com-
+       pression  and decompression requirements, in bytes, can be
+       estimated as:
+
+             Compression:   400k + ( 7 x block size )
+
+             Decompression: 100k + ( 5 x block size ), or
+                            100k + ( 2.5 x block size )
+
+       Larger  block  sizes  give  rapidly  diminishing  marginal
+       returns;  most of the compression comes from the first two
+       or three hundred k of block size, a fact worth bearing  in
+       mind  when  using  bzip2  on  small  machines.  It is also
+       important to  appreciate  that  the  decompression  memory
+       requirement  is  set  at compression-time by the choice of
+       block size.
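
The estimates above can be written as two small functions.  This is a
sketch under the assumption that flag -N means a block of N x 100k
and that all quantities are in the same "k" units the text uses; the
function names are invented for illustration.

```c
/* Estimated compression memory in kbytes for flags -1 .. -9:
   400k + 7 x block size. */
long compress_kbytes(int level)
{
    long block_k = 100L * level;
    return 400L + 7L * block_k;
}

/* Estimated decompression memory in kbytes.  A non-zero `small`
   selects the reduced-memory figure, 100k + 2.5 x block size;
   otherwise 100k + 5 x block size. */
long decompress_kbytes(int level, int small)
{
    long block_k = 100L * level;
    if (small)
        return 100L + (5L * block_k) / 2L;
    return 100L + 5L * block_k;
}
```

For the default flag -9 these give 6700k to compress, 4600k to
decompress, and 2350k to decompress with the reduced-memory
algorithm.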
+
+       For files compressed with the  default  900k  block  size,
+       bunzip2  will require about 4600 kbytes to decompress.  To
+       support decompression of any file on a 4 megabyte machine,
+       bunzip2  has  an  option to decompress using approximately
+       half this amount of memory, about 2300 kbytes.  Decompres-
+       sion  speed  is also halved, so you should use this option
+       only where necessary.  The relevant flag is -s.
+
+       In general, try and use the largest block size memory con-
+       straints  allow,  since  that  maximises  the  compression
+       achieved.  Compression and decompression speed are  virtu-
+       ally unaffected by block size.
+
+       Another  significant point applies to files which fit in a
+       single block -- that  means  most  files  you'd  encounter
+       using  a  large  block  size.   The  amount of real memory
+       touched is proportional to the size of the file, since the
+       file  is smaller than a block.  For example, compressing a
+       file 20,000 bytes long with the flag  -9  will  cause  the
+       compressor  to  allocate  around 6700k of memory, but only
+       touch 400k + 20000 * 7 = 540 kbytes of it.  Similarly, the
+       decompressor  will  allocate  4600k  but only touch 100k +
+       20000 * 5 = 200 kbytes.
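
The arithmetic in the example above can be checked with a short
sketch.  It assumes 1k = 1000 bytes, as the worked figures imply, and
that a file at least as large as the block touches the whole
allocation.  The function names are invented for illustration.

```c
/* Bytes of input that occupy one block: the file size, capped at
   the block size selected by flags -1 .. -9. */
static long bytes_in_block(long file_bytes, int level)
{
    long block_bytes = 100000L * level;
    return file_bytes < block_bytes ? file_bytes : block_bytes;
}

/* kbytes touched while compressing: 400k + 7 x bytes in the block. */
long touched_compress_kbytes(long file_bytes, int level)
{
    return 400L + (7L * bytes_in_block(file_bytes, level)) / 1000L;
}

/* kbytes touched while decompressing: 100k + 5 x bytes in the block. */
long touched_decompress_kbytes(long file_bytes, int level)
{
    return 100L + (5L * bytes_in_block(file_bytes, level)) / 1000L;
}
```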
+
+       Here is a table which summarises the maximum memory  usage
+       for  different  block  sizes.   Also recorded is the total
+       compressed size for 14 files of the Calgary Text  Compres-
+       sion  Corpus totalling 3,141,622 bytes.  This column gives
+       some feel for how  compression  varies  with  block  size.
+       These  figures  tend to understate the advantage of larger
+       block sizes for larger files, since the  Corpus  is  domi-
+       nated by smaller files.
+
+                  Compress   Decompress   Decompress   Corpus
+           Flag     usage      usage       -s usage     Size
+
+            -1      1100k       600k         350k      914704
+            -2      1800k      1100k         600k      877703
+            -3      2500k      1600k         850k      860338
+            -4      3200k      2100k        1100k      846899
+            -5      3900k      2600k        1350k      845160
+            -6      4600k      3100k        1600k      838626
+            -7      5400k      3600k        1850k      834096
+            -8      6000k      4100k        2100k      828642
+            -9      6700k      4600k        2350k      828642
+
+
+OPTIONS
+       -c --stdout
+              Compress or decompress to standard output.  -c will
+              decompress multiple files to stdout, but will  only
+              compress a single file to stdout.
+
+       -d --decompress
+              Force  decompression.  Bzip2 and bunzip2 are really
+              the same program, and the decision about whether to
+              compress  or  decompress  is  done  on the basis of
+              which name is used.  This flag overrides that mech-
+              anism, and forces bzip2 to decompress.
+
+       -f --compress
+              The  complement  to -d: forces compression, regardless
+              of the invocation name.
+
+       -t --test
+              Check integrity of the specified file(s), but don't
+              decompress  them.   This  really  performs  a trial
+              decompression and throws away the result, using the
+              low-memory decompression algorithm (see -s).
+
+       -k --keep
+              Keep  (don't delete) input files during compression
+              or decompression.
+
+       -s --small
+              Reduce  memory  usage,  both  for  compression  and
+              decompression.  Files are decompressed using a mod-
+              ified algorithm which only requires 2.5  bytes  per
+              block  byte.   This  means  any  file can be decom-
+              pressed in 2300k of memory,  albeit  somewhat  more
+              slowly than usual.
+
+              During  compression,  -s  selects  a  block size of
+              200k, which limits memory use to  around  the  same
+              figure,  at  the expense of your compression ratio.
+              In short, if your  machine  is  low  on  memory  (8
+              megabytes  or  less),  use  -s for everything.  See
+              MEMORY MANAGEMENT above.
+
+
+       -v --verbose
+              Verbose mode -- show the compression ratio for each
+              file  processed.   Further  -v's  increase the ver-
+              bosity level, spewing out lots of information which
+              is primarily of interest for diagnostic purposes.
+
+       -L --license
+              Display  the  software  version,  license terms and
+              conditions.
+
+       -V --version
+              Same as -L.
+
+       -1 to -9
+              Set the block size to 100 k, 200 k ..  900  k  when
+              compressing.   Has  no  effect  when decompressing.
+              See MEMORY MANAGEMENT above.
+
+       --repetitive-fast
+              bzip2 injects some small  pseudo-random  variations
+              into  very  repetitive  blocks  to limit worst-case
+              performance during compression.   If  sorting  runs
+              into  difficulties,  the  block  is randomised, and
+              sorting is restarted.  Very roughly, bzip2 persists
+              for  three  times  as  long as a well-behaved input
+              would take before resorting to randomisation.  This
+              flag makes it give up much sooner.
+
+
+       --repetitive-best
+              Opposite  of  --repetitive-fast;  try  a lot harder
+              before resorting to randomisation.
+
+
+RECOVERING DATA FROM DAMAGED FILES
+       bzip2 compresses files in blocks, usually 900kbytes  long.
+       Each block is handled independently.  If a media or trans-
+       mission error causes a multi-block  .bz2  file  to  become
+       damaged,  it  may  be  possible  to  recover data from the
+       undamaged blocks in the file.
+
+       The compressed representation of each block  is  delimited
+       by  a  48-bit pattern, which makes it possible to find the
+       block boundaries with reasonable  certainty.   Each  block
+       also  carries its own 32-bit CRC, so damaged blocks can be
+       distinguished from undamaged ones.
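
Because blocks are not byte-aligned, the 48-bit pattern has to be
searched for at every bit offset.  The sketch below shows one way to
do that with a shift register.  The pattern value 0x314159265359 is
the block header constant defined in bzip2recover.c; the function
name and interface are invented for illustration.  It records where
each pattern starts, whereas bzip2recover records the position just
after it.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_MAGIC 0x314159265359ULL   /* 48-bit block header pattern */
#define MASK_48     0xFFFFFFFFFFFFULL

/* Hypothetical helper: scan buf bit by bit, most significant bit of
   each byte first, and store the bit offset at which each block
   header pattern begins.  Returns the number of patterns found; at
   most max_out offsets are written to out. */
size_t find_block_headers(const unsigned char *buf, size_t len,
                          uint64_t *out, size_t max_out)
{
    uint64_t window = 0;       /* the most recent 48 bits */
    uint64_t bits_read = 0;
    size_t found = 0;
    size_t i;
    int bit;

    for (i = 0; i < len; i++) {
        for (bit = 7; bit >= 0; bit--) {
            window = ((window << 1) | ((buf[i] >> bit) & 1u)) & MASK_48;
            bits_read++;
            if (bits_read >= 48 && window == BLOCK_MAGIC) {
                if (found < max_out)
                    out[found] = bits_read - 48;
                found++;
            }
        }
    }
    return found;
}
```

A 48-bit pattern can also occur by chance inside compressed data,
which is why the text speaks of "reasonable certainty" and why each
extracted block should still be checked against its CRC.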
+
+       bzip2recover is a  simple  program  whose  purpose  is  to
+       search  for blocks in .bz2 files, and write each block out
+       into its own .bz2 file.  You can then use bzip2 -t to test
+       the integrity of the resulting files, and decompress those
+       which are undamaged.
+
+       bzip2recover takes a single argument, the name of the dam-
+       aged file, and writes a number of files "rec0001file.bz2",
+       "rec0002file.bz2", etc, containing the  extracted  blocks.
+       The output filenames are designed so that the use of wild-
+       cards in subsequent processing -- for example, "bzip2  -dc
+       rec*file.bz2  >  recovered_data" -- lists the files in the
+       "right" order.
+
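
The zero-padded block number is what makes wildcard expansion list
the recovered files in block order.  A minimal sketch of building
such a name follows; the helper name and its use of snprintf are
assumptions for illustration, not the method bzip2recover itself
uses.

```c
#include <stdio.h>

/* Hypothetical helper: write "recNNNN<in_name>" into out, where NNNN
   is the block number padded with zeros to four digits.
   Returns 0 on success, -1 if the name does not fit in out_size. */
int recovered_name(char *out, size_t out_size, int block, const char *in_name)
{
    int n = snprintf(out, out_size, "rec%04d%s", block, in_name);
    if (n < 0 || (size_t)n >= out_size)
        return -1;
    return 0;
}
```

With this padding, block 2 yields "rec0002file.bz2" and block 10
yields "rec0010file.bz2", so a plain lexicographic sort keeps them in
numerical order.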
+       bzip2recover should be of most use dealing with large .bz2
+       files,  as  these will contain many blocks.  It is clearly
+       futile to use it on damaged single-block  files,  since  a
+       damaged  block  cannot  be recovered.  If you wish to min-
+       imise any potential data loss through media  or  transmis-
+       sion errors, you might consider compressing with a smaller
+       block size.
+
+
+PERFORMANCE NOTES
+       The sorting phase of compression gathers together  similar
+       strings  in  the  file.  Because of this, files containing
+       very long runs of  repeated  symbols,  like  "aabaabaabaab
+       ..."   (repeated   several  hundred  times)  may  compress
+       extraordinarily slowly.  You can use the -vvvvv option  to
+       monitor progress in great detail, if you want.  Decompres-
+       sion speed is unaffected.
+
+       Such pathological cases seem rare in  practice,  appearing
+       mostly in artificially-constructed test files, and in low-
+       level disk images.  It may be inadvisable to use bzip2  to
+       compress  the  latter.   If you do get a file which causes
+       severe slowness in compression, try making the block  size
+       as small as possible, with flag -1.
+
+       Incompressible or virtually-incompressible data may decom-
+       press rather more slowly than one would hope.  This is due
+       to a naive implementation of the move-to-front coder.
+
+       bzip2  usually  allocates  several  megabytes of memory to
+       operate in, and then charges all over it in a fairly  ran-
+       dom  fashion.   This means that performance, both for com-
+       pressing and decompressing, is largely determined  by  the
+       speed  at  which  your  machine  can service cache misses.
+       Because of this, small changes to the code to  reduce  the
+       miss  rate  have  been observed to give disproportionately
+       large performance improvements.  I imagine bzip2 will per-
+       form best on machines with very large caches.
+
+       Test mode (-t) uses the low-memory decompression algorithm
+       (-s).  This means test mode does not run  as  fast  as  it
+       could;  it  could  run as fast as the normal decompression
+       machinery.  This could easily be fixed at the cost of some
+       code bloat.
+
+
+CAVEATS
+       I/O  error  messages  are not as helpful as they could be.
+       Bzip2 tries hard to detect I/O errors  and  exit  cleanly,
+       but  the  details  of  what  the problem is sometimes seem
+       rather misleading.
+
+       This manual page pertains to version 0.1 of bzip2.  It may
+       well  happen that some future version will use a different
+       compressed file format.  If you try to  decompress,  using
+       0.1,  a  .bz2  file created with some future version which
+       uses a different compressed file format, 0.1 will complain
+       that  your  file  "is not a bzip2 file".  If that happens,
+       you should obtain a more recent version of bzip2  and  use
+       that to decompress the file.
+
+       Wildcard expansion for Windows 95 and NT is flaky.
+
+       bzip2recover  uses  32-bit integers to represent bit posi-
+       tions in compressed files, so it cannot handle  compressed
+       files  more than 512 megabytes long.  This could easily be
+       fixed.
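
The 512 megabyte figure follows from the width of the bit counter:
2^32 bit positions divided by 8 bits per byte.  A small sketch of
that arithmetic, with an invented function name:

```c
#include <stdint.h>

/* Largest file size in bytes whose bit positions can be counted by
   an unsigned integer `width` bits wide (width must be 1..63). */
uint64_t max_bytes_for_bit_index(unsigned width)
{
    return (UINT64_C(1) << width) / 8u;
}
```

For width 32 this gives 536,870,912 bytes, that is 512 x 1024 x 1024.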
+
+       bzip2recover sometimes reports a  very  small,  incomplete
+       final  block.  This is spurious and can be safely ignored.
+
+
+RELATIONSHIP TO bzip-0.21
+       This program is a descendant of the bzip program,  version
+       0.21,  which  I released in August 1996.  The primary dif-
+       ference of bzip2 is its avoidance of the possibly patented
+       algorithms  which  were  used  in 0.21.  bzip2 also brings
+       various useful refinements (-s,  -t),  uses  less  memory,
+       decompresses  significantly  faster,  and  has support for
+       recovering data from damaged files.
+
+       Because bzip2 uses Huffman coding to  construct  the  com-
+       pressed  bitstream, rather than the arithmetic coding used
+       in 0.21, the compressed representations generated  by  the
+       two  programs are incompatible, and they will not interop-
+       erate.  The change in suffix from  .bz  to  .bz2  reflects
+       this.   It would have been helpful to at least allow bzip2
+       to decompress files created by 0.21, but this would defeat
+       the primary aim of having a patent-free compressor.
+
+       Huffman  coding  necessarily  involves some coding ineffi-
+       ciency compared to arithmetic  coding.   This  means  that
+       bzip2  compresses about 1% worse than 0.21, an unfortunate
+       but unavoidable fact-of-life.  On the other  hand,  decom-
+       pression  is approximately 50% faster for the same reason,
+       and the change in file format gave an opportunity  to  add
+       data-recovery features.  So it is not all bad.
+
+
+AUTHOR
+       Julian Seward, jseward@acm.org.
+
+       The ideas embodied in bzip and bzip2 are due to (at least)
+       the following people: Michael Burrows  and  David  Wheeler
+       (for  the  block  sorting  transformation),  David Wheeler
+       (again, for the Huffman coder),  Peter  Fenwick  (for  the
+       structured  coding  model  in 0.21, and many refinements),
+       and Alistair Moffat, Radford Neal and Ian Witten (for  the
+       arithmetic  coder  in 0.21).  I am much indebted for their
+       help, support and advice.  See the file ALGORITHMS in  the
+       source  distribution for pointers to sources of documenta-
+       tion.  Christian von Roques  encouraged  me  to  look  for
+       faster  sorting algorithms, so as to speed up compression.
+       Bela Lubkin encouraged me to improve the  worst-case  com-
+       pression  performance.   Many  people sent patches, helped
+       with portability problems, lent machines, gave advice  and
+       were generally helpful.
+
--- a/bzip2recover.c
+++ b/bzip2recover.c
@ -0,0 +1,399 @@
+
+/*-----------------------------------------------------------*/
+/*--- Block recoverer program for bzip2                   ---*/
+/*---                                      bzip2recover.c ---*/
+/*-----------------------------------------------------------*/
+
+/*--
+  This program is bzip2recover, a program to attempt data 
+  salvage from damaged files created by the accompanying
+  bzip2 program.
+
+  Copyright (C) 1996, 1997 by Julian Seward.
+     Guildford, Surrey, UK
+     email: jseward@acm.org
+
+  This program is free software; you can redistribute it and/or modify
+  it under the terms of the GNU General Public License as published by
+  the Free Software Foundation; either version 2 of the License, or
+  (at your option) any later version.
+
+  This program is distributed in the hope that it will be useful,
+  but WITHOUT ANY WARRANTY; without even the implied warranty of
+  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+  GNU General Public License for more details.
+
+  You should have received a copy of the GNU General Public License
+  along with this program; if not, write to the Free Software
+  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+  The GNU General Public License is contained in the file LICENSE.
+--*/
+
+
+#include <stdio.h>
+#include <errno.h>
+#include <stdlib.h>   /*-- malloc, free, exit --*/
+#include <string.h>   /*-- strlen, strcpy --*/
+
+#define UInt32  unsigned int
+#define Int32   int
+#define UChar   unsigned char
+#define Char    char
+#define Bool    unsigned char
+#define True    1
+#define False   0
+
+
+Char inFileName[2000];
+Char outFileName[2000];
+Char progName[2000];
+
+UInt32 bytesOut = 0;
+UInt32 bytesIn  = 0;
+
+
+/*---------------------------------------------------*/
+/*--- I/O errors                                  ---*/
+/*---------------------------------------------------*/
+
+/*---------------------------------------------*/
+void readError ( void )
+{
+   fprintf ( stderr,
+             "%s: I/O error reading `%s', possible reason follows.\n",
+            progName, inFileName );
+   perror ( progName );
+   fprintf ( stderr, "%s: warning: output file(s) may be incomplete.\n",
+             progName );
+   exit ( 1 );
+}
+
+
+/*---------------------------------------------*/
+void writeError ( void )
+{
+   fprintf ( stderr,
+             "%s: I/O error writing `%s', possible reason follows.\n",
+            progName, outFileName );
+   perror ( progName );
+   fprintf ( stderr, "%s: warning: output file(s) may be incomplete.\n",
+             progName );
+   exit ( 1 );
+}
+
+
+/*---------------------------------------------*/
+void mallocFail ( Int32 n )
+{
+   fprintf ( stderr,
+             "%s: malloc failed on request for %d bytes.\n",
+            progName, n );
+   fprintf ( stderr, "%s: warning: output file(s) may be incomplete.\n",
+             progName );
+   exit ( 1 );
+}
+
+
+/*---------------------------------------------------*/
+/*--- Bit stream I/O                              ---*/
+/*---------------------------------------------------*/
+
+typedef
+   struct {
+      FILE*  handle;
+      Int32  buffer;
+      Int32  buffLive;
+      Char   mode;
+   }
+   BitStream;
+
+
+/*---------------------------------------------*/
+BitStream* bsOpenReadStream ( FILE* stream )
+{
+   BitStream *bs = malloc ( sizeof(BitStream) );
+   if (bs == NULL) mallocFail ( sizeof(BitStream) );
+   bs->handle = stream;
+   bs->buffer = 0;
+   bs->buffLive = 0;
+   bs->mode = 'r';
+   return bs;
+}
+
+
+/*---------------------------------------------*/
+BitStream* bsOpenWriteStream ( FILE* stream )
+{
+   BitStream *bs = malloc ( sizeof(BitStream) );
+   if (bs == NULL) mallocFail ( sizeof(BitStream) );
+   bs->handle = stream;
+   bs->buffer = 0;
+   bs->buffLive = 0;
+   bs->mode = 'w';
+   return bs;
+}
+
+
+/*---------------------------------------------*/
+void bsPutBit ( BitStream* bs, Int32 bit )
+{
+   if (bs->buffLive == 8) {
+      Int32 retVal = putc ( (UChar) bs->buffer, bs->handle );
+      if (retVal == EOF) writeError();
+      bytesOut++;
+      bs->buffLive = 1;
+      bs->buffer = bit & 0x1;
+   } else {
+      bs->buffer = ( (bs->buffer << 1) | (bit & 0x1) );
+      bs->buffLive++;
+   };
+}
+
+
+/*---------------------------------------------*/
+/*--
+   Returns 0 or 1, or 2 to indicate EOF.
+--*/
+Int32 bsGetBit ( BitStream* bs )
+{
+   if (bs->buffLive > 0) {
+      bs->buffLive --;
+      return ( ((bs->buffer) >> (bs->buffLive)) & 0x1 );
+   } else {
+      Int32 retVal = getc ( bs->handle );
+      if ( retVal == EOF ) {
+         if (ferror ( bs->handle )) readError();
+         return 2;
+      }
+      bs->buffLive = 7;
+      bs->buffer = retVal;
+      return ( ((bs->buffer) >> 7) & 0x1 );
+   }
+}
+
+
+/*---------------------------------------------*/
+void bsClose ( BitStream* bs )
+{
+   Int32 retVal;
+
+   if ( bs->mode == 'w' ) {
+      while ( bs->buffLive < 8 ) {
+         bs->buffLive++;
+         bs->buffer <<= 1;
+      };
+      retVal = putc ( (UChar) (bs->buffer), bs->handle );
+      if (retVal == EOF) writeError();
+      bytesOut++;
+      retVal = fflush ( bs->handle );
+      if (retVal == EOF) writeError();
+   }
+   retVal = fclose ( bs->handle );
+   if (retVal == EOF) {
+      if (bs->mode == 'w') writeError(); else readError();
+   }
+   free ( bs );
+}
+
+
+/*---------------------------------------------*/
+void bsPutUChar ( BitStream* bs, UChar c )
+{
+   Int32 i;
+   for (i = 7; i >= 0; i--)
+      bsPutBit ( bs, (((UInt32) c) >> i) & 0x1 );
+}
+
+
+/*---------------------------------------------*/
+void bsPutUInt32 ( BitStream* bs, UInt32 c )
+{
+   Int32 i;
+
+   for (i = 31; i >= 0; i--)
+      bsPutBit ( bs, (c >> i) & 0x1 );
+}
+
+
+/*---------------------------------------------*/
+Bool endsInBz2 ( Char* name )
+{
+   Int32 n = strlen ( name );
+   if (n <= 4) return False;
+   return
+      (name[n-4] == '.' &&
+       name[n-3] == 'b' &&
+       name[n-2] == 'z' &&
+       name[n-1] == '2');
+}
+
+
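+/*---------------------------------------------------*/
+/*--- Block boundary markers                      ---*/
+/*---------------------------------------------------*/
+
+/*--
+   Each marker is 48 bits, split into its top 16 bits (HI)
+   and bottom 32 bits (LO).  The header is 0x314159265359
+   and the end mark is 0x177245385090.
+--*/
+#define BLOCK_HEADER_HI  0x00003141UL
+#define BLOCK_HEADER_LO  0x59265359UL
+
+#define BLOCK_ENDMARK_HI 0x00001772UL
+#define BLOCK_ENDMARK_LO 0x45385090UL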
+
+Int32 main ( Int32 argc, Char** argv )
+{
+   FILE*       inFile;
+   FILE*       outFile;
+   BitStream*  bsIn, *bsWr;
+   Int32       currBlock, b, wrBlock;
+   UInt32      bitsRead;
+   UInt32      bStart[20000];
+   UInt32      bEnd[20000];
+   UInt32      buffHi, buffLo, blockCRC;
+   Char*       p;
+
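+   /*-- bounded copy: progName is zero-initialised, so it stays terminated --*/
+   strncpy ( progName, argv[0], sizeof(progName)-1 );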
+   inFileName[0] = outFileName[0] = 0;
+
+   fprintf ( stderr, "bzip2recover: extracts blocks from damaged .bz2 files.\n" );
+
+   if (argc != 2) {
+      fprintf ( stderr, "%s: usage is `%s damaged_file_name'.\n",
+                        progName, progName );
+      exit(1);
+   }
+
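+   /*-- bounded copy: inFileName is zero-initialised, so it stays terminated --*/
+   strncpy ( inFileName, argv[1], sizeof(inFileName)-1 );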
+
+   inFile = fopen ( inFileName, "rb" );
+   if (inFile == NULL) {
+      fprintf ( stderr, "%s: can't read `%s'\n", progName, inFileName );
+      exit(1);
+   }
+
+   bsIn = bsOpenReadStream ( inFile );
+   fprintf ( stderr, "%s: searching for block boundaries ...\n", progName );
+
+   bitsRead = 0;
+   buffHi = buffLo = 0;
+   currBlock = 0;
+   bStart[currBlock] = 0;
+
+   while (True) {
+      b = bsGetBit ( bsIn );
+      bitsRead++;
+      if (b == 2) {
+         if (bitsRead >= bStart[currBlock] &&
+            (bitsRead - bStart[currBlock]) >= 40) {
+            bEnd[currBlock] = bitsRead-1;
+            if (currBlock > 0)
+               fprintf ( stderr, "   block %d runs from %u to %u (incomplete)\n",
+                         currBlock,  bStart[currBlock], bEnd[currBlock] );
+         } else
+            currBlock--;
+         break;
+      }
+      buffHi = (buffHi << 1) | (buffLo >> 31);
+      buffLo = (buffLo << 1) | (b & 1);
+      if ( ( (buffHi & 0x0000ffff) == BLOCK_HEADER_HI 
+             && buffLo == BLOCK_HEADER_LO)
+           || 
+           ( (buffHi & 0x0000ffff) == BLOCK_ENDMARK_HI 
+             && buffLo == BLOCK_ENDMARK_LO)
+         ) {
+         if (bitsRead > 49)
+            bEnd[currBlock] = bitsRead-49; else
+            bEnd[currBlock] = 0;
+         if (currBlock > 0)
+            fprintf ( stderr, "   block %d runs from %u to %u\n",
+                      currBlock,  bStart[currBlock], bEnd[currBlock] );
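+         currBlock++;
+         /*-- bStart and bEnd hold 20000 entries; don't run off the end --*/
+         if (currBlock >= 20000) {
+            fprintf ( stderr, "%s: too many blocks (limit is 20000)\n", progName );
+            exit(1);
+         }
+         bStart[currBlock] = bitsRead;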
+      }
+   }
+
+   bsClose ( bsIn );
+
+   /*-- identified blocks run from 1 to currBlock inclusive. --*/
+
+   if (currBlock < 1) {
+      fprintf ( stderr,
+                "%s: sorry, I couldn't find any block boundaries.\n",
+                progName );
+      exit(1);
+   }
+
+   fprintf ( stderr, "%s: splitting into blocks\n", progName );
+
+   inFile = fopen ( inFileName, "rb" );
+   if (inFile == NULL) {
+      fprintf ( stderr, "%s: can't open `%s'\n", progName, inFileName );
+      exit(1);
+   }
+   bsIn = bsOpenReadStream ( inFile );
+
+   /*-- placate gcc's dataflow analyser --*/
+   blockCRC = 0; bsWr = 0;
+
+   bitsRead = 0;
+   outFile = NULL;
+   wrBlock = 1;
+   while (True) {
+      b = bsGetBit(bsIn);
+      if (b == 2) break;
+      buffHi = (buffHi << 1) | (buffLo >> 31);
+      buffLo = (buffLo << 1) | (b & 1);
+      if (bitsRead == 47+bStart[wrBlock]) 
+         blockCRC = (buffHi << 16) | (buffLo >> 16);
+
+      if (outFile != NULL && bitsRead >= bStart[wrBlock]
+                          && bitsRead <= bEnd[wrBlock]) {
+         bsPutBit ( bsWr, b );
+      }
+
+      bitsRead++;
+
+      if (bitsRead == bEnd[wrBlock]+1) {
+         if (outFile != NULL) {
+            bsPutUChar ( bsWr, 0x17 ); bsPutUChar ( bsWr, 0x72 );
+            bsPutUChar ( bsWr, 0x45 ); bsPutUChar ( bsWr, 0x38 );
+            bsPutUChar ( bsWr, 0x50 ); bsPutUChar ( bsWr, 0x90 );
+            bsPutUInt32 ( bsWr, blockCRC );
+            bsClose ( bsWr );
+         }
+         if (wrBlock >= currBlock) break;
+         wrBlock++;
+      } else
+      if (bitsRead == bStart[wrBlock]) {
+         outFileName[0] = 0;
+         sprintf ( outFileName, "rec%4d", wrBlock );
+         for (p = outFileName; *p != 0; p++) if (*p == ' ') *p = '0';
+         strcat ( outFileName, inFileName );
+         if ( !endsInBz2(outFileName)) strcat ( outFileName, ".bz2" );
+
+         fprintf ( stderr, "   writing block %d to `%s' ...\n",
+                           wrBlock, outFileName );
+
+         outFile = fopen ( outFileName, "wb" );
+         if (outFile == NULL) {
+            fprintf ( stderr, "%s: can't write `%s'\n",
+                      progName, outFileName );
+            exit(1);
+         }
+         bsWr = bsOpenWriteStream ( outFile );
+         bsPutUChar ( bsWr, 'B' ); bsPutUChar ( bsWr, 'Z' );
+         bsPutUChar ( bsWr, 'h' ); bsPutUChar ( bsWr, '9' );
+         bsPutUChar ( bsWr, 0x31 ); bsPutUChar ( bsWr, 0x41 );
+         bsPutUChar ( bsWr, 0x59 ); bsPutUChar ( bsWr, 0x26 );
+         bsPutUChar ( bsWr, 0x53 ); bsPutUChar ( bsWr, 0x59 );
+      }
+   }
+
+   fprintf ( stderr, "%s: finished\n", progName );
+   return 0;
+}
+
+
+
+/*-----------------------------------------------------------*/
+/*--- end                                  bzip2recover.c ---*/
+/*-----------------------------------------------------------*/
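
The boundary search in bzip2recover.c above shifts the input one bit at a time through a 48-bit window (held there as buffHi and buffLo) and compares the window with the block-header and end-of-stream magic numbers. Below is a minimal sketch of the same idea, assuming a C99 compiler with a 64-bit integer type so that the window fits in one variable. The name find_magics and its parameters are hypothetical and are not part of bzip2recover; it works on an in-memory buffer rather than a BitStream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* BLOCK_HEADER_HI/LO and BLOCK_ENDMARK_HI/LO joined into single 48-bit values */
#define BLOCK_HEADER_MAGIC   0x314159265359ULL
#define BLOCK_ENDMARK_MAGIC  0x177245385090ULL
#define MAGIC_MASK           0xFFFFFFFFFFFFULL   /* keep the low 48 bits */

/* Hypothetical helper: scan buf bit by bit (most significant bit first)
   and record, for each magic found, the number of bits consumed up to
   and including its last bit.  At most maxpos positions are stored in
   pos[]; the return value is the total number of magics seen. */
static size_t find_magics ( const unsigned char *buf, size_t nbytes,
                            uint64_t *pos, size_t maxpos )
{
   uint64_t window   = 0;   /* the last 48 bits read */
   uint64_t bitsRead = 0;
   size_t   found    = 0;
   size_t   i;
   int      bit;

   assert ( buf != NULL || nbytes == 0 );

   for (i = 0; i < nbytes; i++) {
      for (bit = 7; bit >= 0; bit--) {
         window = ((window << 1) | ((buf[i] >> bit) & 1u)) & MAGIC_MASK;
         bitsRead++;
         /* ignore matches that would rely on the initial zero padding */
         if (bitsRead >= 48 &&
             (window == BLOCK_HEADER_MAGIC || window == BLOCK_ENDMARK_MAGIC)) {
            if (found < maxpos) pos[found] = bitsRead;
            found++;
         }
      }
   }
   return found;
}
```

For a buffer holding 'B', 'Z', 'h', '9' followed by the bytes 0x31 0x41 0x59 0x26 0x53 0x59, find_magics reports one magic ending at bit 80: 32 bits of stream header plus the 48-bit magic. Because the comparison is made after every bit, the magic is found at any bit offset, which is what lets block boundaries be located in a stream that is not byte-aligned.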
--- a/sample1.bz2
+++ b/sample1.bz2
--- a/sample1.ref
+++ b/sample1.ref
--- a/sample2.bz2
+++ b/sample2.bz2
--- a/sample2.ref
+++ b/sample2.ref
--- a/test.bat
+++ b/test.bat
@ -0,0 +1,9 @@
+@rem
+@rem MSDOS test driver for bzip2
+@rem
+type words1
+.\bzip2 -1 < sample1.ref > sample1.rbz
+.\bzip2 -2 < sample2.ref > sample2.rbz
+.\bzip2 -dvv < sample1.bz2 > sample1.tst
+.\bzip2 -dvv < sample2.bz2 > sample2.tst
+type words3sh
--- a/test.cmd
+++ b/test.cmd
@ -0,0 +1,9 @@
+@rem
+@rem OS/2 test driver for bzip2
+@rem
+type words1
+.\bzip2 -1 < sample1.ref > sample1.rbz
+.\bzip2 -2 < sample2.ref > sample2.rbz
+.\bzip2 -dvv < sample1.bz2 > sample1.tst
+.\bzip2 -dvv < sample2.bz2 > sample2.tst
+type words3sh
--- a/7
+++ b/7
@ -0,0 +1,7 @@
+***-------------------------------------------------***
+***--------- IMPORTANT: READ WHAT FOLLOWS! ---------***
+***---------     viz: pay attention :-)    ---------***
+***-------------------------------------------------***
+
+Compiling bzip2 ...
+
--- a/5
+++ b/5
@ -0,0 +1,5 @@
+
+
+Doing 4 tests (2 compress, 2 uncompress) ...
+If there's a problem, things might stop at this point.
+ 
--- a/6
+++ b/6
@ -0,0 +1,6 @@
+
+
+Checking test results.  If any of the four "cmp"s which follow
+report any differences, something is wrong.  If you can't easily
+figure out what, please let me know (jseward@acm.org).
+
--- a/23
+++ b/23
@ -0,0 +1,23 @@
+
+
+If you got this far and the "cmp"s didn't find anything amiss, it looks
+like you're in business.  You should install bzip2 and bunzip2:
+
+   Copy bzip2 to a public place, maybe /usr/bin.
+   In that public place, make bunzip2 a symbolic link
+      to the bzip2 you just copied there.
+   Put the manual page, bzip2.1, somewhere appropriate;
+      perhaps in /usr/man/man1.
+
+Complete instructions for use are in the preformatted
+manual page, in the file bzip2.1.preformatted.
+
+You can also do "bzip2 --help" to see some helpful information. 
+
+"bzip2 -L" displays the software license.
+
+Please read the README file carefully.  
+Finally, note that bzip2 comes with ABSOLUTELY NO WARRANTY.
+
+Happy compressing!
+
--- a/12
+++ b/12
@ -0,0 +1,12 @@
+If you got this far and the "bzip2 -dvv"s give identical
+stored vs computed CRCs, you're probably in business.
+Complete instructions for use are in the preformatted manual page, 
+in the file bzip2.txt.
+
+You can also do "bzip2 --help" to see some helpful information. 
+"bzip2 -L" displays the software license.
+
+Please read the README file carefully.  
+Finally, note that bzip2 comes with ABSOLUTELY NO WARRANTY.
+
+Happy compressing!