Description of the NTFS (de)compression algorithm (based on a modified LZ77
algorithm)

Copyright (c) 2001 Anton Altaparmakov

This document is published under the GNU General Public License.

Credits: This is based on notes taken from various places (most notably from
Regis Duchesne's NTFS documentation and from various LZ77 descriptions) and
further refined by looking at a few compressed streams to figure out some
uncertainties.

Note: You should also read the runlist description with regard to compression
in linux-ntfs/include/layout.h. Just search for "Attribute compression".
FIXME: Should merge the info from there into this document some time.

Compressed data is organized in logical "compression" blocks (cb). Each cb has
a size (cb_size) of 2^compression_unit clusters. In all versions of Windows
NTFS (NT/2k/XP, NTFS 1.2-3.1), the only valid compression_unit is 4, IOW, each
cb is 2^4 = 16 clusters in size.
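
To make the arithmetic concrete, here is a minimal sketch (the helper name is
made up for illustration, not taken from the driver) of how the cb size in
bytes follows from the cluster size and the compression unit:

	#include <stdint.h>

	/*
	 * Hypothetical helper: the cb size is 2^compression_unit
	 * clusters. With compression_unit = 4 and 4096 byte clusters
	 * this yields 16 * 4096 bytes = 64kiB.
	 */
	static uint32_t cb_size_bytes(uint32_t cluster_size,
			unsigned int compression_unit)
	{
		return cluster_size << compression_unit;
	}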

We detect and warn about a compression_unit != 4 but we try to decompress the
data anyway.

Compression is only supported for cluster sizes between 512 bytes and 4096
bytes. Thus a cb can be between 8kiB and 64kiB in size.

Each cb is independent of the other cbs and is thus the minimal unit we have
to parse even if we wanted to decompress only one byte.

Also, a cb can be totally uncompressed or it can be totally sparse, and either
case would be indicated in the runlist.

Thus, we need to look at the runlist of the compressed data stream, starting
at the beginning of the first cb overlapping @page. So we convert the page
offset into units of clusters (vcn) and round the vcn down to a multiple of
cb_size clusters.
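
As a sketch of that calculation (names and types are illustrative, not the
driver's), assuming the usual 16 cluster cbs and power-of-two cluster sizes:

	#include <stdint.h>

	typedef int64_t VCN;	/* virtual cluster number */

	/*
	 * Illustrative only: the vcn of the first cb overlapping byte
	 * offset @ofs, for clusters of 2^cluster_size_bits bytes and
	 * cbs of 16 clusters. As 16 is a power of two, rounding the
	 * vcn down to a multiple of cb_size is a simple mask.
	 */
	static VCN first_cb_vcn(int64_t ofs, unsigned int cluster_size_bits)
	{
		VCN vcn = ofs >> cluster_size_bits;

		return vcn & ~(VCN)15;
	}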

We then scan the runlist for the appropriate position. Based on what we find
there, we decide how to proceed.

If the cb is not compressed at all, and covers the whole of @page, we pretend
to be accessing an uncompressed file, so we fall back to what we do in
aops.c::ntfs_file_readpage(), i.e. we do:

	return block_read_full_page(page, ntfs_file_get_block);

If the cb is completely sparse, and covers the whole of @page, we can just
zero out @page and complete the io (set @page up-to-date, unlock it, and
finally return 0).

In all other cases we initiate the decompression engine, but first some more
on the compression algorithm.

Before compression, the data of each cb is further divided into 4kiB blocks,
which we call "sub compression" blocks (sb), each including a header
specifying its compressed length. So we could just scan the cb for the first
sb overlapping @page and skip the sbs before that, or we could decompress the
whole cb, injecting the superfluous decompressed pages into the page cache as
a form of read-ahead (this is what zisofs does, for example).

In either case, we then need to read and decompress all sbs overlapping @page,
potentially having to decompress one or more other cbs, too.

As soon as @page is completed we could either stop or continue until we finish
the current cb, injecting pages as we go along (again following the zisofs
example).

Because the sbs follow each other directly, we need to read in the whole cb
anyway in order to scan through it and find the first sb overlapping @page, so
it does make sense to follow the zisofs approach of decompressing the whole cb
and injecting pages as we go along. All discussion from now on will assume
that we are going to do that. It might, however, make sense not to decompress
any sbs located before @page, because this would be a kind of "read-behind",
which is probably silly unless someone is reading the file backwards.
Performing read-ahead by decompressing all sbs following @page, OTOH, is very
likely to be a good idea.

So, we read the whole cb from disk and start at the first sb.

As mentioned above, each sb starts with a header. The header is 16 bits, of
which the lower twelve bits (i.e. bits 0 to 11) are the length (L) - 3 of the
sb, where L includes the two bytes of the header itself (equivalently, the
stored value is the data length minus 1 when not counting the two header
bytes). The higher four bits are set to 1011 (0xb) by the compressor for a
compressed block, or to 0000 for an uncompressed block, but the decompressor
only checks the most significant bit, taking a 1 to signify a compressed block
and a 0 an uncompressed block.
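
As a sketch, decoding the header could look like this (hypothetical helper;
the header is assumed to have already been converted from little endian, see
the note at the end of this document):

	#include <stdint.h>

	/*
	 * Illustrative only: return the total size L in bytes of the
	 * sb described by @header, including the two header bytes,
	 * and report whether the sb is compressed.
	 */
	static unsigned int sb_length(uint16_t header, int *compressed)
	{
		*compressed = (header & 0x8000) != 0;	/* check MSB only */

		return (header & 0x0fff) + 3;	/* low 12 bits hold L - 3 */
	}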

So from the header we know how many compressed bytes we need to decompress to
obtain the next 4kiB of uncompressed data, and if we didn't want to decompress
this sb, we could just seek to the next one using the length read from the
header. We could then continue seeking until we reach the first sb overlapping
@page.
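
A sketch of such a skip loop (again with made-up names; @cb points into an
in-memory copy of the cb and @target is the wanted uncompressed byte offset
within the cb):

	#include <stdint.h>

	/*
	 * Illustrative only: skip whole sbs until the one containing
	 * uncompressed offset @target. Each sb decompresses to 4096
	 * bytes (the last sb of a cb may cover less; not handled here).
	 */
	static const uint8_t *seek_to_sb(const uint8_t *cb,
			unsigned int target)
	{
		unsigned int pos = 0;	/* uncompressed offset of current sb */

		while (target >= pos + 0x1000) {
			uint16_t header = cb[0] | (cb[1] << 8);

			cb += (header & 0x0fff) + 3;	/* advance by L bytes */
			pos += 0x1000;
		}
		return cb;
	}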

In either case, we will reach a sb which we want to decompress.

Having dealt with the 16-bit header of the sb, we now have length bytes of
compressed data to decompress. This compressed stream is further split into
tokens, which are organized into groups of eight tokens. Each token group (tg)
starts with a tag byte, an eight-bit bitmap whose bits specify the type of
each of the following eight tokens. The least significant bit (LSB)
corresponds to the first token and the most significant bit (MSB) corresponds
to the last token.

The two types of tokens are symbol tokens, specified by a zero bit, and phrase
tokens, specified by a set bit.

A symbol token (st) is a single byte and is to be taken literally and copied
into the sliding window (the decompressed data).
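
At this point the overall shape of the token loop can be sketched (names are
made up; decode_pt() is a hypothetical stand-in for phrase token handling,
which is described next):

	#include <stdint.h>

	void decode_pt(const uint8_t **src, uint8_t **dst);	/* hypothetical */

	/*
	 * Illustrative only: consume one tag byte and handle the eight
	 * tokens it describes, least significant bit first. A real
	 * decoder must also stop mid-group once the sb's input or
	 * output is exhausted.
	 */
	static void decode_token_group(const uint8_t **src, uint8_t **dst)
	{
		uint8_t tag = *(*src)++;
		int i;

		for (i = 0; i < 8; i++) {
			if (tag & (1 << i))
				decode_pt(src, dst);	/* set bit: phrase token */
			else
				*(*dst)++ = *(*src)++;	/* zero bit: literal st */
		}
	}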

A phrase token (pt) is a pointer back into the sliding window (in bytes),
together with a length (again in bytes), starting at the byte the back pointer
points to. Thus a phrase token defines a sequence of bytes in the sliding
window which needs to be copied to the current position in the sliding window
(the decompressed data stream).

Each pt consists of 2 bytes split into the back pointer (p) and the length (l),
each of variable bit width (but the sum of the widths of p and l is fixed at
16 bits). p is at least 4 bits and l is at most 12 bits.

The most significant bits contain the back pointer (p), while the least
significant bits contain the length (l).

l is actually stored as the number of bytes minus 3 (unsigned), as anything
shorter than that would be at least as long as the 2 bytes needed for the
actual pt, so no compression would be achieved.

p is stored as the positive number of bytes minus 1 (unsigned), as going zero
bytes back is meaningless.
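
In code, extracting both fields then reduces to a shift and a mask plus the
two offsets just described (illustrative sketch; l_mask and p_shift are
computed by the algorithm given further below):

	#include <stdint.h>

	/*
	 * Illustrative only: split a pt (already little endian
	 * converted) into the number of bytes to go back and the
	 * number of bytes to copy.
	 */
	static void decode_pt_fields(uint16_t pt, uint16_t l_mask,
			unsigned int p_shift, unsigned int *back,
			unsigned int *len)
	{
		*back = (pt >> p_shift) + 1;	/* p is stored as back - 1 */
		*len = (pt & l_mask) + 3;	/* l is stored as len - 3 */
	}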

Note that decompression has to occur byte by byte, as it is possible that some
of the bytes pointed to by the pt will only be generated in the sliding window
as the byte sequence pointed to by the pt is being copied into it!
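
So applying a pt has to be a byte-by-byte copy; memcpy() would give wrong
results for such overlapping ranges. A minimal sketch:

	#include <stdint.h>

	/*
	 * Illustrative only: copy @len bytes from @back bytes behind
	 * the current position @dst. With back = 1, every byte copied
	 * was itself produced by the previous iteration, which is
	 * exactly what makes run-length-like phrases possible.
	 */
	static uint8_t *copy_phrase(uint8_t *dst, unsigned int back,
			unsigned int len)
	{
		const uint8_t *src = dst - back;

		while (len--)
			*dst++ = *src++;
		return dst;
	}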

To give a concrete example: a block full of the letter A would be compressed
by storing the byte A once as a symbol token, followed by a single phrase
token with back pointer -1 (p = 0, therefore go back by -(0 + 1) bytes) and
length 4095 (l = 0xffc, therefore length 0xffc + 3 bytes).
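
Working this example through to actual bytes (under the header layout
described above, assuming the compressor uses the 0xb flag nibble): counting
from the LSB, token 1 is a symbol and token 2 a phrase, so the tag byte is
0x02. At cur_pos = 1 the widths are still p_shift = 12 and l_mask = 0xfff (see
the algorithm below), so the pt is (0 << 12) | 0xffc = 0x0ffc. That makes 4
bytes of sb data, hence L = 6 and a stored length of 6 - 3 = 3, and the whole
sb could look like:

	03 b0	sb header 0xb003: compressed, stored length 3
	02	tag byte: of the eight tokens, only the second is a pt
	41	symbol token: the letter A
	fc 0f	phrase token 0x0ffc: p = 0, l = 0xffc

Decompression stops after the pt: the sb's input, per the header, is exhausted
(and the output has reached 1 + 4095 = 4096 bytes), even though the tag byte
nominally described eight tokens.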

The widths of p and l are determined from the current position within the
decompressed data (cur_pos). We don't actually care about the widths as such,
however, but instead we want the mask (l_mask) with which to AND the pt to
obtain l, and the number of bits (p_shift) by which to right shift the pt to
obtain p. These are determined using the following algorithm:

	for (i = cur_pos, l_mask = 0xfff, p_shift = 12; i >= 0x10; i >>= 1) {
		/* Each doubling of cur_pos from 0x10 onwards takes one
		   bit away from l and gives it to p. */
		l_mask >>= 1;
		p_shift--;
	}
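
For instance, at cur_pos = 0x500 the loop above runs seven times, leaving
l_mask = 0x1f and p_shift = 5: p then occupies 11 bits and l occupies 5 bits,
so a pt at that position can encode lengths of 3 to 34 bytes and backward
distances of 1 to 2048 bytes.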

Note that, as usual in NTFS, both the sb header and each pt are stored in
little endian format.