LZ4 Block Format Description

Last revised: 2022-07-31 . Author : Yann Collet

This specification is intended for developers willing to produce or read LZ4 compressed data blocks using any programming language of their choice.

LZ4 is an LZ77-type compressor with a fixed byte-oriented encoding format. There is neither an entropy encoder back-end nor a framing layer. The latter is assumed to be handled by other parts of the system (see LZ4 Frame format). This design favors simplicity and speed.

This document describes only the Block Format, not how the compressor or decompressor actually work. For more details on such topics, see the later section "Implementation Notes".

Compressed block format

An LZ4 compressed block is composed of sequences. A sequence is a run of literals (uncompressed bytes), followed by a match copy operation.

Each sequence starts with a token. The token is a one-byte value, separated into two 4-bit fields. Therefore each field ranges from 0 to 15.

The first field uses the 4 high bits of the token. It provides the length of the literals to follow.

If the field value is smaller than 15, then it represents the total number of literals present in the sequence, including 0, in which case there is no literal.

The value 15 is a special case: more bytes are required to indicate the full length. Each additional byte then represents a value from 0 to 255, which is added to the previous value to produce a total length. When the byte value is 255, another byte must be read and added, and so on. There can be any number of bytes of value 255 following the token. The Block Format does not define any "size limit", though real implementations may feature some practical limits (see more details in the later section "Implementation Notes").

Note: this format explains why a non-compressible input block is expanded by about 0.4%: stored as literals, it needs one extra length byte for every 255 bytes of input.
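As a rough check of the 0.4% figure, the sketch below computes the size of a block that stores `n` bytes as a single literal-only sequence. The function name `literal_only_size` is hypothetical, and the calculation assumes the encoder emits no match at all.

```python
def literal_only_size(n: int) -> int:
    """Size in bytes of a block holding n bytes as one literal-only sequence.

    Hypothetical helper, for illustration only.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    # Lengths below 15 fit in the token; otherwise one extra byte is needed
    # for each full 255 of (n - 15), plus one terminating byte below 255.
    extra_length_bytes = 0 if n < 15 else (n - 15) // 255 + 1
    return 1 + extra_length_bytes + n  # token + length bytes + literals
```

For 1,000,000 incompressible bytes this gives 1,003,923 bytes, an overhead of roughly 0.39%.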

Example 1: A literal length of 48 will be represented as:

  • 15 : value for the 4-bit high field
  • 33 : (=48-15) remaining length to reach 48

Example 2: A literal length of 280 will be represented as:

  • 15 : value for the 4-bit high field
  • 255 : following byte is maxed, since 280-15 >= 255
  • 10 : (=280 - 15 - 255) remaining length to reach 280

Example 3: A literal length of 15 will be represented as:

  • 15 : value for the 4-bit high field
  • 0 : (=15-15) yes, the zero must be output
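The three examples above can be reproduced with a short sketch. The function name `encode_literal_length` is an assumption made for illustration; it returns the 4-bit field value and the additional length bytes that follow the token.

```python
def encode_literal_length(length: int):
    """Return (field_value, extra_bytes) for a literal length.

    Hypothetical helper: field_value goes in the high 4 bits of the token,
    extra_bytes are written right after the token.
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    if length < 15:
        return length, b""
    remaining = length - 15
    extra = bytearray()
    while remaining >= 255:
        extra.append(255)      # a 255 byte announces that another byte follows
        remaining -= 255
    extra.append(remaining)    # final byte is below 255, and may be 0
    return 15, bytes(extra)
```

Calling it with 48, 280 and 15 yields the extra bytes `[33]`, `[255, 10]` and `[0]` respectively, each with a field value of 15.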

Following the token and optional length bytes are the literals themselves. Their count is exactly the literal length just decoded. Reminder: it's possible that there are zero literals.

Following the literals is the match copy operation.

It starts with the offset value. This is a 2-byte value, in little-endian format (the 1st byte is the "low" byte, the 2nd one is the "high" byte).

The offset represents the position of the match to be copied from the past. For example, 1 means "current position - 1 byte". The maximum offset value is 65535. 65536 and beyond cannot be coded. Note that 0 is an invalid offset value. The presence of a 0 offset value denotes an invalid (corrupted) block.
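A minimal sketch of reading the offset field follows. The name `read_offset` and the choice to raise `ValueError` on a zero offset are assumptions, not something the format mandates.

```python
def read_offset(block: bytes, pos: int) -> int:
    """Read the 2-byte little-endian offset stored at block[pos:pos + 2].

    Hypothetical helper, for illustration only.
    """
    if pos < 0 or pos + 2 > len(block):
        raise ValueError("truncated offset field")
    offset = block[pos] | (block[pos + 1] << 8)  # low byte first, then high byte
    if offset == 0:
        raise ValueError("offset 0 denotes a corrupted block")
    return offset
```

The bytes `01 00` decode to 1 ("current position - 1 byte"), and `FF FF` decodes to the maximum offset, 65535.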

Then the match length can be extracted. For this, we use the second token field, the low 4 bits, which ranges from 0 to 15. Here, however, 0 means that the copy operation is minimal. The minimum length of a match, called minmatch, is 4, so a value of 0 means 4 bytes. As with the literal length, any value smaller than 15 represents a length to which 4 (minmatch) must be added, giving a range of 4 to 18. A value of 15 is special and means 19 or more bytes: additional bytes must then be read, one at a time, each with a value from 0 to 255, and added to the total to produce the final match length. A value of 255 means there is another byte to read and add. There is no limit to the number of optional 255 bytes that can be present, and therefore no limit to the representable match length, though real-life implementations are likely to enforce limits for practical reasons (see more details in the "Implementation Notes" section below).

Note: this format has a maximum achievable compression ratio of about 250.
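The order of magnitude of this ratio can be checked with a little arithmetic. The sketch below, using the hypothetical name `match_sequence_size`, computes the compressed cost of one sequence that carries no literals and a single match. It ignores the literals every block must also contain, so it slightly overestimates what a real block achieves.

```python
def match_sequence_size(match_length: int) -> int:
    """Compressed size of a sequence with zero literals and one match.

    Hypothetical helper, for illustration only.
    """
    if match_length < 4:
        raise ValueError("a match is at least 4 bytes long (minmatch)")
    field = match_length - 4                 # value encoded after removing minmatch
    extra_length_bytes = 0 if field < 15 else (field - 15) // 255 + 1
    return 1 + 2 + extra_length_bytes        # token + offset + extra length bytes
```

A 1,000,000-byte match costs 3,925 bytes, since each extra length byte covers at most 255 matched bytes. The ratio therefore stays just under 255 for the match alone, which is consistent with the figure quoted above once the mandatory literals are counted.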

Decoding the match length completes the current sequence. The next byte is the start of another sequence, and therefore a new token.
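Putting the fields together, a block decoder can be sketched as follows. This is an illustrative, unoptimized sketch and not the reference implementation: the name `decode_block`, the `max_output` parameter and the error messages are assumptions. It treats a block that ends right after the literals of a sequence as complete.

```python
def decode_block(block: bytes, max_output: int) -> bytes:
    """Decode one LZ4 block, raising ValueError on malformed input."""
    out = bytearray()
    pos = 0
    end = len(block)

    def read_extra(length: int) -> int:
        # Add extension bytes to length; a byte of 255 means one more follows.
        nonlocal pos
        while True:
            if pos >= end:
                raise ValueError("truncated length field")
            byte = block[pos]
            pos += 1
            length += byte
            if byte != 255:
                return length

    while True:
        if pos >= end:
            raise ValueError("missing token")
        token = block[pos]
        pos += 1

        # High 4 bits: literal length, possibly extended.
        literal_length = token >> 4
        if literal_length == 15:
            literal_length = read_extra(literal_length)
        if literal_length > end - pos:
            raise ValueError("literals run past the end of the block")
        if literal_length > max_output - len(out):
            raise ValueError("output limit exceeded")
        out += block[pos:pos + literal_length]
        pos += literal_length

        if pos == end:
            return bytes(out)  # last sequence: literals only, no offset

        # 2-byte little-endian offset.
        if end - pos < 2:
            raise ValueError("truncated offset field")
        offset = block[pos] | (block[pos + 1] << 8)
        pos += 2
        if offset == 0 or offset > len(out):
            raise ValueError("invalid offset")

        # Low 4 bits: match length minus minmatch, possibly extended.
        match_length = token & 0x0F
        if match_length == 15:
            match_length = read_extra(match_length)
        match_length += 4
        if match_length > max_output - len(out):
            raise ValueError("output limit exceeded")

        # Byte-by-byte copy so that overlapping matches work.
        start = len(out) - offset
        for i in range(match_length):
            out.append(out[start + i])
```

For example, the 10-byte block `17 61 01 00 50 62 63 64 65 66` decodes to twelve `a` bytes followed by `bcdef`: one literal `a`, a match of length 11 at offset 1, then five trailing literals.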

End of block conditions

There are specific restrictions required to terminate an LZ4 block.

  1. The last sequence contains only literals. The block ends right after the literals (no offset field).
  2. The last 5 bytes of input are always literals. Therefore, the last sequence contains at least 5 bytes.
    • Special: if the input is smaller than 5 bytes, there is only one sequence, which contains the whole input as literals. Even empty input can be represented, using a zero byte, interpreted as a final token without literal and without a match.
  3. The last match must start at least 12 bytes before the end of block. The last match is part of the penultimate sequence. It is followed by the last sequence, which contains only literals.
    • Note that, as a consequence, blocks < 12 bytes cannot be compressed. By extension, independent blocks < 13 bytes cannot be compressed, because they must start with at least one literal, which the match can then copy.

When a block does not respect these end conditions, a conformant decoder is allowed to reject the block as incorrect.

These rules are in place to ensure compatibility with a wide range of historical decoders which rely on these conditions for their speed-oriented design.
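The simplest block that satisfies all of these conditions is one made of a single literal-only sequence, which is also the only option for very small inputs. The sketch below, with the hypothetical name `encode_literals_only`, builds such a block. It performs no compression at all.

```python
def encode_literals_only(data: bytes) -> bytes:
    """Encode data as one LZ4 block made of a single literal-only sequence.

    Hypothetical helper, for illustration only.
    """
    n = len(data)
    out = bytearray()
    if n < 15:
        out.append(n << 4)          # literal length fits in the high 4 bits
    else:
        out.append(0xF0)            # 15 in the high field: length bytes follow
        remaining = n - 15
        while remaining >= 255:
            out.append(255)
            remaining -= 255
        out.append(remaining)
    out += data                     # the block ends right after the literals
    return bytes(out)
```

An empty input produces the single byte `00`, and the 3-byte input `abc` produces `30 61 62 63`.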

Implementation notes

The LZ4 Block Format only defines the compressed format. It does not describe how to create a decoder or an encoder, whose design is left to the imagination of the implementer.

However, thanks to experience, there are a number of typical topics that most implementations will have to consider. This section tries to provide a few guidelines.

Metadata

An LZ4-compressed Block requires additional metadata for proper decoding. Typically, a decoder will require the compressed block's size, and an upper bound of decompressed size. Other variants exist, such as knowing the decompressed size, and having an upper bound of the input size. The Block Format does not specify how to transmit such information, which is considered an out-of-band information channel. That's because in many cases, the information is present in the environment. For example, databases must store the size of their compressed block for indexing, and know that their decompressed block can't be larger than a certain threshold.

If you need a format which is "self-contained", and also transports the necessary metadata for proper decoding on any platform, consider employing the LZ4 Frame format instead.

Large lengths

While the Block Format does not define any maximum value for length fields, in practice, most implementations will feature some form of limit, since it's expected for such values to be stored into registers of fixed bit width.

If length fields use 64-bit registers, then it can be assumed that there is no practical limit, as it would require a single continuous block of multiple petabytes to reach it, which is unreasonable by today's standard.

If length fields use 32-bit registers, then they can overflow, but this requires a compressed block of size > 16 MB. Therefore, implementations that do not deal with compressed blocks > 16 MB are safe. However, if such a case is allowed, then it's recommended to check that no large length overflows the register.

If length fields use 16-bit registers, then it's definitely possible to overflow such a register, with fewer than 300 bytes of compressed data.

A conformant decoder should be able to detect length overflows whenever they are possible, and simply error out when that happens. The input block is not necessarily invalid; it is just not decodable by the local decoder implementation.
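One way to detect such overflows is to compare the running length against the largest value the register can hold after each added byte. The sketch below is only an illustration: `read_length` and its `max_length` parameter are hypothetical, and since Python integers do not overflow, `max_length` stands in for the register width (for example `0xFFFF` for 16 bits).

```python
def read_length(block: bytes, pos: int, field: int, max_length: int):
    """Decode a length whose 4-bit field value is `field`.

    Returns (length, new_pos). Raises OverflowError when the length exceeds
    max_length. Hypothetical helper, for illustration only.
    """
    length = field
    if field == 15:
        while True:
            if pos >= len(block):
                raise ValueError("truncated length field")
            byte = block[pos]
            pos += 1
            length += byte
            if length > max_length:
                raise OverflowError("length not representable by this decoder")
            if byte != 255:
                break
    return length, pos
```

With `max_length = 0xFFFF`, a run of 257 bytes of value 255 is enough to trigger the error, since 15 + 257 × 255 = 65550.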

Note that, in order to be compatible with the larger LZ4 ecosystem, it's recommended to be able to read and represent lengths of up to 4 MB, and to accept blocks of size up to 4 MB. Such limits are compatible with 32-bit length registers, and prevent overflow of 32-bit registers.

Safe decoding

If a decoder receives compressed data from any external source, it is recommended to ensure that the decoder is resilient to corrupted input, and made safe from buffer overflow manipulations. Always ensure that read and write operations remain within the limits of provided buffers.

Of particular importance, ensure that the number of bytes to copy overflows neither the input nor the output buffer. When reading an offset value, also ensure that the resulting match position does not reach beyond the beginning of the buffer. Such a situation can happen during the first 64 KB of decoded data.
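These checks can be written as small validation helpers called before each copy. The names `check_literals` and `check_match` and their parameters are hypothetical. The sketch assumes a single contiguous output buffer whose first byte is at position 0.

```python
def check_literals(in_pos: int, in_size: int,
                   out_pos: int, out_capacity: int,
                   literal_length: int) -> None:
    """Reject a literal copy that would leave the input or output buffer."""
    if literal_length > in_size - in_pos:
        raise ValueError("literals read past the end of the input buffer")
    if literal_length > out_capacity - out_pos:
        raise ValueError("literals write past the end of the output buffer")


def check_match(out_pos: int, out_capacity: int,
                offset: int, match_length: int) -> None:
    """Reject a match copy that would leave the output buffer."""
    if offset == 0:
        raise ValueError("offset 0 is invalid")
    if offset > out_pos:
        raise ValueError("match starts before the beginning of the buffer")
    if match_length > out_capacity - out_pos:
        raise ValueError("match writes past the end of the output buffer")
```

Comparing lengths against the remaining space, rather than adding them to the current position first, avoids a second overflow in languages with fixed-width integers.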

For more safety, test the decoder with fuzzers to ensure it's resilient to improbable sequences of conditions. Combine them with sanitizers, in order to catch overflows (asan) or initialization issues (msan).

Pay attention to the offset 0 scenario, which is invalid, and therefore must not be blindly decoded: a naive implementation could preserve destination buffer content, which could then result in information disclosure if such buffer was uninitialized and still containing private data. For reference, in such a scenario, the reference LZ4 decoder clears the match segment with 0 bytes, though other solutions are certainly possible.

Finally, pay attention to the "overlap match" scenario, in which matchlength is larger than offset. In that case, since match_pos + matchlength > current_pos, some of the later bytes to copy do not exist yet, and will be generated during the early stage of the match copy operation. This scenario must be handled with special care. A common case is an offset of 1, meaning the last byte is repeated matchlength times.
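A straightforward way to handle overlapping matches is to copy one byte at a time, so that each byte is written before a later step needs to read it. The sketch below uses the hypothetical name `copy_match` and favors clarity over speed.

```python
def copy_match(out: bytearray, offset: int, match_length: int) -> None:
    """Append match_length bytes to out, starting offset bytes back.

    Hypothetical helper, for illustration only. Works when
    match_length > offset because bytes are copied one at a time.
    """
    if offset == 0 or offset > len(out):
        raise ValueError("invalid offset")
    start = len(out) - offset
    for i in range(match_length):
        # When i >= offset, out[start + i] was appended earlier in this loop.
        out.append(out[start + i])
```

With `out = bytearray(b"x")`, calling `copy_match(out, 1, 4)` leaves `out` equal to `b"xxxxx"`. A single slice copy such as `out += out[start:start + match_length]` would not work here, because the slice stops at the current end of the buffer and yields only `offset` bytes.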

Compression techniques

The core of an LZ4 compressor is to detect duplicated data across the past 64 KB. The format places no assumptions or limits on the way a compressor searches for and selects matches within the source data block. For example, an upper compression limit can be reached using a technique called "full optimal parsing", at high CPU and memory cost. But multiple other techniques can be considered, featuring distinct time / performance trade-offs. As long as the specified format is respected, the result will be compatible with and decodable by any compliant decoder.