mirror of
https://github.com/lz4/lz4.git
synced 2024-11-27 20:04:17 +08:00
updated block format documentation
to specifically call attention to offset==0 scenario, which is invalid, and could lead to some form of security issue if incorrectly dealt with (ignored).
parent 7601d57f87
commit 1da033e601
@ -1,6 +1,6 @@
LZ4 Block Format Description
============================
Last revised: 2019-03-30.
Last revised: 2022-02-02.
Author : Yann Collet


@ -42,8 +42,9 @@ If the field value is 0, then there is no literal.
If it is 15, then we need to add some more bytes to indicate the full length.
Each additional byte then represents a value from 0 to 255,
which is added to the previous value to produce a total length.
When the byte value is 255, another byte is output.
There can be any number of bytes following `token`. There is no "size limit".
When the byte value is 255, another byte must be read and added, and so on.
There can be any number of bytes of value "255" following `token`.
There is no "size limit".
(Side note : this is why a not-compressible input block is expanded by 0.4%).
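
The length rule described above can be sketched in Python as follows. This is only an illustration, not the reference implementation: the helper name `read_length` and its parameters are hypothetical. It takes the 4-bit field value from the token plus the bytes that follow it, and returns the total length together with the position of the next unread byte.

```python
def read_length(nibble, data, pos):
    """Decode a length field (hypothetical helper, not part of the LZ4 API).

    nibble: 4-bit value taken from the token (0..15).
    data:   bytes of the compressed block.
    pos:    index of the first byte following the token.
    Returns (length, new_pos).
    """
    length = nibble
    if nibble == 15:
        # 15 signals that extra bytes follow; keep adding until one is < 255.
        while True:
            if pos >= len(data):
                raise ValueError("truncated length field")
            extra = data[pos]
            pos += 1
            length += extra
            if extra != 255:
                break
    return length, pos


# Following the rule, a length of 48 is stored as 15 in the token,
# followed by a single byte of value 33 (15 + 33 = 48).
print(read_length(15, bytes([33]), 0))  # (48, 1)
```

The bound check on `pos` is an addition of this sketch: since the format sets no limit on the number of "255" bytes, a decoder reading untrusted input has to stop at the end of its input buffer by itself.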

Example 1 : A literal length of 48 will be represented as :
@ -74,22 +75,23 @@ This is a 2 bytes value, in little endian format
(the 1st byte is the "low" byte, the 2nd one is the "high" byte).

The `offset` represents the position of the match to be copied from.
1 means "current position - 1 byte".
The maximum `offset` value is 65535, 65536 cannot be coded.
Note that 0 is an invalid value, not used.
For example, 1 means "current position - 1 byte".
The maximum `offset` value is 65535, 65536 and beyond cannot be coded.
Note that 0 is an invalid offset value.
The presence of such a value denotes an invalid (corrupted) block.
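
A minimal sketch of reading the `offset` field, assuming the decoder chooses to reject a block as soon as it meets offset 0. The helper name `read_offset` is hypothetical, and rejecting is only one possible policy.

```python
def read_offset(data, pos):
    """Read the 2-byte little-endian offset (hypothetical helper).

    Returns (offset, new_pos). Raises ValueError on a truncated field
    or on offset 0, which denotes a corrupted block.
    """
    if pos + 2 > len(data):
        raise ValueError("truncated offset field")
    # Low byte first, then high byte.
    offset = data[pos] | (data[pos + 1] << 8)
    if offset == 0:
        raise ValueError("invalid offset 0: corrupted block")
    return offset, pos + 2


print(read_offset(bytes([0x01, 0x00]), 0))  # (1, 2): "current position - 1 byte"
print(read_offset(bytes([0xFF, 0xFF]), 0))  # (65535, 2): the maximum offset
```

Because the field is exactly 16 bits wide, values of 65536 and beyond simply have no representation, so no upper-bound check is needed on the raw value. A complete decoder would still need to verify that the offset does not point before the start of the available data.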

Then we need to extract the `matchlength`.
Then the `matchlength` can be extracted.
For this, we use the second token field, the low 4-bits.
Value, obviously, ranges from 0 to 15.
Such a value, obviously, ranges from 0 to 15.
However here, 0 means that the copy operation will be minimal.
The minimum length of a match, called `minmatch`, is 4.
As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes.
Similar to literal length, on reaching the highest possible value (15),
we output additional bytes, one at a time, with values ranging from 0 to 255.
one must read additional bytes, one at a time, with values ranging from 0 to 255.
They are added to the total to provide the final match length.
A 255 value means there is another byte to read and add.
There is no limit to the number of optional bytes that can be output this way.
(This points towards a maximum achievable compression ratio of about 250).
There is no limit to the number of optional "255" bytes that can be present.
(Note: this points towards a maximum achievable compression ratio of about 250).
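
As a worked illustration of the arithmetic above, the sketch below computes a match length from the low 4 bits of the token and the optional bytes that follow the offset. The names `match_length` and `MINMATCH` are chosen for this example only.

```python
MINMATCH = 4  # minimum match length defined by the format


def match_length(nibble, extra_bytes=()):
    """Return the match length encoded by the low 4 bits plus optional bytes.

    Hypothetical helper: `extra_bytes` holds the bytes read after the offset
    when the field is 15. All of them are 255 except the last one.
    """
    if nibble < 15:
        if extra_bytes:
            raise ValueError("extra bytes are only present when the field is 15")
        return nibble + MINMATCH
    if (not extra_bytes or extra_bytes[-1] == 255
            or any(b != 255 for b in extra_bytes[:-1])):
        raise ValueError("expected a run of 255 bytes ended by a byte < 255")
    return 15 + MINMATCH + sum(extra_bytes)


print(match_length(0))              # 4: the minimal copy
print(match_length(15, [0]))        # 19
print(match_length(15, [255, 3]))   # 277
```

Each optional byte of value 255 costs one byte of compressed input and adds 255 bytes of decoded output, so the ratio of a very long match tends towards 255, which is consistent with the "about 250" figure quoted above.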

Decoding the `matchlength` reaches the end of current sequence.
Next byte will be the start of another sequence.
@ -97,9 +99,9 @@ But before moving to next sequence,
it's time to use the decoded match position and length.
The decoder copies `matchlength` bytes from match position to current position.

In some cases, `matchlength` is larger than `offset`.
Therefore, `match_pos + matchlength > current_pos`,
which means that later bytes to copy are not yet decoded.
In some cases, `matchlength` can be larger than `offset`.
Therefore, since `match_pos + matchlength > current_pos`,
later bytes to copy are not decoded yet.
This is called an "overlap match", and must be handled with special care.
A common case is an offset of 1,
meaning the last byte is repeated `matchlength` times.
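
One simple way to handle an overlap match is to copy one byte at a time, so that bytes written early in the copy become valid sources for the later part. The sketch below assumes the decoded output is held in a `bytearray`; the helper name `copy_match` is hypothetical, and real decoders typically use faster strategies.

```python
def copy_match(out, offset, matchlength):
    """Append `matchlength` bytes to `out`, copied from `offset` bytes back.

    Hypothetical helper. `out` is a bytearray holding the bytes decoded so far.
    """
    if offset == 0 or offset > len(out):
        raise ValueError("offset points outside the decoded data")
    match_pos = len(out) - offset
    for i in range(matchlength):
        # When matchlength > offset, this reads bytes appended by
        # earlier iterations of this same loop.
        out.append(out[match_pos + i])


buf = bytearray(b"ab")
copy_match(buf, 1, 5)    # offset 1: the last byte is repeated 5 times
print(buf)               # bytearray(b'abbbbbb')

buf = bytearray(b"abc")
copy_match(buf, 3, 7)    # offset 3, length 7: the pattern "abc" repeats
print(buf)               # bytearray(b'abcabcabca')
```

A single slice copy such as `out += out[match_pos:match_pos + matchlength]` would not work here, because the slice is taken before any new byte is written and so stops at the current end of the output.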
@ -107,7 +109,7 @@ meaning the last byte is repeated `matchlength` times.

End of block restrictions
-----------------------
There are specific rules required to terminate a block.
There are specific restrictions required to terminate an LZ4 block.

1. The last sequence contains only literals.
The block ends right after them.
@ -124,33 +126,42 @@ There are specific rules required to terminate a block.
an independent block < 13 bytes cannot be compressed,
because the match must copy "something",
so it needs at least one prior byte.
- When a block can reference data from another block,
- However, when a block can reference data from another block,
it can start immediately with a match and no literal,
so a block of 12 bytes can be compressed.
therefore a block of exactly 12 bytes can be compressed.

When a block does not respect these end conditions,
a conformant decoder is allowed to reject the block as incorrect.

These rules are in place to ensure that a conformant decoder
can be designed for speed, issuing speculatively instructions,
while never reading nor writing beyond provided I/O buffers.

These rules are in place to ensure compatibility with
a wide range of historical decoders
which rely on these conditions in their speed-oriented design.

Additional notes
-----------------------
If the decoder will decompress data from an external source,
it is recommended to ensure that the decoder will not be vulnerable to
buffer overflow manipulations.
If the decoder will decompress data from any external source,
it is recommended to ensure that the decoder is resilient to corrupted data,
and typically not vulnerable to buffer overflow manipulations.
Always ensure that read and write operations
remain within the limits of provided buffers.
Test the decoder with fuzzers
to ensure it's resilient to improbable combinations.
to ensure it's resilient to improbable sequences of conditions.
Combine them with sanitizers, in order to catch overflows (asan)
or initialization issues (msan).
Pay some attention to offset 0 scenario, which is invalid,
and therefore must not be blindly decoded
(a naive implementation could preserve destination buffer content,
which could then result in information disclosure
if such buffer was uninitialized and still containing private data).
For reference, in such a scenario, the reference LZ4 decoder
clears the match segment with `0` bytes,
though other solutions are certainly possible.
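
To make these recommendations concrete, here is a sketch of a defensive decoder for a single independent block. It is not the reference decoder: the function name `decode_block` and the `max_out` parameter are assumptions of this example. It checks every read against the input size and every write against an output limit, and it rejects offset 0 outright, which is one of the possible alternatives to clearing the match segment. Apart from requiring that the block ends with a literals-only sequence, it does not enforce the end of block restrictions.

```python
def decode_block(src, max_out):
    """Decode one independent LZ4 block defensively (illustrative sketch).

    src:     bytes of the compressed block.
    max_out: largest output size the caller accepts.
    Raises ValueError on malformed input instead of going out of bounds.
    """
    out = bytearray()
    pos = 0
    n = len(src)

    def read_extra(length, pos):
        # Add bytes to `length` until one is < 255, checking the input bound.
        while True:
            if pos >= n:
                raise ValueError("truncated length")
            b = src[pos]
            pos += 1
            length += b
            if b != 255:
                return length, pos

    while True:
        if pos >= n:
            raise ValueError("missing token")
        token = src[pos]
        pos += 1

        # Literals: length comes from the high 4 bits of the token.
        lit_len = token >> 4
        if lit_len == 15:
            lit_len, pos = read_extra(lit_len, pos)
        if pos + lit_len > n:
            raise ValueError("literals run past end of input")
        if len(out) + lit_len > max_out:
            raise ValueError("output limit exceeded")
        out += src[pos:pos + lit_len]
        pos += lit_len

        if pos == n:
            return bytes(out)  # last sequence: literals only

        # Match: 2-byte little-endian offset, then length from the low 4 bits.
        if pos + 2 > n:
            raise ValueError("truncated offset")
        offset = src[pos] | (src[pos + 1] << 8)
        pos += 2
        if offset == 0:
            raise ValueError("offset 0 is invalid")  # never decoded blindly
        if offset > len(out):
            raise ValueError("offset points before start of output")

        match_len = token & 0x0F
        if match_len == 15:
            match_len, pos = read_extra(match_len, pos)
        match_len += 4  # minmatch
        if len(out) + match_len > max_out:
            raise ValueError("output limit exceeded")
        start = len(out) - offset
        for i in range(match_len):  # byte-wise copy handles overlap matches
            out.append(out[start + i])


# One literal "a", a match of length 8 at offset 1, then 5 final literals.
block = bytes([0x14, 0x61, 0x01, 0x00, 0x50]) + b"bcdef"
print(decode_block(block, 64))  # b'aaaaaaaaabcdef'
```

Because every failure path raises the same exception type, a fuzzing harness only has to treat `ValueError` as an expected outcome and flag anything else.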

The format makes no assumption nor limits to the way the compressor
The format makes no assumption nor limits to the way a compressor
searches and selects matches within the source data block.
Multiple techniques can be considered,
featuring distinct time / performance trade offs.
As long as the format is respected,
For example, an upper compression limit can be reached,
using a technique called "full optimal parsing", at very high cpu cost.
As long as the specified format is respected,
the result will be compatible and decodable by any compliant decoder.
An upper compression limit can be reached,
using a technique called "full optimal parsing", at high cpu cost.