Merge pull request #1201 from facebook/rfcUpdate

updated Zstandard frame format
Yann Collet 2018-06-22 11:53:50 -07:00 committed by GitHub
commit d70c4a5074

@ -16,7 +16,7 @@ Distribution of this document is unlimited.
### Version
0.2.7 (30/04/18)
0.2.8 (30/05/18)
Introduction
@ -27,6 +27,8 @@ that is independent of CPU type, operating system,
file system and character set, suitable for
file compression, pipe and streaming compression,
using the [Zstandard algorithm](http://www.zstandard.org).
The text of the specification assumes a basic background in programming
at the level of bits and other primitive data representations.
The data can be produced or consumed,
even for an arbitrarily long sequentially presented input data stream,
@ -39,11 +41,6 @@ for detection of data corruption.
The data format defined by this specification
does not attempt to allow random access to compressed data.
This specification is intended for use by implementers of software
to compress data into Zstandard format and/or decompress data from Zstandard format.
The text of the specification assumes a basic background in programming
at the level of bits and other primitive data representations.
Unless otherwise indicated below,
a compliant compressor must produce data sets
that conform to the specifications presented here.
@ -57,6 +54,12 @@ Whenever it does not support a parameter defined in the compressed stream,
it must produce a non-ambiguous error code and associated error message
explaining which parameter is unsupported.
This specification is intended for use by implementers of software
to compress data into Zstandard format and/or decompress data from Zstandard format.
The Zstandard format is supported by an open source reference implementation,
written in portable C, and available at : https://github.com/facebook/zstd .
### Overall conventions
In this document:
- square brackets i.e. `[` and `]` are used to indicate optional fields or parameters.
@ -92,14 +95,14 @@ Overview
Frames
------
Zstandard compressed data is made of one or more __frames__.
Each frame is independent and can be decompressed indepedently of other frames.
Each frame is independent and can be decompressed independently of other frames.
The decompressed content of multiple concatenated frames is the concatenation of
each frame decompressed content.
There are two frame formats defined by Zstandard:
Zstandard frames and Skippable frames.
Zstandard frames contain compressed data, while
skippable frames contain no data and can be used for metadata.
skippable frames contain custom user metadata.
## Zstandard frames
The structure of a single Zstandard frame is as follows:
@ -201,10 +204,10 @@ depending on local limitations.
__`Unused_bit`__
The value of this bit should be set to zero.
A decoder compliant with this specification version shall not interpret it.
It might be used in a future version,
to signal a property which is not mandatory to properly decode the frame.
A decoder compliant with this specification version shall not interpret this bit.
It might be used in any future version,
to signal a property which is transparent to the proper decoding of the frame.
An encoder compliant with this specification version must set this bit to zero.
__`Reserved_bit`__
@ -254,6 +257,9 @@ Window_Size = windowBase + windowAdd;
The minimum `Window_Size` is 1 KB.
The maximum `Window_Size` is `(1<<41) + 7*(1<<38)` bytes, which is 3.75 TB.
In general, larger `Window_Size` values tend to improve compression ratio,
but at the cost of increased memory usage.
To properly decode compressed data,
a decoder will need to allocate a buffer of at least `Window_Size` bytes.
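
To make the bounds quoted above concrete, here is a minimal Python sketch that derives `Window_Size` from a one-byte `Window_Descriptor`. The split into `Exponent` (top 5 bits) and `Mantissa` (low 3 bits) is taken from the complete specification rather than from this excerpt, and the function name `window_size` is hypothetical.

```python
def window_size(descriptor: int) -> int:
    """Return Window_Size in bytes for a one-byte Window_Descriptor."""
    if not 0 <= descriptor <= 0xFF:
        raise ValueError("Window_Descriptor is a single byte")
    exponent = descriptor >> 3       # top 5 bits (layout assumed from the full spec)
    mantissa = descriptor & 0x7      # low 3 bits
    window_base = 1 << (10 + exponent)
    window_add = (window_base // 8) * mantissa
    return window_base + window_add
```

With this layout, a descriptor of `0x00` gives the 1 KB minimum and `0xFF` gives the `(1<<41) + 7*(1<<38)` maximum.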
@ -262,8 +268,8 @@ a decoder is allowed to reject a compressed frame
which requests a memory size beyond decoder's authorized range.
For improved interoperability,
decoders are recommended to be compatible with `Window_Size <= 8 MB`,
and encoders are recommended to not request more than 8 MB.
it's recommended for decoders to support `Window_Size` of up to 8 MB,
and it's recommended for encoders not to generate frames requiring `Window_Size` larger than 8 MB.
It's merely a recommendation though,
decoders are free to support larger or lower limits,
depending on local limitations.
@ -273,7 +279,7 @@ depending on local limitations.
This is a variable size field, which contains
the ID of the dictionary required to properly decode the frame.
`Dictionary_ID` field is optional. When it's not present,
it's up to the decoder to make sure it uses the correct dictionary.
it's up to the decoder to know which dictionary to use.
`Dictionary_ID` field size is provided by `DID_Field_Size`.
`DID_Field_Size` is directly derived from value of `Dictionary_ID_flag`.
@ -286,13 +292,21 @@ It's allowed to represent a small ID (for example `13`)
with a large 4-bytes dictionary ID, even if it is less efficient.
_Reserved ranges :_
If the frame is going to be distributed in a private environment,
any dictionary ID can be used.
However, for public distribution of compressed frames using a dictionary,
the following ranges are reserved and shall not be used :
Within private environments, any `Dictionary_ID` can be used.
However, for frames and dictionaries distributed in public space,
`Dictionary_ID` must be attributed carefully.
Rules for public environment are not yet decided,
but the following ranges are reserved for some future registrar :
- low range : `<= 32767`
- high range : `>= (1 << 31)`
Outside of these ranges, any value of `Dictionary_ID`
which is both `>= 32768` and `< (1<<31)` can be used freely,
even in public environment.
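
The reserved ranges can be checked with a few comparisons. The sketch below is illustrative only: the constant and function names are hypothetical, and it assumes `Dictionary_ID` is an unsigned value of at most 4 bytes.

```python
LOW_RESERVED_MAX = 32767       # low range : <= 32767
HIGH_RESERVED_MIN = 1 << 31    # high range : >= (1 << 31)

def is_reserved_dictionary_id(dictionary_id: int) -> bool:
    """Return True when the ID falls in a range reserved for a future registrar."""
    if not 0 <= dictionary_id <= 0xFFFFFFFF:
        raise ValueError("Dictionary_ID must fit in 4 bytes")
    return dictionary_id <= LOW_RESERVED_MAX or dictionary_id >= HIGH_RESERVED_MIN
```

Any ID for which this returns `False` is both `>= 32768` and `< (1<<31)`, so it can be used freely, even in public environment.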
#### `Frame_Content_Size`
This is the original (uncompressed) size. This information is optional.
@ -365,6 +379,7 @@ There are 4 block types :
- `Reserved` - this is not a block.
This value cannot be used with current version of this specification.
If such a value is present, it is considered corrupted data.
__`Block_Size`__
@ -377,6 +392,8 @@ A block can contain any number of bytes (even zero), up to
A `Compressed_Block` has the extra restriction that `Block_Size` is always
strictly less than the decompressed size.
If this condition cannot be respected,
the block must be sent uncompressed instead (`Raw_Block`).
Compressed Blocks
@ -394,7 +411,7 @@ data in [Sequence Execution](#sequence-execution)
#### Prerequisites
To decode a compressed block, the following elements are necessary :
- Previous decoded data, up to a distance of `Window_Size`,
or all previously decoded data when `Single_Segment_flag` is set.
or beginning of the Frame, whichever is smaller.
- List of "recent offsets" from previous `Compressed_Block`.
- The previous Huffman tree, required by `Treeless_Literals_Block` type
- Previous FSE decoding tables, required by `Repeat_Mode`
@ -415,11 +432,11 @@ Literals can be stored uncompressed or compressed using Huffman prefix codes.
When compressed, an optional tree description can be present,
followed by 1 or 4 streams.
| `Literals_Section_Header` | [`Huffman_Tree_Description`] | Stream1 | [Stream2] | [Stream3] | [Stream4] |
| ------------------------- | ---------------------------- | ------- | --------- | --------- | --------- |
| `Literals_Section_Header` | [`Huffman_Tree_Description`] | [jumpTable] | Stream1 | [Stream2] | [Stream3] | [Stream4] |
| ------------------------- | ---------------------------- | ----------- | ------- | --------- | --------- | --------- |
#### `Literals_Section_Header`
### `Literals_Section_Header`
Header is in charge of describing how literals are packed.
It's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes,
@ -511,50 +528,55 @@ Both `Compressed_Size` and `Regenerated_Size` fields follow __little-endian__ co
Note: `Compressed_Size` __includes__ the size of the Huffman Tree description
_when_ it is present.
### Raw Literals Block
#### Raw Literals Block
The data in Stream1 is `Regenerated_Size` bytes long,
it contains the raw literals data to be used during [Sequence Execution].
### RLE Literals Block
#### RLE Literals Block
Stream1 consists of a single byte which should be repeated `Regenerated_Size` times
to generate the decoded literals.
### Compressed Literals Block and Treeless Literals Block
#### Compressed Literals Block and Treeless Literals Block
Both of these modes contain Huffman encoded data.
`Treeless_Literals_Block` does not have a `Huffman_Tree_Description`.
#### `Huffman_Tree_Description`
For `Treeless_Literals_Block`,
the Huffman table comes from previously compressed literals block,
or from a dictionary.
### `Huffman_Tree_Description`
This section is only present when `Literals_Block_Type` type is `Compressed_Literals_Block` (`2`).
The format of the Huffman tree description can be found at [Huffman Tree description](#huffman-tree-description).
The size of `Huffman_Tree_Description` is determined during decoding process,
it must be used to determine where streams begin.
`Total_Streams_Size = Compressed_Size - Huffman_Tree_Description_Size`.
For `Treeless_Literals_Block`,
the Huffman table comes from previously compressed literals block,
or from a dictionary.
Huffman compressed data consists of either 1 or 4 Huffman-coded streams.
### Jump Table
The Jump Table is only present when there are 4 Huffman-coded streams.
Reminder : Huffman compressed data consists of either 1 or 4 Huffman-coded streams.
If only one stream is present, it is a single bitstream occupying the entire
remaining portion of the literals block, encoded as described within
[Huffman-Coded Streams](#huffman-coded-streams).
If there are four streams, the literals section header only provides enough
information to know the decompressed and compressed sizes of all four streams _combined_.
The decompressed size of each stream is equal to `(Regenerated_Size+3)/4`,
If there are four streams, `Literals_Section_Header` only provides
enough information to know the decompressed and compressed sizes
of all four streams _combined_.
The decompressed size of _each_ stream is equal to `(Regenerated_Size+3)/4`,
except for the last stream which may be up to 3 bytes smaller,
to reach a total decompressed size as specified in `Regenerated_Size`.
The compressed size of each stream is provided explicitly:
the first 6 bytes of the compressed data consist of three 2-byte __little-endian__ fields,
The compressed size of each stream is provided explicitly in the Jump Table.
The Jump Table is 6 bytes long, and consists of three 2-byte __little-endian__ fields,
describing the compressed sizes of the first three streams.
`Stream4_Size` is computed from total `Total_Streams_Size` minus sizes of other streams.
`Stream4_Size = Total_Streams_Size - 6 - Stream1_Size - Stream2_Size - Stream3_Size`.
Note: remember that `Total_Streams_Size` can be smaller than `Compressed_Size` in header,
because `Compressed_Size` also contains `Huffman_Tree_Description_Size` when it is present.
Note: if `Stream1_Size + Stream2_Size + Stream3_Size > Total_Streams_Size`,
data is considered corrupted.
Each of these 4 bitstreams is then decoded independently as a Huffman-Coded stream,
as described at [Huffman-Coded Streams](#huffman-coded-streams)
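
The Jump Table arithmetic can be sketched as follows. This is a minimal illustration, not the reference decoder: `split_four_streams` is a hypothetical helper whose input is the `Total_Streams_Size` bytes that follow the Huffman tree description. It rejects any input where `Stream4_Size` would come out negative, which is slightly stricter than the note above because it also accounts for the 6 bytes of the Jump Table. Rejecting a `Regenerated_Size` too small to fill four streams is a choice made by this sketch.

```python
import struct

JUMP_TABLE_SIZE = 6  # three 2-byte little-endian fields

def split_four_streams(streams: bytes, regenerated_size: int):
    """Return [(compressed_bytes, decompressed_size)] for the 4 streams."""
    total_streams_size = len(streams)
    if total_streams_size < JUMP_TABLE_SIZE:
        raise ValueError("corrupted data: Jump Table is truncated")
    size1, size2, size3 = struct.unpack_from("<HHH", streams, 0)
    size4 = total_streams_size - JUMP_TABLE_SIZE - size1 - size2 - size3
    if size4 < 0:
        raise ValueError("corrupted data: stream sizes exceed Total_Streams_Size")
    per_stream = (regenerated_size + 3) // 4
    last = regenerated_size - 3 * per_stream  # last stream may be up to 3 bytes smaller
    if last < 0:
        raise ValueError("Regenerated_Size too small for 4 streams (sketch's choice)")
    result = []
    offset = JUMP_TABLE_SIZE
    sizes = (size1, size2, size3, size4)
    decoded = (per_stream, per_stream, per_stream, last)
    for compressed_size, decompressed_size in zip(sizes, decoded):
        result.append((streams[offset:offset + compressed_size], decompressed_size))
        offset += compressed_size
    return result
```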
@ -572,7 +594,7 @@ When all _sequences_ are decoded,
if there are literals left in the _literal section_,
these bytes are added at the end of the block.
This is described in more detail in [Sequence Execution](#sequence-execution)
This is described in more detail in [Sequence Execution](#sequence-execution).
The `Sequences_Section` regroups all symbols required to decode commands.
There are 3 symbol types : literals lengths, offsets and match lengths.
@ -630,15 +652,8 @@ They follow the same enumeration :
- `Predefined_Mode` : A predefined FSE distribution table is used, defined in
[default distributions](#default-distributions).
No distribution table will be present.
- `RLE_Mode` : The table description consists of a single byte.
This code will be repeated for all sequences.
- `Repeat_Mode` : The table used in the previous `Compressed_Block` with `Number_of_Sequences > 0` will be used again,
or if this is the first block, table in the dictionary will be used
No distribution table will be present.
Note that this includes `RLE_mode`, so if `Repeat_Mode` follows `RLE_Mode`, the same symbol will be repeated.
Note that this also includes `Predefined_Mode`.
If this mode is used without any previous sequence table in the frame
(or [dictionary](#dictionary-format)) to repeat, this should be treated as corruption.
- `RLE_Mode` : The table description consists of a single byte, which contains the symbol's value.
This symbol will be used for all sequences.
- `FSE_Compressed_Mode` : standard FSE compression.
A distribution table will be present.
The format of this distribution table is described in [FSE Table Description](#fse-table-description).
@ -646,6 +661,13 @@ They follow the same enumeration :
and the maximum accuracy log for the offsets table is 8.
`FSE_Compressed_Mode` must not be used when only one symbol is present,
`RLE_Mode` should be used instead (although any other mode will work).
- `Repeat_Mode` : The table used in the previous `Compressed_Block` with `Number_of_Sequences > 0` will be used again,
or if this is the first block, table in the dictionary will be used.
  Note that this includes `RLE_Mode`, so if `Repeat_Mode` follows `RLE_Mode`, the same symbol will be repeated.
It also includes `Predefined_Mode`, in which case `Repeat_Mode` will have same outcome as `Predefined_Mode`.
No distribution table will be present.
If this mode is used without any previous sequence table in the frame
(nor [dictionary](#dictionary-format)) to repeat, this should be treated as corruption.
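
The table selection logic for one symbol type can be sketched as below. The numeric values of the modes (0 to 3, in the order `Predefined_Mode`, `RLE_Mode`, `FSE_Compressed_Mode`, `Repeat_Mode`) come from the complete specification. Everything else is hypothetical: tables are treated as opaque values, and `parse_rle_symbol` and `parse_fse_table` stand for callbacks that would read the corresponding description from the block.

```python
from enum import IntEnum

class Mode(IntEnum):
    PREDEFINED = 0
    RLE = 1
    FSE_COMPRESSED = 2
    REPEAT = 3

def resolve_table(mode, previous_table, predefined_table,
                  parse_rle_symbol, parse_fse_table):
    """Return the table to use; the caller keeps it as the next 'previous' table."""
    if mode == Mode.PREDEFINED:
        return predefined_table          # no distribution table in the block
    if mode == Mode.RLE:
        return ("rle", parse_rle_symbol())   # single byte: the symbol's value
    if mode == Mode.FSE_COMPRESSED:
        return parse_fse_table()         # distribution table present in the block
    if mode == Mode.REPEAT:
        if previous_table is None:       # nothing in the frame or dictionary to repeat
            raise ValueError("corrupted data: Repeat_Mode without a previous table")
        return previous_table
    raise ValueError("unknown mode")
```

Because the returned table becomes the "previous" table of the next block, `Repeat_Mode` following `RLE_Mode` repeats the same symbol, and `Repeat_Mode` following `Predefined_Mode` has the same outcome as `Predefined_Mode`.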
#### The codes for literals lengths, match lengths, and offsets.
@ -718,7 +740,7 @@ Offset codes are values ranging from `0` to `N`.
A decoder is free to limit its maximum `N` supported.
Recommendation is to support at least up to `22`.
For information, at the time of this writing,
the reference decoder supports a maximum `N` value of `31` in 64-bits mode.
the reference decoder supports a maximum `N` value of `31`.
An offset code is also the number of additional bits to read in __little-endian__ fashion,
and can be translated into an `Offset_Value` using the following formulas :
@ -727,7 +749,8 @@ and can be translated into an `Offset_Value` using the following formulas :
Offset_Value = (1 << offsetCode) + readNBits(offsetCode);
if (Offset_Value > 3) offset = Offset_Value - 3;
```
It means that maximum `Offset_Value` is `(2^(N+1))-1` and it supports back-reference distance up to `(2^(N+1))-4`
It means that maximum `Offset_Value` is `(2^(N+1))-1`
supporting back-reference distances up to `(2^(N+1))-4`,
but is limited by [maximum back-reference distance](#window_descriptor).
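
As a quick check of these bounds, the following Python sketch mirrors the formula for `Offset_Value` and computes the limits for a given maximum offset code `N`. The function names are hypothetical, and `extra_bits` stands for the value that `readNBits(offsetCode)` would return.

```python
def offset_value(offset_code: int, extra_bits: int) -> int:
    """Offset_Value = (1 << offsetCode) + readNBits(offsetCode)."""
    if not 0 <= extra_bits < (1 << offset_code):
        raise ValueError("extra_bits must fit in offset_code bits")
    return (1 << offset_code) + extra_bits

def max_offset_value(n: int) -> int:
    """Largest Offset_Value reachable with offset codes from 0 to n."""
    return (1 << (n + 1)) - 1

def max_back_reference_distance(n: int) -> int:
    """Values 1 to 3 are repeat codes, so real offsets are Offset_Value - 3."""
    return max_offset_value(n) - 3
```

For the recommended minimum of `N = 22`, this gives a maximum `Offset_Value` of `8388607` and a maximum back-reference distance of `8388604` bytes, before the window limit applies.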
`Offset_Value` from 1 to 3 are special : they define "repeat codes".
@ -878,7 +901,8 @@ so an `offset_value` of 1 means `Repeated_Offset2`,
an `offset_value` of 2 means `Repeated_Offset3`,
and an `offset_value` of 3 means `Repeated_Offset1 - 1_byte`.
For the first block, the starting offset history is populated with the following values : 1, 4 and 8 (in order),
For the first block, the starting offset history is populated with the following values :
`Repeated_Offset1`=1, `Repeated_Offset2`=4, `Repeated_Offset3`=8,
unless a dictionary is used, in which case they come from the dictionary.
Then each block gets its starting offset history from the ending values of the most recent `Compressed_Block`.
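
The handling of repeat codes can be sketched as follows. The history update rules (no change for `Repeated_Offset1`, swap for `Repeated_Offset2`, rotation otherwise) are taken from the complete specification, not from this excerpt. The function name is hypothetical, and rejecting a resulting offset of 0 is a choice made by this sketch.

```python
def resolve_offset(offset_value: int, literals_length: int, history: list) -> int:
    """Return the actual offset and update history in place.

    history holds [Repeated_Offset1, Repeated_Offset2, Repeated_Offset3].
    """
    if offset_value > 3:                     # regular offset
        offset = offset_value - 3
        history[:] = [offset, history[0], history[1]]
        return offset
    index = offset_value - 1                 # 0, 1 or 2
    if literals_length == 0:
        index += 1                           # repeat codes are shifted by one
    if index == 0:
        return history[0]                    # Repeated_Offset1: history unchanged
    offset = history[0] - 1 if index == 3 else history[index]
    if offset == 0:
        raise ValueError("corrupted data: offset of 0 (sketch's choice)")
    if index == 1:
        history[:] = [offset, history[0], history[2]]   # swap first two entries
    else:
        history[:] = [offset, history[0], history[1]]   # rotate, drop the oldest
    return offset
```

A decoder would start each frame with `history = [1, 4, 8]` unless a dictionary provides other values.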
@ -903,21 +927,28 @@ Skippable Frames
|:--------------:|:------------:|:-----------:|
| 4 bytes | 4 bytes | n bytes |
Skippable frames allow the insertion of user-defined data
Skippable frames allow the insertion of user-defined metadata
into a flow of concatenated frames.
Its design is pretty straightforward,
with the sole objective to allow the decoder to quickly skip
over user-defined data and continue decoding.
Skippable frames defined in this specification are compatible with [LZ4] ones.
[LZ4]:http://www.lz4.org
From a compliant decoder perspective, skippable frames need just be skipped,
and their content ignored, resuming decoding after the skippable frame.
It can be noted that a skippable frame
can be used to watermark a stream of concatenated frames
embedding any kind of tracking information (even just a UUID).
Users wary of such possibility should scan the stream of concatenated frames
in an attempt to detect such frames for analysis or removal.
__`Magic_Number`__
4 Bytes, __little-endian__ format.
Value : 0x184D2A5?, which means any value from 0x184D2A50 to 0x184D2A5F.
All 16 values are valid to identify a skippable frame.
This specification doesn't detail any specific tagging for skippable frames.
__`Frame_Size`__
@ -931,10 +962,16 @@ __`User_Data`__
The `User_Data` can be anything. Data will just be skipped by the decoder.
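
Skipping such a frame only requires reading its 8-byte header. The sketch below is a minimal illustration with a hypothetical function name. It assumes `Frame_Size` is a 4-byte __little-endian__ unsigned value giving the size of `User_Data` alone, as stated in the complete specification.

```python
import struct

SKIPPABLE_MAGIC_BASE = 0x184D2A50   # 0x184D2A5? : low 4 bits are free
SKIPPABLE_HEADER_SIZE = 8           # Magic_Number + Frame_Size

def skip_skippable_frame(data: bytes, pos: int = 0) -> int:
    """Return the position just after the skippable frame starting at pos."""
    if len(data) - pos < SKIPPABLE_HEADER_SIZE:
        raise ValueError("truncated skippable frame header")
    magic, frame_size = struct.unpack_from("<II", data, pos)
    if (magic & 0xFFFFFFF0) != SKIPPABLE_MAGIC_BASE:
        raise ValueError("not a skippable frame")
    end = pos + SKIPPABLE_HEADER_SIZE + frame_size
    if end > len(data):
        raise ValueError("truncated User_Data")
    return end   # decoding resumes here, User_Data is ignored
```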
Entropy Encoding
----------------
Two types of entropy encoding are used by the Zstandard format:
FSE, and Huffman coding.
Huffman is used to compress literals,
while FSE is used for all other symbols
(`Literals_Length_Code`, `Match_Length_Code`, offset codes)
and to compress Huffman headers.
FSE
---
@ -952,7 +989,7 @@ For additional details on FSE, see [Finite State Entropy].
FSE decoding involves a decoding table which has a power of 2 size, and contains three elements:
`Symbol`, `Num_Bits`, and `Baseline`.
The `log2` of the table size is its `Accuracy_Log`.
The FSE state represents an index in this table.
An FSE state value represents an index in this table.
To obtain the initial state value, consume `Accuracy_Log` bits from the stream as a __little-endian__ value.
The next symbol in the stream is the `Symbol` indicated in the table for that state.
@ -971,10 +1008,11 @@ on a normalized scale of `1 << Accuracy_Log` .
Note that there must be two or more symbols with nonzero probability.
It's a bitstream which is read forward, in __little-endian__ fashion.
It's not necessary to know its exact size,
since it will be discovered and reported by the decoding process.
It's not necessary to know the exact size of the bitstream:
it will be discovered and reported by the decoding process.
The bitstream starts by reporting on which scale it operates.
Let `low4Bits` designate the lowest 4 bits of the first byte :
`Accuracy_Log = low4Bits + 5`.
Then follows each symbol value, from `0` to last present one.
@ -1032,7 +1070,7 @@ and how many symbols are present.
The bitstream consumes a round number of bytes.
Any remaining bit within the last byte is just unused.
##### From normalized distribution to decoding tables
#### From normalized distribution to decoding tables
The distribution of normalized probabilities is enough
to create a unique decoding table.
@ -1143,7 +1181,7 @@ More bits improve accuracy but cost more header size,
and require more memory or more complex decoding operations.
This specification limits maximum code length to 11 bits.
##### Representation
#### Representation
All literal values from zero (included) to last present one (excluded)
are represented by `Weight` with values from `0` to `Max_Number_of_Bits`.
@ -1158,12 +1196,13 @@ This power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.
__Example__ :
Let's presume the following Huffman tree must be described :
| literal | 0 | 1 | 2 | 3 | 4 | 5 |
| literal value | 0 | 1 | 2 | 3 | 4 | 5 |
| ---------------- | --- | --- | --- | --- | --- | --- |
| `Number_of_Bits` | 1 | 2 | 3 | 0 | 4 | 4 |
The tree depth is 4, since its smallest element uses 4 bits.
Value `5` will not be listed as it can be determined from the values for 0-4,
The tree depth is 4, since its longest elements use 4 bits
(longest elements are the ones with smallest frequency).
Value `5` will not be listed, as it can be determined from values for 0-4,
nor will values above `5` as they are all 0.
Values from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`.
Weight formula is :
@ -1172,9 +1211,9 @@ Weight = Number_of_Bits ? (Max_Number_of_Bits + 1 - Number_of_Bits) : 0
```
It gives the following series of weights :
| literal | 0 | 1 | 2 | 3 | 4 |
| -------- | --- | --- | --- | --- | --- |
| `Weight` | 4 | 3 | 2 | 0 | 1 |
| literal value | 0 | 1 | 2 | 3 | 4 |
| ------------- | --- | --- | --- | --- | --- |
| `Weight` | 4 | 3 | 2 | 0 | 1 |
The decoder will do the inverse operation :
having collected weights of literals from `0` to `4`,
@ -1183,12 +1222,16 @@ The weight of `5` can be determined by advancing to the next power of 2.
The sum of `2^(Weight-1)` (excluding 0's) is :
`8 + 4 + 2 + 0 + 1 = 15`.
Nearest power of 2 is 16.
Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = 1`.
Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = 16-15 = 1`.
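
The example can be reproduced with a short Python sketch. Both function names are hypothetical. `complete_weights` takes the transmitted weights (without the last one) and deduces `Max_Number_of_Bits` and the missing weight by advancing to the next power of 2.

```python
def bits_to_weights(number_of_bits):
    """Apply Weight = Number_of_Bits ? (Max_Number_of_Bits + 1 - Number_of_Bits) : 0."""
    max_bits = max(number_of_bits)
    return [max_bits + 1 - bits if bits else 0 for bits in number_of_bits]

def complete_weights(weights):
    """Return (Max_Number_of_Bits, weight of the last, untransmitted symbol)."""
    total = sum(1 << (w - 1) for w in weights if w > 0)
    if total == 0:
        raise ValueError("corrupted data: no symbol with a nonzero weight")
    max_bits = total.bit_length()          # 1 << max_bits is the next power of 2
    rest = (1 << max_bits) - total         # must equal 2^(last_weight - 1)
    if rest & (rest - 1):
        raise ValueError("corrupted data: remainder is not a power of 2")
    return max_bits, rest.bit_length()
```

With the values of the example, `bits_to_weights([1, 2, 3, 0, 4, 4])` yields weights `4, 3, 2, 0, 1, 1`, and `complete_weights([4, 3, 2, 0, 1])` recovers a depth of 4 and a last weight of 1.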
##### Huffman Tree header
#### Huffman Tree header
This is a single byte value (0-255),
which describes how to decode the list of weights.
which describes how the series of weights is encoded.
- if `headerByte` < 128 :
the series of weights is compressed using FSE (see below).
The length of the FSE-compressed series is equal to `headerByte` (0-127).
- if `headerByte` >= 128 : this is a direct representation,
where each `Weight` is written directly as a 4 bits field (0-15).
@ -1196,17 +1239,15 @@ which describes how to decode the list of weights.
the top four bits and the second taking the bottom four (e.g. the following
operations could be used to read the weights:
`Weight[0] = (Byte[0] >> 4), Weight[1] = (Byte[0] & 0xf)`, etc.).
The full representation occupies `((Number_of_Symbols+1)/2)` bytes,
meaning it uses a last full byte even if `Number_of_Symbols` is odd.
The full representation occupies `Ceiling(Number_of_Symbols/2)` bytes,
meaning it uses only full bytes even if `Number_of_Symbols` is odd.
`Number_of_Symbols = headerByte - 127`.
Note that maximum `Number_of_Symbols` is 255-127 = 128.
A larger series must necessarily use FSE compression.
If any literal has a value > 128, raw header mode is not possible.
In such case, it's necessary to use FSE compression.
- if `headerByte` < 128 :
the series of weights is compressed by FSE.
The length of the FSE-compressed series is equal to `headerByte` (0-127).
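
For the direct representation, reading the weights amounts to splitting bytes into 4-bit fields. The following sketch uses a hypothetical function name, and `payload` stands for the bytes that follow `headerByte`.

```python
def read_direct_weights(header_byte: int, payload: bytes):
    """Return (weights, bytes_consumed) for a direct (headerByte >= 128) description."""
    if header_byte < 128:
        raise ValueError("headerByte < 128: weights are FSE-compressed")
    number_of_symbols = header_byte - 127
    size = (number_of_symbols + 1) // 2      # Ceiling(Number_of_Symbols / 2)
    if len(payload) < size:
        raise ValueError("truncated weight description")
    weights = []
    for byte in payload[:size]:
        weights.append(byte >> 4)            # first weight: top four bits
        weights.append(byte & 0xF)           # second weight: bottom four bits
    return weights[:number_of_symbols], size # drop the padding nibble when odd
```

For instance, a `headerByte` of `132` announces 5 weights stored in 3 bytes, and the low 4 bits of the last byte are unused.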
##### Finite State Entropy (FSE) compression of Huffman weights
#### Finite State Entropy (FSE) compression of Huffman weights
In this case, the series of Huffman weights is compressed using FSE compression.
It's a single bitstream with 2 interleaved states,
@ -1235,18 +1276,19 @@ The number of symbols to decode is determined
by tracking bitStream overflow condition:
If updating state after decoding a symbol would require more bits than
remain in the stream, it is assumed that extra bits are 0. Then,
the symbols for each of the final states are decoded and the process is complete.
symbols for each of the final states are decoded and the process is complete.
##### Conversion from weights to Huffman prefix codes
#### Conversion from weights to Huffman prefix codes
All present symbols shall now have a `Weight` value.
It is possible to transform weights into Number_of_Bits, using this formula:
It is possible to transform weights into `Number_of_Bits`, using this formula:
```
Number_of_Bits = Number_of_Bits ? Max_Number_of_Bits + 1 - Weight : 0
Number_of_Bits = (Weight>0) ? Max_Number_of_Bits + 1 - Weight : 0
```
Symbols are sorted by `Weight`. Within same `Weight`, symbols keep natural order.
Symbols are sorted by `Weight`.
Within same `Weight`, symbols keep natural sequential order.
Symbols with a `Weight` of zero are removed.
Then, starting from lowest weight, prefix codes are distributed in order.
Then, starting from lowest weight, prefix codes are distributed in sequential order.
__Example__ :
Let's presume the following list of weights has been decoded :
@ -1255,7 +1297,7 @@ Let's presume the following list of weights has been decoded :
| -------- | --- | --- | --- | --- | --- | --- |
| `Weight` | 4 | 3 | 2 | 0 | 1 | 1 |
Sorted by weight and then natural order,
Sorted by weight and then natural sequential order,
it gives the following distribution :
| Literal | 3 | 4 | 5 | 2 | 1 | 0 |
@ -1265,6 +1307,7 @@ it gives the following distribution :
| prefix codes | N/A | 0000| 0001| 001 | 01 | 1 |
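
The distribution of prefix codes in this example can be reproduced with the sketch below. It is one possible way to implement the rule, not necessarily how the reference decoder does it: each symbol of weight `w` is given `2^(w-1)` consecutive slots in a table of `2^Max_Number_of_Bits` entries, and its code is the slot position with the `w-1` low bits dropped. The function name is hypothetical.

```python
def weights_to_prefix_codes(weights):
    """Map each literal with a nonzero Weight to its prefix code (as a bit string)."""
    total = sum(1 << (w - 1) for w in weights if w > 0)
    if total == 0 or total & (total - 1):
        raise ValueError("corrupted data: weights do not sum to a power of 2")
    max_bits = total.bit_length() - 1        # Max_Number_of_Bits
    # Sort by Weight, then natural sequential order; zero weights are removed.
    order = sorted((s for s, w in enumerate(weights) if w > 0),
                   key=lambda s: (weights[s], s))
    codes = {}
    position = 0
    for symbol in order:
        weight = weights[symbol]
        number_of_bits = max_bits + 1 - weight
        codes[symbol] = format(position >> (weight - 1), "0{}b".format(number_of_bits))
        position += 1 << (weight - 1)
    return codes
```

Called with the weights `[4, 3, 2, 0, 1, 1]`, it returns `0000` and `0001` for literals 4 and 5, `001` for literal 2, `01` for literal 1 and `1` for literal 0, while literal 3 gets no code.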
### Huffman-coded Streams
Given a Huffman decoding table,
it's possible to decode a Huffman-coded stream.
@ -1342,7 +1385,7 @@ _Reserved ranges :_
- low range : <= 32767
- high range : >= (2^31)
__`Entropy_Tables`__ : following the same format as the tables in compressed blocks.
__`Entropy_Tables`__ : follow the same format as tables in [compressed blocks].
See the relevant [FSE](#fse-table-description)
and [Huffman](#huffman-tree-description) sections for how to decode these tables.
They are stored in following order :
@ -1366,6 +1409,10 @@ __`Content`__ : The rest of the dictionary is its content.
[compressed blocks]: #the-format-of-compressed_block
If a dictionary is provided by an external source,
it should be loaded with great care, its content considered untrusted.
Appendix A - Decoding tables for predefined codes
-------------------------------------------------
@ -1552,8 +1599,29 @@ to crosscheck that an implementation build its decoding tables correctly.
| 30 | 25 | 5 | 0 |
| 31 | 24 | 5 | 0 |
Appendix B - Resources for implementers
-------------------------------------------------
An open source reference implementation is available on :
https://github.com/facebook/zstd
The project contains a frame generator, called [decodeCorpus],
which can be used by any 3rd-party implementation
to verify that a tested decoder is compliant with the specification.
[decodeCorpus]: https://github.com/facebook/zstd/tree/v1.3.4/tests#decodecorpus---tool-to-generate-zstandard-frames-for-decoder-testing
`decodeCorpus` generates random valid frames.
A compliant decoder should be able to decode them all,
or at least provide a meaningful error code explaining for which reason it cannot
(memory limit restrictions for example).
Version changes
---------------
- 0.2.8 : clarifications for IETF RFC discuss
- 0.2.7 : clarifications from IETF RFC review, by Vijay Gurbani and Nick Terrell
- 0.2.6 : fixed an error in huffman example, by Ulrich Kunitz
- 0.2.5 : minor typos and clarifications