the flexArray in structure FSE_DecompressWksp
is just a way to derive a pointer easily,
without risk/complexity of calculating it manually.
Not sure if this change is good enough to avoid ubsan warnings though.
within ZSTDMT_.
This pattern is flagged by less forgiving variants of ubsan
notably used during compilation of the Linux Kernel.
There are 2 other places in the code where this pattern is used.
This fixes just one of them.
* Remove all pointer-overflow suppressions from our UBSAN builds/tests.
* Add `ZSTD_ALLOW_POINTER_OVERFLOW_ATTR` macro to suppress
pointer-overflow at a per-function level. This is a superior approach
because it also applies to users who build zstd with UBSAN.
* Add `ZSTD_wrappedPtr{Diff,Add,Sub}()` that use these suppressions.
The end goal is to only tag these functions with
`ZSTD_ALLOW_POINTER_OVERFLOW`. But we can start by annoting functions
that rely on pointer overflow, and gradually transition to using
these.
* Add `ZSTD_maybeNullPtrAdd()` to simplify pointer addition when the
pointer may be `NULL`.
* Fix all the fuzzer issues that came up. I'm sure there will be a lot
more, but these are the ones that came up within a few minutes of
running the fuzzers, and while running GitHub CI.
Refine the macro guards to define the functions exactly when they are
needed.
This fixes the chromium build with zstd.
Thanks to @GregTho for reporting!
The Huffman repeat mode checker assumed that the CTable was zeroed in the region `[maxSymbolValue + 1, 256)`.
This assumption didn't hold for tables built in the dictionaries, because it didn't go through the same codepath.
Since this code was originally written, we added a header to the CTable that specifies the `tableLog`.
Add `maxSymbolValue` to that header, and check that the table's `maxSymbolValue` is at least the block's `maxSymbolValue`.
This solution is cleaner because we write this header for every CTable we build, so it can't be missed in any code path.
Credit to OSS-Fuzz
Fix the following warnings reported by the compiler when
ZDICTLIB_STATIC_API is not defined to ZDICTLIB_API:
lib/dictBuilder/cover.c:1122:21: warning: redeclaration of 'ZDICT_optimizeTrainFromBuffer_cover' with different visibility (old visibility
preserved)
lib/dictBuilder/cover.c:736:21: warning: redeclaration of 'ZDICT_trainFromBuffer_cover' with different visibility (old visibility
+preserved)
lib/dictBuilder/fastcover.c:549:1: warning: redeclaration of 'ZDICT_trainFromBuffer_fastCover' with different visibility (old visibility
preserved)
lib/dictBuilder/fastcover.c:618:1: warning: redeclaration of 'ZDICT_optimizeTrainFromBuffer_fastCover' with different visibility (old
visibility preserved)
We already have logic in our Huffman encoder to validate Huffman tables with missing symbols.
We use this for higher compression levels to re-use the previous blocks statistics, or when the dictionaries table has zero-weighted symbols.
This check was leftover as an oversight from before we added validation for Huffman tables.
I validated that the `dictionary_loader` fuzzer has coverage of every line in the `ZSTD_loadCEntropy()` function to validate that it is correctly testing this function.
MSAN is hooked into the system malloc, but when the user provides a custom
allocator, it may not provide the same cleansing behavior. So if we leave
memory poisoned and return it to the user's allocator, where it is re-used
elsewhere, our poisoning can blow up in some other context.
They are Linux-like environments under Windows and have all the tools needed to support staged installation and testing.
Beware: this only affects the make build system.
The sequence section starts with a number, which tells how sequences are present in the section.
If this number if 0, the section automatically ends.
The number 0 can be represented using the 1 byte or the 2 bytes formats.
That's because the 2-bytes formats fully overlaps the 1 byte format.
However, when 0 is represented using the 2-bytes format,
the decoder was expecting the sequence section to continue,
and was looking for FSE tables, which is incorrect.
Fixed this behavior, in both the reference decoder and the educational behavior.
In practice, this behavior never happens,
because the encoder will always select the 1-byte format to represent 0,
since this is more efficient.
Completed the fix with a new golden sample for tests,
a clarification of the specification,
and a decoder errata paragraph.
and in `decodecorpus`:
the specific case `nbSeq=127` can be represented using the 1-byte format.
Note that both the 1-byte and the 2-bytes formats are valid to represent this case,
so there was no "error", produced data remains valid,
it's just that the 1-byte format is more efficient.
fix#3667
Credit to @ip7z for finding this issue.