mirror of
https://github.com/facebook/zstd.git
synced 2024-12-18 12:40:09 +08:00
7cbb8bbbbf
The cover algorithm selects one segment per epoch, and it chooses the epoch size such that `epochs * segmentSize ~= dictSize`. Selecting fewer epochs gives the algorithm more candidates to choose from for each segment it selects; when it reaches the last epoch, it loops back to the first one. The trade-off is that it now takes longer to select each segment, since it has to look at more data before making a choice.

I benchmarked on the following data sets using this command, which prints the total compressed size in bytes:

```sh
$ZSTD -T0 -3 --train-cover=d=8,steps=256 $DIR -r -o dict && $ZSTD -3 -D dict -rc $DIR | wc -c
```

| Data set     | k (approx) | Before   | After    | % difference |
|--------------|------------|----------|----------|--------------|
| GitHub       | ~1000      | 738138   | 746610   | +1.14%       |
| hg-changelog | ~90        | 4295156  | 4285336  | -0.23%       |
| hg-commands  | ~500       | 1095580  | 1079814  | -1.44%       |
| hg-manifest  | ~400       | 16559892 | 16504346 | -0.34%       |

There is some noise in the measurements, since small changes to `k` can make a large difference, which is why I'm using `steps=256` to try to minimize the noise. However, the GitHub data set still has some noise. If I run the GitHub data set on my Mac, which presumably lists directory entries in a different order, so the dictionary builder sees the files in a different order, or if I use `steps=1024`, I see these results:

| Run        | Before | After  | % difference |
|------------|--------|--------|--------------|
| steps=1024 | 738138 | 734470 | -0.50%       |
| MacBook    | 738451 | 737132 | -0.18%       |

Question: Should we expose this as a parameter? I don't think it is necessary. Someone might want to turn it up, accepting a much longer dictionary building time in exchange for a slightly better dictionary. I tested `2`, `4`, and `16`, and `4` got most of the benefit of `16` with a faster running time.
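The epoch arithmetic described in the commit message can be sketched roughly as follows. This is a simplified illustration, not the code in `cover.c`: the names `epoch_plan_t`, `plan_epochs`, `epoch_for_segment`, and the `passes` parameter are hypothetical, and the sketch assumes that the tested values (`2`, `4`, `16`) act as a divisor on the number of epochs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type for illustration; not taken from cover.c. */
typedef struct {
    size_t num;   /* number of epochs the source is split into */
    size_t size;  /* number of source positions covered by each epoch */
} epoch_plan_t;

/*
 * Split the source into epochs.
 * With passes == 1, num * k is roughly dictSize (one segment per epoch
 * fills the dictionary in a single sweep). A larger `passes` (assumed
 * here to be the tuning factor from the commit message) yields fewer,
 * larger epochs, so each selection sees more candidates.
 */
epoch_plan_t plan_epochs(size_t dictSize, size_t k, size_t sourceSize,
                         size_t passes)
{
    epoch_plan_t plan;
    assert(k > 0 && passes > 0);
    plan.num = dictSize / k / passes;
    if (plan.num == 0) {
        plan.num = 1;  /* always keep at least one epoch */
    }
    plan.size = sourceSize / plan.num;
    return plan;
}

/*
 * Epoch searched when selecting the n-th segment (0-based).
 * After the last epoch, selection wraps around to the first one.
 */
size_t epoch_for_segment(size_t n, const epoch_plan_t *plan)
{
    return n % plan->num;
}
```

With example values of `dictSize = 112640`, `k = 1000`, and a 10,000,000-byte source, `passes = 1` gives 112 epochs while `passes = 4` gives 28 epochs that are four times as large, so the selection loop sweeps the source about four times before the dictionary is full.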
| Name |
|---|
| .. |
| cover.c |
| divsufsort.c |
| divsufsort.h |
| zdict.c |
| zdict.h |