zstd/dictBuilder at 211a61b69b919d0963ed0ce9de943a20866bef1b - zstd

mirror of https://github.com/facebook/zstd.git synced 2024-12-18 12:40:09 +08:00


Nick Terrell 7cbb8bbbbf [cover] Small compression ratio improvement	2018-05-18 16:15:27 -07:00

The cover algorithm selects one segment per epoch, and it selects the epoch size such that `epochs * segmentSize ~= dictSize`. Selecting fewer epochs gives the algorithm more candidates to choose from for each segment it selects, and then it will loop back to the first epoch when it hits the last one. The trade-off is that it now takes longer to select each segment, since it has to look at more data before making a choice.

I benchmarked on the following data sets using this command:

```sh
$ZSTD -T0 -3 --train-cover=d=8,steps=256 $DIR -r -o dict && $ZSTD -3 -D dict -rc $DIR | wc -c
```

| Data set     | k (approx) | Before   | After    | % difference |
|--------------|------------|----------|----------|--------------|
| GitHub       | ~1000      | 738138   | 746610   | +1.14%       |
| hg-changelog | ~90        | 4295156  | 4285336  | -0.23%       |
| hg-commands  | ~500       | 1095580  | 1079814  | -1.44%       |
| hg-manifest  | ~400       | 16559892 | 16504346 | -0.34%       |

There is some noise in the measurements, since small changes to `k` can make large differences, which is why I'm using `steps=256`, to try to minimize the noise. However, the GitHub data set still has some noise. If I run the GitHub data set on my Mac, which presumably lists directory entries in a different order, so the dictionary builder sees the files in a different order, or if I use `steps=1024`, I see these results.

| Run        | Before | After  | % difference |
|------------|--------|--------|--------------|
| steps=1024 | 738138 | 734470 | -0.50%       |
| MacBook    | 738451 | 737132 | -0.18%       |

Question: Should we expose this as a parameter? I don't think it is necessary. Someone might want to turn it up to trade a much longer dictionary building time for a slightly better dictionary. I tested `2`, `4`, and `16`, and `4` got most of the benefit of `16` with a faster running time.
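
The epoch arithmetic described in the commit message can be sketched as follows. This is a minimal illustration, not the code from `cover.c`: the names `epoch_plan_t`, `plan_epochs`, and `divisor` are hypothetical, and the sketch assumes that "selecting fewer epochs" means dividing the baseline epoch count `dictSize / k` by a small constant (the message reports testing `2`, `4`, and `16`).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: how many epochs the corpus is split into,
 * and how many bytes of the corpus each epoch covers. */
typedef struct {
    size_t epochs;
    size_t epochSize;
} epoch_plan_t;

/* Hypothetical helper illustrating the trade-off from the commit message.
 *   dictSize   : target dictionary size in bytes
 *   k          : segment size in bytes (one segment is picked per epoch)
 *   corpusSize : total size of the training data in bytes
 *   divisor    : 1 gives the baseline epochs * k ~= dictSize; a larger
 *                value gives fewer, larger epochs (assumption: the commit
 *                settles on a small constant such as 4). */
static epoch_plan_t plan_epochs(size_t dictSize, size_t k,
                                size_t corpusSize, size_t divisor)
{
    epoch_plan_t plan;
    assert(k > 0 && divisor > 0);

    plan.epochs = dictSize / k / divisor;
    if (plan.epochs < 1) {
        plan.epochs = 1;  /* always keep at least one epoch */
    }
    /* Fewer epochs means each epoch spans more of the corpus, so each
     * segment is chosen from more candidates at a higher search cost. */
    plan.epochSize = corpusSize / plan.epochs;
    return plan;
}
```

With an illustrative 112640-byte dictionary, `k = 1000`, and a 10 MB corpus, `plan_epochs(112640, 1000, 10000000, 1)` yields 112 epochs, while a divisor of 4 yields 28 epochs that are each about four times larger. Since the dictionary still needs roughly `dictSize / k` segments, the selection has to wrap around to the first epoch after the last one, as the message describes.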
..
cover.c	[cover] Small compression ratio improvement	2018-05-18 16:15:27 -07:00
divsufsort.c	separation of lib/ into common/, compress/, decompress/, dictBuilder/, legacy/	2016-04-22 12:43:18 +02:00
divsufsort.h	fixed multiple minor warnings for XCode	2016-08-26 01:43:47 +02:00
zdict.c	Allow negative compression levels in training	2018-04-09 12:12:03 -07:00
zdict.h	bumped version number to v1.3.4	2018-01-27 22:23:26 -08:00