项目作者: lemire

项目描述 :
High-performance dictionary coding
高级语言: C++
项目地址: git://github.com/lemire/dictionary.git
创建时间: 2016-08-11T20:20:58Z
项目社区:https://github.com/lemire/dictionary

开源协议:Apache License 2.0

下载


dictionary

High-performance dictionary coding

Suppose you want to compress a large array of values with
(relatively) few distinct values. For example, maybe you have 16 distinct 64-bit
values. Only four bits are needed to store a value in the range [0,16) using
binary packing, so if you have long arrays, it is possible to save 60 bits per value (compress
the data by a factor of 16).

We consider the following (simple) form of dictionary coding. We
have a dictionary of 64-bit values (could be pointers) stored
in an array. In the compression phase, we convert the values to indexes
and binary pack them. In the decompression phase, we
try to recover the dictionary-coded values as fast as possible.

Dictionary coding is in common use within database systems (e.g., Oracle, Parquet and so forth).

We are going to assume that one has a recent Intel processor
for the sake of this experiment.

Core Idea

It is tempting in dictionary coding, to first unpack the indexes to a temporary buffer
and then run through it and look-up the values in the dictionary. What if it were possible
to decode the indexes and look-up the values in the dictionary in one step?
It is possible with vector instructions as long as you have access to a gather
instruction. Thankfully, recent commodity x64 processors have such an instruction.

A word on RAM access

There is no slower processor than an idle processor waiting for the memory
subsystem.

When working with large data sets, it is tempting to decompress them from RAM
to RAM, converting gigabytes of compressed data into (many more) gigabytes of
uncompressed data.

If the purpose of compression is to keep more of the data close to the CPU,
then this is wasteful.

One should engineer applications so as to work on cache-friendly blocks. For
example, if have an array made of billions of values, instead of decoding them
all to RAM, and then reading them, it is much better to decode them in small blocks
at a time. In fact, ideally, one would prefer not to decode the data at all if possible:
working directly over the compressed data would be ideal.

If you must decode gigabytes of data to RAM or to disk, then you should expect
to be wasting enormous quantities of CPU cycles.

Usage

  1. make && make test
  2. ./decodebenchmark

Experimental results (Skylake, August 24th 2016)

We find that an AVX2 dictionary decoder can be more than twice as fast as a good scalar decoder
on a recent Intel processor (Skylake) for modest dictionary sizes. Even with large
dictionaries, the AVX2 gather approach is still remarkably faster. See results below. We expect results on older
Intel architectures to be less impressive because the vpgather instruction that we use was
quite slow in its early incarnations.

The case with large dictionary as implemented here is somewhat pessimistic as it assumes
that all values are equally likely. In most instances, a dictionary will have frequent
values, more likely to be repeated. This will reduce the number of cache misses.

Also, in practice one might limit the size of the dictionary by horizontal partitions.

  1. $ ./decodebenchmark
  2. For this benchmark, use a recent (Skylake) Intel processor for best results.
  3. Intel processor: Skylake compiler version: 5.3.0 20151204 AVX2 is available.
  4. Using array sizes of 8388608 values or 65536 kiB.
  5. testing with dictionary of size 2
  6. Actual dict size: 2
  7. scalarcodec.uncompress(t,newbuf): 4.00 cycles per decoded value
  8. decodetocache(&sc, &t,newbuf,bufsize): 3.06 cycles per decoded value
  9. avxcodec.uncompress(t,newbuf): 3.45 cycles per decoded value
  10. AVXDictCODEC::fastuncompress(t,newbuf): 1.91 cycles per decoded value
  11. AVXdecodetocache(&t,newbuf,bufsize): 1.15 cycles per decoded value
  12. testing with dictionary of size 4
  13. Actual dict size: 4
  14. scalarcodec.uncompress(t,newbuf): 3.99 cycles per decoded value
  15. decodetocache(&sc, &t,newbuf,bufsize): 3.06 cycles per decoded value
  16. avxcodec.uncompress(t,newbuf): 3.46 cycles per decoded value
  17. AVXDictCODEC::fastuncompress(t,newbuf): 1.91 cycles per decoded value
  18. AVXdecodetocache(&t,newbuf,bufsize): 1.19 cycles per decoded value
  19. testing with dictionary of size 8
  20. Actual dict size: 8
  21. scalarcodec.uncompress(t,newbuf): 3.52 cycles per decoded value
  22. decodetocache(&sc, &t,newbuf,bufsize): 2.38 cycles per decoded value
  23. avxcodec.uncompress(t,newbuf): 3.49 cycles per decoded value
  24. AVXDictCODEC::fastuncompress(t,newbuf): 1.93 cycles per decoded value
  25. AVXdecodetocache(&t,newbuf,bufsize): 1.17 cycles per decoded value
  26. testing with dictionary of size 16
  27. Actual dict size: 16
  28. scalarcodec.uncompress(t,newbuf): 4.01 cycles per decoded value
  29. decodetocache(&sc, &t,newbuf,bufsize): 3.08 cycles per decoded value
  30. avxcodec.uncompress(t,newbuf): 3.50 cycles per decoded value
  31. AVXDictCODEC::fastuncompress(t,newbuf): 1.95 cycles per decoded value
  32. AVXdecodetocache(&t,newbuf,bufsize): 1.19 cycles per decoded value
  33. testing with dictionary of size 32
  34. Actual dict size: 32
  35. scalarcodec.uncompress(t,newbuf): 4.02 cycles per decoded value
  36. decodetocache(&sc, &t,newbuf,bufsize): 3.06 cycles per decoded value
  37. avxcodec.uncompress(t,newbuf): 3.51 cycles per decoded value
  38. AVXDictCODEC::fastuncompress(t,newbuf): 1.96 cycles per decoded value
  39. AVXdecodetocache(&t,newbuf,bufsize): 1.18 cycles per decoded value
  40. testing with dictionary of size 64
  41. Actual dict size: 64
  42. scalarcodec.uncompress(t,newbuf): 4.02 cycles per decoded value
  43. decodetocache(&sc, &t,newbuf,bufsize): 3.08 cycles per decoded value
  44. avxcodec.uncompress(t,newbuf): 3.54 cycles per decoded value
  45. AVXDictCODEC::fastuncompress(t,newbuf): 1.98 cycles per decoded value
  46. AVXdecodetocache(&t,newbuf,bufsize): 1.17 cycles per decoded value
  47. testing with dictionary of size 128
  48. Actual dict size: 128
  49. scalarcodec.uncompress(t,newbuf): 3.59 cycles per decoded value
  50. decodetocache(&sc, &t,newbuf,bufsize): 2.35 cycles per decoded value
  51. avxcodec.uncompress(t,newbuf): 3.55 cycles per decoded value
  52. AVXDictCODEC::fastuncompress(t,newbuf): 1.99 cycles per decoded value
  53. AVXdecodetocache(&t,newbuf,bufsize): 1.14 cycles per decoded value
  54. testing with dictionary of size 256
  55. Actual dict size: 256
  56. scalarcodec.uncompress(t,newbuf): 4.03 cycles per decoded value
  57. decodetocache(&sc, &t,newbuf,bufsize): 3.10 cycles per decoded value
  58. avxcodec.uncompress(t,newbuf): 3.55 cycles per decoded value
  59. AVXDictCODEC::fastuncompress(t,newbuf): 2.00 cycles per decoded value
  60. AVXdecodetocache(&t,newbuf,bufsize): 1.22 cycles per decoded value
  61. testing with dictionary of size 512
  62. Actual dict size: 512
  63. scalarcodec.uncompress(t,newbuf): 4.04 cycles per decoded value
  64. decodetocache(&sc, &t,newbuf,bufsize): 3.11 cycles per decoded value
  65. avxcodec.uncompress(t,newbuf): 3.55 cycles per decoded value
  66. AVXDictCODEC::fastuncompress(t,newbuf): 2.01 cycles per decoded value
  67. AVXdecodetocache(&t,newbuf,bufsize): 1.20 cycles per decoded value
  68. testing with dictionary of size 1024
  69. Actual dict size: 1024
  70. scalarcodec.uncompress(t,newbuf): 4.04 cycles per decoded value
  71. decodetocache(&sc, &t,newbuf,bufsize): 3.11 cycles per decoded value
  72. avxcodec.uncompress(t,newbuf): 3.57 cycles per decoded value
  73. AVXDictCODEC::fastuncompress(t,newbuf): 2.04 cycles per decoded value
  74. AVXdecodetocache(&t,newbuf,bufsize): 1.18 cycles per decoded value
  75. testing with dictionary of size 2048
  76. Actual dict size: 2048
  77. scalarcodec.uncompress(t,newbuf): 4.08 cycles per decoded value
  78. decodetocache(&sc, &t,newbuf,bufsize): 3.15 cycles per decoded value
  79. avxcodec.uncompress(t,newbuf): 3.67 cycles per decoded value
  80. AVXDictCODEC::fastuncompress(t,newbuf): 2.05 cycles per decoded value
  81. AVXdecodetocache(&t,newbuf,bufsize): 1.22 cycles per decoded value
  82. testing with dictionary of size 4096
  83. Actual dict size: 4096
  84. scalarcodec.uncompress(t,newbuf): 4.14 cycles per decoded value
  85. decodetocache(&sc, &t,newbuf,bufsize): 3.33 cycles per decoded value
  86. avxcodec.uncompress(t,newbuf): 3.69 cycles per decoded value
  87. AVXDictCODEC::fastuncompress(t,newbuf): 2.12 cycles per decoded value
  88. AVXdecodetocache(&t,newbuf,bufsize): 1.32 cycles per decoded value
  89. testing with dictionary of size 8192
  90. Actual dict size: 8192
  91. scalarcodec.uncompress(t,newbuf): 4.35 cycles per decoded value
  92. decodetocache(&sc, &t,newbuf,bufsize): 3.65 cycles per decoded value
  93. avxcodec.uncompress(t,newbuf): 3.85 cycles per decoded value
  94. AVXDictCODEC::fastuncompress(t,newbuf): 2.28 cycles per decoded value
  95. AVXdecodetocache(&t,newbuf,bufsize): 1.67 cycles per decoded value
  96. testing with dictionary of size 16384
  97. Actual dict size: 16384
  98. scalarcodec.uncompress(t,newbuf): 4.51 cycles per decoded value
  99. decodetocache(&sc, &t,newbuf,bufsize): 3.95 cycles per decoded value
  100. avxcodec.uncompress(t,newbuf): 4.07 cycles per decoded value
  101. AVXDictCODEC::fastuncompress(t,newbuf): 2.55 cycles per decoded value
  102. AVXdecodetocache(&t,newbuf,bufsize): 2.12 cycles per decoded value
  103. testing with dictionary of size 32768
  104. Actual dict size: 32768
  105. scalarcodec.uncompress(t,newbuf): 4.88 cycles per decoded value
  106. decodetocache(&sc, &t,newbuf,bufsize): 3.84 cycles per decoded value
  107. avxcodec.uncompress(t,newbuf): 4.89 cycles per decoded value
  108. AVXDictCODEC::fastuncompress(t,newbuf): 3.52 cycles per decoded value
  109. AVXdecodetocache(&t,newbuf,bufsize): 3.02 cycles per decoded value
  110. testing with dictionary of size 65536
  111. Actual dict size: 65536
  112. scalarcodec.uncompress(t,newbuf): 7.14 cycles per decoded value
  113. decodetocache(&sc, &t,newbuf,bufsize): 5.47 cycles per decoded value
  114. avxcodec.uncompress(t,newbuf): 6.68 cycles per decoded value
  115. AVXDictCODEC::fastuncompress(t,newbuf): 5.18 cycles per decoded value
  116. AVXdecodetocache(&t,newbuf,bufsize): 4.53 cycles per decoded value
  117. testing with dictionary of size 131072
  118. Actual dict size: 131072
  119. scalarcodec.uncompress(t,newbuf): 7.96 cycles per decoded value
  120. decodetocache(&sc, &t,newbuf,bufsize): 6.05 cycles per decoded value
  121. avxcodec.uncompress(t,newbuf): 7.53 cycles per decoded value
  122. AVXDictCODEC::fastuncompress(t,newbuf): 6.01 cycles per decoded value
  123. AVXdecodetocache(&t,newbuf,bufsize): 5.43 cycles per decoded value
  124. testing with dictionary of size 262144
  125. Actual dict size: 262144
  126. scalarcodec.uncompress(t,newbuf): 8.30 cycles per decoded value
  127. decodetocache(&sc, &t,newbuf,bufsize): 6.35 cycles per decoded value
  128. avxcodec.uncompress(t,newbuf): 8.08 cycles per decoded value
  129. AVXDictCODEC::fastuncompress(t,newbuf): 6.46 cycles per decoded value
  130. AVXdecodetocache(&t,newbuf,bufsize): 5.66 cycles per decoded value
  131. testing with dictionary of size 524288
  132. Actual dict size: 524288
  133. scalarcodec.uncompress(t,newbuf): 8.48 cycles per decoded value
  134. decodetocache(&sc, &t,newbuf,bufsize): 6.39 cycles per decoded value
  135. avxcodec.uncompress(t,newbuf): 8.09 cycles per decoded value
  136. AVXDictCODEC::fastuncompress(t,newbuf): 6.44 cycles per decoded value
  137. AVXdecodetocache(&t,newbuf,bufsize): 5.83 cycles per decoded value
  138. testing with dictionary of size 1048576
  139. Actual dict size: 1048235
  140. scalarcodec.uncompress(t,newbuf): 11.85 cycles per decoded value
  141. decodetocache(&sc, &t,newbuf,bufsize): 10.53 cycles per decoded value
  142. avxcodec.uncompress(t,newbuf): 11.65 cycles per decoded value
  143. AVXDictCODEC::fastuncompress(t,newbuf): 8.47 cycles per decoded value
  144. AVXdecodetocache(&t,newbuf,bufsize): 8.07 cycles per decoded value

Experimental results (Knights Landing, August 24th 2016)

We find that an AVX-512 dictionary decoder can be than twice as fast as an AVX dictionary
decoder which is in turn twice as fast as a scalar decoder
on a recent Intel processor (Knights Landing) for modest dictionary sizes.
The case with large dictionary as implemented here is somewhat pessimistic as it assumes
that all values are equally likely.

  1. $ ./decodebenchmark
  2. For this benchmark, use a recent (Skylake) Intel processor for best results.
  3. Intel processor: UNKNOWN compiler version: 5.3.0 AVX2 is available.
  4. Using array sizes of 8388608 values or 65536 kiB.
  5. testing with dictionary of size 2
  6. Actual dict size: 2
  7. scalarcodec.uncompress(t,newbuf): 7.75 cycles per decoded value
  8. decodetocache(&sc, &t,newbuf,bufsize): 7.39 cycles per decoded value
  9. avxcodec.uncompress(t,newbuf): 6.26 cycles per decoded value
  10. AVXDictCODEC::fastuncompress(t,newbuf): 3.22 cycles per decoded value
  11. AVXdecodetocache(&t,newbuf,bufsize): 3.06 cycles per decoded value
  12. AVX512DictCODEC::fastuncompress(t,newbuf): 1.48 cycles per decoded value
  13. AVX512decodetocache(&t,newbuf,bufsize): 1.14 cycles per decoded value
  14. testing with dictionary of size 4
  15. Actual dict size: 4
  16. scalarcodec.uncompress(t,newbuf): 7.83 cycles per decoded value
  17. decodetocache(&sc, &t,newbuf,bufsize): 7.49 cycles per decoded value
  18. avxcodec.uncompress(t,newbuf): 6.35 cycles per decoded value
  19. AVXDictCODEC::fastuncompress(t,newbuf): 3.23 cycles per decoded value
  20. AVXdecodetocache(&t,newbuf,bufsize): 3.10 cycles per decoded value
  21. AVX512DictCODEC::fastuncompress(t,newbuf): 1.49 cycles per decoded value
  22. AVX512decodetocache(&t,newbuf,bufsize): 1.21 cycles per decoded value
  23. testing with dictionary of size 8
  24. Actual dict size: 8
  25. scalarcodec.uncompress(t,newbuf): 7.27 cycles per decoded value
  26. decodetocache(&sc, &t,newbuf,bufsize): 6.99 cycles per decoded value
  27. avxcodec.uncompress(t,newbuf): 6.17 cycles per decoded value
  28. AVXDictCODEC::fastuncompress(t,newbuf): 3.23 cycles per decoded value
  29. AVXdecodetocache(&t,newbuf,bufsize): 3.10 cycles per decoded value
  30. AVX512DictCODEC::fastuncompress(t,newbuf): 1.59 cycles per decoded value
  31. AVX512decodetocache(&t,newbuf,bufsize): 1.25 cycles per decoded value
  32. testing with dictionary of size 16
  33. Actual dict size: 16
  34. scalarcodec.uncompress(t,newbuf): 7.98 cycles per decoded value
  35. decodetocache(&sc, &t,newbuf,bufsize): 7.65 cycles per decoded value
  36. avxcodec.uncompress(t,newbuf): 6.32 cycles per decoded value
  37. AVXDictCODEC::fastuncompress(t,newbuf): 3.23 cycles per decoded value
  38. AVXdecodetocache(&t,newbuf,bufsize): 3.16 cycles per decoded value
  39. AVX512DictCODEC::fastuncompress(t,newbuf): 1.68 cycles per decoded value
  40. AVX512decodetocache(&t,newbuf,bufsize): 1.34 cycles per decoded value
  41. testing with dictionary of size 32
  42. Actual dict size: 32
  43. scalarcodec.uncompress(t,newbuf): 7.92 cycles per decoded value
  44. decodetocache(&sc, &t,newbuf,bufsize): 7.63 cycles per decoded value
  45. avxcodec.uncompress(t,newbuf): 6.27 cycles per decoded value
  46. AVXDictCODEC::fastuncompress(t,newbuf): 3.23 cycles per decoded value
  47. AVXdecodetocache(&t,newbuf,bufsize): 3.19 cycles per decoded value
  48. AVX512DictCODEC::fastuncompress(t,newbuf): 1.65 cycles per decoded value
  49. AVX512decodetocache(&t,newbuf,bufsize): 1.43 cycles per decoded value
  50. testing with dictionary of size 64
  51. Actual dict size: 64
  52. scalarcodec.uncompress(t,newbuf): 8.05 cycles per decoded value
  53. decodetocache(&sc, &t,newbuf,bufsize): 7.76 cycles per decoded value
  54. avxcodec.uncompress(t,newbuf): 6.32 cycles per decoded value
  55. AVXDictCODEC::fastuncompress(t,newbuf): 3.31 cycles per decoded value
  56. AVXdecodetocache(&t,newbuf,bufsize): 3.25 cycles per decoded value
  57. AVX512DictCODEC::fastuncompress(t,newbuf): 1.85 cycles per decoded value
  58. AVX512decodetocache(&t,newbuf,bufsize): 1.66 cycles per decoded value
  59. testing with dictionary of size 128
  60. Actual dict size: 128
  61. scalarcodec.uncompress(t,newbuf): 6.64 cycles per decoded value
  62. decodetocache(&sc, &t,newbuf,bufsize): 6.36 cycles per decoded value
  63. avxcodec.uncompress(t,newbuf): 6.19 cycles per decoded value
  64. AVXDictCODEC::fastuncompress(t,newbuf): 3.34 cycles per decoded value
  65. AVXdecodetocache(&t,newbuf,bufsize): 3.28 cycles per decoded value
  66. AVX512DictCODEC::fastuncompress(t,newbuf): 1.83 cycles per decoded value
  67. AVX512decodetocache(&t,newbuf,bufsize): 1.57 cycles per decoded value
  68. testing with dictionary of size 256
  69. Actual dict size: 256
  70. scalarcodec.uncompress(t,newbuf): 8.07 cycles per decoded value
  71. decodetocache(&sc, &t,newbuf,bufsize): 7.87 cycles per decoded value
  72. avxcodec.uncompress(t,newbuf): 6.39 cycles per decoded value
  73. AVXDictCODEC::fastuncompress(t,newbuf): 3.39 cycles per decoded value
  74. AVXdecodetocache(&t,newbuf,bufsize): 3.35 cycles per decoded value
  75. AVX512DictCODEC::fastuncompress(t,newbuf): 1.95 cycles per decoded value
  76. AVX512decodetocache(&t,newbuf,bufsize): 1.69 cycles per decoded value
  77. testing with dictionary of size 512
  78. Actual dict size: 512
  79. scalarcodec.uncompress(t,newbuf): 8.07 cycles per decoded value
  80. decodetocache(&sc, &t,newbuf,bufsize): 7.87 cycles per decoded value
  81. avxcodec.uncompress(t,newbuf): 6.32 cycles per decoded value
  82. AVXDictCODEC::fastuncompress(t,newbuf): 3.52 cycles per decoded value
  83. AVXdecodetocache(&t,newbuf,bufsize): 3.48 cycles per decoded value
  84. AVX512DictCODEC::fastuncompress(t,newbuf): 2.04 cycles per decoded value
  85. AVX512decodetocache(&t,newbuf,bufsize): 1.76 cycles per decoded value
  86. testing with dictionary of size 1024
  87. Actual dict size: 1024
  88. scalarcodec.uncompress(t,newbuf): 8.22 cycles per decoded value
  89. decodetocache(&sc, &t,newbuf,bufsize): 7.97 cycles per decoded value
  90. avxcodec.uncompress(t,newbuf): 6.43 cycles per decoded value
  91. AVXDictCODEC::fastuncompress(t,newbuf): 3.63 cycles per decoded value
  92. AVXdecodetocache(&t,newbuf,bufsize): 3.57 cycles per decoded value
  93. AVX512DictCODEC::fastuncompress(t,newbuf): 2.05 cycles per decoded value
  94. AVX512decodetocache(&t,newbuf,bufsize): 1.83 cycles per decoded value
  95. testing with dictionary of size 2048
  96. Actual dict size: 2048
  97. scalarcodec.uncompress(t,newbuf): 7.97 cycles per decoded value
  98. decodetocache(&sc, &t,newbuf,bufsize): 7.69 cycles per decoded value
  99. avxcodec.uncompress(t,newbuf): 6.37 cycles per decoded value
  100. AVXDictCODEC::fastuncompress(t,newbuf): 3.76 cycles per decoded value
  101. AVXdecodetocache(&t,newbuf,bufsize): 3.64 cycles per decoded value
  102. AVX512DictCODEC::fastuncompress(t,newbuf): 2.11 cycles per decoded value
  103. AVX512decodetocache(&t,newbuf,bufsize): 1.91 cycles per decoded value
  104. testing with dictionary of size 4096
  105. Actual dict size: 4096
  106. scalarcodec.uncompress(t,newbuf): 8.53 cycles per decoded value
  107. decodetocache(&sc, &t,newbuf,bufsize): 8.20 cycles per decoded value
  108. avxcodec.uncompress(t,newbuf): 6.67 cycles per decoded value
  109. AVXDictCODEC::fastuncompress(t,newbuf): 3.58 cycles per decoded value
  110. AVXdecodetocache(&t,newbuf,bufsize): 3.56 cycles per decoded value
  111. AVX512DictCODEC::fastuncompress(t,newbuf): 2.55 cycles per decoded value
  112. AVX512decodetocache(&t,newbuf,bufsize): 2.35 cycles per decoded value
  113. testing with dictionary of size 8192
  114. Actual dict size: 8192
  115. scalarcodec.uncompress(t,newbuf): 8.66 cycles per decoded value
  116. decodetocache(&sc, &t,newbuf,bufsize): 8.27 cycles per decoded value
  117. avxcodec.uncompress(t,newbuf): 6.79 cycles per decoded value
  118. AVXDictCODEC::fastuncompress(t,newbuf): 3.92 cycles per decoded value
  119. AVXdecodetocache(&t,newbuf,bufsize): 3.86 cycles per decoded value
  120. AVX512DictCODEC::fastuncompress(t,newbuf): 2.80 cycles per decoded value
  121. AVX512decodetocache(&t,newbuf,bufsize): 2.54 cycles per decoded value
  122. testing with dictionary of size 16384
  123. Actual dict size: 16384
  124. scalarcodec.uncompress(t,newbuf): 8.85 cycles per decoded value
  125. decodetocache(&sc, &t,newbuf,bufsize): 8.55 cycles per decoded value
  126. avxcodec.uncompress(t,newbuf): 6.95 cycles per decoded value
  127. AVXDictCODEC::fastuncompress(t,newbuf): 4.05 cycles per decoded value
  128. AVXdecodetocache(&t,newbuf,bufsize): 3.87 cycles per decoded value
  129. AVX512DictCODEC::fastuncompress(t,newbuf): 3.14 cycles per decoded value
  130. AVX512decodetocache(&t,newbuf,bufsize): 2.96 cycles per decoded value
  131. testing with dictionary of size 32768
  132. Actual dict size: 32768
  133. scalarcodec.uncompress(t,newbuf): 6.75 cycles per decoded value
  134. decodetocache(&sc, &t,newbuf,bufsize): 6.81 cycles per decoded value
  135. avxcodec.uncompress(t,newbuf): 6.94 cycles per decoded value
  136. AVXDictCODEC::fastuncompress(t,newbuf): 3.68 cycles per decoded value
  137. AVXdecodetocache(&t,newbuf,bufsize): 3.58 cycles per decoded value
  138. AVX512DictCODEC::fastuncompress(t,newbuf): 3.41 cycles per decoded value
  139. AVX512decodetocache(&t,newbuf,bufsize): 3.24 cycles per decoded value
  140. testing with dictionary of size 65536
  141. Actual dict size: 65536
  142. scalarcodec.uncompress(t,newbuf): 11.75 cycles per decoded value
  143. decodetocache(&sc, &t,newbuf,bufsize): 13.76 cycles per decoded value
  144. avxcodec.uncompress(t,newbuf): 9.64 cycles per decoded value
  145. AVXDictCODEC::fastuncompress(t,newbuf): 5.29 cycles per decoded value
  146. AVXdecodetocache(&t,newbuf,bufsize): 5.50 cycles per decoded value
  147. AVX512DictCODEC::fastuncompress(t,newbuf): 4.54 cycles per decoded value
  148. AVX512decodetocache(&t,newbuf,bufsize): 4.66 cycles per decoded value
  149. testing with dictionary of size 131072
  150. Actual dict size: 131072
  151. scalarcodec.uncompress(t,newbuf): 19.07 cycles per decoded value
  152. decodetocache(&sc, &t,newbuf,bufsize): 19.53 cycles per decoded value
  153. avxcodec.uncompress(t,newbuf): 17.02 cycles per decoded value
  154. AVXDictCODEC::fastuncompress(t,newbuf): 11.02 cycles per decoded value
  155. AVXdecodetocache(&t,newbuf,bufsize): 11.01 cycles per decoded value
  156. AVX512DictCODEC::fastuncompress(t,newbuf): 8.03 cycles per decoded value
  157. AVX512decodetocache(&t,newbuf,bufsize): 8.01 cycles per decoded value
  158. testing with dictionary of size 262144
  159. Actual dict size: 262144
  160. scalarcodec.uncompress(t,newbuf): 22.84 cycles per decoded value
  161. decodetocache(&sc, &t,newbuf,bufsize): 23.12 cycles per decoded value
  162. avxcodec.uncompress(t,newbuf): 20.63 cycles per decoded value
  163. AVXDictCODEC::fastuncompress(t,newbuf): 16.57 cycles per decoded value
  164. AVXdecodetocache(&t,newbuf,bufsize): 16.45 cycles per decoded value
  165. AVX512DictCODEC::fastuncompress(t,newbuf): 13.68 cycles per decoded value
  166. AVX512decodetocache(&t,newbuf,bufsize): 13.69 cycles per decoded value
  167. testing with dictionary of size 524288
  168. Actual dict size: 524288
  169. scalarcodec.uncompress(t,newbuf): 22.34 cycles per decoded value
  170. decodetocache(&sc, &t,newbuf,bufsize): 22.54 cycles per decoded value
  171. avxcodec.uncompress(t,newbuf): 20.36 cycles per decoded value
  172. AVXDictCODEC::fastuncompress(t,newbuf): 16.30 cycles per decoded value
  173. AVXdecodetocache(&t,newbuf,bufsize): 16.34 cycles per decoded value
  174. AVX512DictCODEC::fastuncompress(t,newbuf): 14.91 cycles per decoded value
  175. AVX512decodetocache(&t,newbuf,bufsize): 14.94 cycles per decoded value
  176. testing with dictionary of size 1048576
  177. Actual dict size: 1048235
  178. scalarcodec.uncompress(t,newbuf): 21.93 cycles per decoded value
  179. decodetocache(&sc, &t,newbuf,bufsize): 22.11 cycles per decoded value
  180. avxcodec.uncompress(t,newbuf): 19.91 cycles per decoded value
  181. AVXDictCODEC::fastuncompress(t,newbuf): 16.33 cycles per decoded value
  182. AVXdecodetocache(&t,newbuf,bufsize): 16.30 cycles per decoded value
  183. AVX512DictCODEC::fastuncompress(t,newbuf): 15.32 cycles per decoded value
  184. AVX512decodetocache(&t,newbuf,bufsize): 15.31 cycles per decoded value

Limitations

  • We support just one dictionary. In practice, one might want to use horizontal partitions.
  • We do not have a realistic usage of the dictionary values (we use a uniform distribution).
  • For simplicity, we assume that the dictionary is made of 64-bit words. It is hard-coded in the code, but not a fundamental limitation: the code would be faster with smaller words.
  • This code is not meant to be use in production. It is a demo.
  • This code makes up its own convenient format. It is not meant to plug as-is into an existing framework.
  • We assume that the arrays are large. If you have tiny arrays… well…
  • We effectively measure steady-state throughput. So we ignore costs such as loading up the dictionary in CPU cache.

Authors

Daniel Lemire and Eric Daniel (motivated by parquet-cpp)

Other relevant libraries