Inspired by "Avoid character-by-character processing when performance matters": https://lemire.me/blog/2020/07/21/avoid-character-by-character-processing-when-performance-matters/
The test strings in allCountries.txt.gz and cities500.txt.gz are sourced from Geonames.org; I extracted the second column (name) from the corresponding files. The enwik8.gz file is sourced from The Large Text Compression Benchmark. The test files are gzipped to save space. Together they provide a good mix of ASCII and non-ASCII data.
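The benchmarked implementations are not reproduced in this README, so the following is only a minimal sketch of what the method names in the results below could stand for. It assumes that each method checks whether a string is pure ASCII (as the "Non-Ascii lines" statistic suggests) and that the input is the UTF-8 encoding of the string. The function names and the chunk size of 16 are hypothetical. Python is used purely to illustrate the logic; none of the timings below come from this code.

```python
import re

# Hypothetical names; the benchmarked implementations are not shown in this README.
_NON_ASCII = re.compile(rb"[^\x00-\x7f]")


def is_ascii_regex(data: bytes) -> bool:
    """Regex: search for any byte outside the range 0x00-0x7F."""
    return _NON_ASCII.search(data) is None


def is_ascii_branchy(data: bytes) -> bool:
    """Branchy: test every byte and return at the first non-ASCII one."""
    for b in data:
        if b >= 0x80:
            return False
    return True


def is_ascii_branchless(data: bytes) -> bool:
    """Branchless: OR all bytes together and test the high bit once at the end."""
    acc = 0
    for b in data:
        acc |= b
    return acc < 0x80


def is_ascii_hybrid(data: bytes, chunk: int = 16) -> bool:
    """Hybrid: branchless inside each chunk, one early-exit test per chunk.

    The chunk size of 16 is an assumption made for this sketch.
    """
    for start in range(0, len(data), chunk):
        acc = 0
        for b in data[start:start + chunk]:
            acc |= b
        if acc >= 0x80:
            return False
    return True


if __name__ == "__main__":
    samples = [b"Stockholm", "Malmö".encode("utf-8"), b""]
    for check in (is_ascii_regex, is_ascii_branchy,
                  is_ascii_branchless, is_ascii_hybrid):
        print(check.__name__, [check(s) for s in samples])
```

The trade-off the three loop variants illustrate: the branchy version can stop early but pays for a test on every byte, the branchless version never stops early but does no per-byte test, and the hybrid version tests only once per chunk.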
Benching allCountries.txt.gz
Lines : 7,522,986
Avg. length : 14.58
Max. length : 151
Non-Ascii lines : 23.52 %
Measuring methods... please be patient...
Regex Avg: 1.1677s Min: 1.1001s Max: 1.2178s 6,442,421 strings/sec
Branchy1 Avg: 0.0552s Min: 0.0497s Max: 0.0581s 136,181,375 strings/sec
Branchy2 Avg: 0.0533s Min: 0.0481s Max: 0.0615s 141,014,802 strings/sec
Branchless Avg: 0.0561s Min: 0.0519s Max: 0.0584s 134,095,915 strings/sec
Hybrid Avg: 0.0518s Min: 0.0483s Max: 0.0553s 145,139,683 strings/sec
Benching cities500.txt.gz
Lines : 165,957
Avg. length : 10.14
Max. length : 65
Non-Ascii lines : 20.12 %
Measuring methods... please be patient...
Regex Avg: 0.0224s Min: 0.0218s Max: 0.0231s 7,404,511 strings/sec
Branchy1 Avg: 0.0011s Min: 0.0008s Max: 0.0013s 152,820,546 strings/sec
Branchy2 Avg: 0.0011s Min: 0.0008s Max: 0.0013s 152,532,605 strings/sec
Branchless Avg: 0.0011s Min: 0.0009s Max: 0.0013s 148,741,642 strings/sec
Hybrid Avg: 0.0012s Min: 0.0008s Max: 0.0015s 142,158,282 strings/sec
Benching enwik8.gz
Lines : 1,128,024
Avg. length : 87.32
Max. length : 4,173
Non-Ascii lines : 6.35 %
Measuring methods... please be patient...
Regex Avg: 0.2885s Min: 0.2559s Max: 0.3619s 3,910,404 strings/sec
Branchy1 Avg: 0.0163s Min: 0.0158s Max: 0.0173s 69,345,382 strings/sec
Branchy2 Avg: 0.0150s Min: 0.0141s Max: 0.0168s 75,191,574 strings/sec
Branchless Avg: 0.0160s Min: 0.0156s Max: 0.0164s 70,479,739 strings/sec
Hybrid Avg: 0.0141s Min: 0.0134s Max: 0.0151s 80,107,376 strings/sec
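For readers who want to produce a similar report, here is a rough sketch of how the statistics and timings above could be gathered. It is not the harness that generated these numbers. The function names and the run count are hypothetical, lengths are counted in characters rather than bytes (an assumption), and strings/sec is taken as the line count divided by the average time, which appears consistent with the reported figures up to rounding.

```python
import gzip
import time


def load_lines(path):
    """Read a gzipped UTF-8 text file into a list of strings, one per line."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def summarize(lines):
    """Return (line count, average length, max length, percent non-ASCII lines).

    Expects a non-empty list; lengths are measured in characters (assumption).
    """
    count = len(lines)
    lengths = [len(s) for s in lines]
    non_ascii = sum(1 for s in lines if not s.isascii())
    return count, sum(lengths) / count, max(lengths), 100.0 * non_ascii / count


def bench(check, lines, runs=10):
    """Time `runs` full passes of `check` over `lines`.

    Returns (average seconds, min seconds, max seconds, strings per second).
    """
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        for s in lines:
            check(s)
        timings.append(time.perf_counter() - start)
    avg = sum(timings) / runs
    rate = len(lines) / avg if avg > 0 else float("inf")
    return avg, min(timings), max(timings), rate


if __name__ == "__main__":
    # A tiny in-memory sample; use load_lines("cities500.txt.gz") for real data.
    lines = ["Stockholm", "Malmö", "a"]
    print(summarize(lines))
    avg, lo, hi, rate = bench(str.isascii, lines, runs=3)
    print(f"Avg: {avg:.4f}s Min: {lo:.4f}s Max: {hi:.4f}s {rate:,.0f} strings/sec")
```

Loading all lines into memory before timing keeps decompression and decoding out of the measured loop, so only the check itself is timed.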