Just a few weeks after the last update here I’m again to write the next one in which I’m going to show you the performance of my new hash table implementation.
It is based on the Leapfrog Probing article by Jeff Preshing, where the author explains a new probing algorithm for collision resolution when using open addressing.
My version is not multi-threaded but in return it is able to remove elements, a task that turned out to be quite more challenging than what I expected in the beginning. I also took the opportunity to change the default hash function to FNV-1a.
My previous implementation was based on separate chaining with list head cells, a pretty common technique that allows for unlimited number of stored elements in the face of many potential cache misses when the load factor increases.
The following results are a courtesy of the Google Benchmark library, the latest third-party software integrated in the nCine.
They have been recorded on an Intel i7-8550U, running Arch Linux with the
performance scaling governor and a fixed frequency of 800 MHz.
The tests have been performed on different implementations, they all have in common an
unsigned int for the key and the value and 1024 buckets:
StaticHashMapis the new open addressing hash map in a version that takes the number of buckets as a template argument, similarly to
HashMapis the new open addressing hash map in a more classic version that allocates on the heap
HashMapListis the old hash map with separate chaining
unordered_mapis the standard hash table implementation from the STL
Some tests have a number after their name representing either the number of elements already in the hash table, like for Copy, Retrieve or Clear, or the number of times the operation is repeated, like for Insert, Remove or ReverseRemove.
|Creation||443 ns||34106 ns||86589 ns||13143 ns|
|Copy/256||2859 ns||31501 ns||78531 ns||1656125 ns|
|Copy/512||2912 ns||31004 ns||745993 ns||3600840 ns|
|Copy/768||2895 ns||34312 ns||1075128 ns||4931709 ns|
|Insert/256||4725 ns||9670 ns||9313 ns||1581495 ns|
|Insert/512||8423 ns||19566 ns||674695 ns||3169524 ns|
|Insert/768||13896 ns||33050 ns||1007072 ns||6504032 ns|
|Retrieve/256||66 ns||67 ns||75 ns||72 ns|
|Retrieve/512||71 ns||74 ns||81 ns||72 ns|
|Retrieve/768||76 ns||77 ns||83 ns||73 ns|
|Clear/256||1733 ns||22215 ns||42527 ns||1541229 ns|
|Clear/512||1733 ns||21789 ns||701556 ns||3078237 ns|
|Clear/768||1734 ns||22076 ns||1025812 ns||4609838 ns|
|Remove/256||5083 ns||25895 ns||35680 ns||1565479 ns|
|Remove/512||9459 ns||38793 ns||703407 ns||3136167 ns|
|Remove/768||14828 ns||53607 ns||1040247 ns||4662149 ns|
|ReverseRemove/256||4906 ns||25564 ns||35849 ns||1560456 ns|
|ReverseRemove/512||9620 ns||38567 ns||700163 ns||3113046 ns|
|ReverseRemove/768||14527 ns||51159 ns||1031670 ns||4669154 ns|
Last but not least, I have replaced all the occurrences of the classic
rand() function with a new PCG32 random number generator based on the official minimal C implementation.
It has a higher statistical quality while also being a lot faster. In my microbenchmarks, with the same CPU settings as the hash map ones, it is able to generate 1024 integers in around 14203 ns, as opposed to 74303 ns when using simple