Personally I'm really surprised by the comments that test results like
these are "only about X". Factors involved include, for example:

- chaining, open addressing, or variants thereof
- hash functions
- max table load factors
- table sizes
- probing strategy
- data representation
- memory usage
- memory access patterns
- cache locality
- instruction pipelining patterns
- inner loop optimization
- function call inlining
- ...

Each one of these can in itself strongly affect performance. Combined,
the difference can easily be an order of magnitude. That, for example,
Google DenseHash can be several times faster than C++11 unordered_map
has been documented many times.

The statement that this "is just measuring the overhead of branch
mispredictions and cache misses" is perhaps a bit unfortunate and does
not make that much sense here.

On Tue, Mar 18, 2014 at 9:09 PM, Dan Eloff <dan.eloff@xxxxxxxxx> wrote:

> Looking at the second page of results you have is fascinating. I'm at a
> loss to explain them. The dominant cost is cache misses, but looking at
> just the C/C++ results for a moment to compare apples to apples, how can
> one explain the 3x performance of dense/khash/dtype at 100M items? No
> amount of difference instruction-wise should be able to come close to the
> minimum ~100 cycles per cache miss + ~100 cycles per TLB miss that all
> implementations must pay for a random lookup.
>
> I would love to know the reason for the speed difference with
> std::unordered_map. Do they use separate chaining or a cache-unfriendly
> collision algorithm? To be 3x slower, they must somehow spend some ~600
> cycles extra, which is possible if on average they're doing 3 additional
> cache misses (with associated TLB misses) per lookup. No reasonable
> number of function calls (even virtual function calls), branches, or even
> an expensive divide instruction for prime-sized hash tables can account
> for a 600-cycle difference; I think only cache misses can cost that much
> (even uncontended locking wouldn't be that expensive.)
>
> None of that, in my mind, explains the speed difference for small numbers
> of items. Perhaps std::unordered_map does some 4x the cost in
> instructions, as well as having worse cache behavior?
>
> The size chosen may have some effect, depending on the fill ratio of the
> hash table implementations, e.g. if one table is nearly double the size
> of the other. The conclusion cannot quite be drawn that
> std::unordered_map is 3x slower, as it may just end up being a lot less
> advantageously sized for power-of-10 item counts.
>
> I think these benchmarks could be quite illuminating, but some work needs
> to be done to determine the causes of the speed differences. Simply
> utilizing the performance counters (see the perf tool) for cache misses,
> frontend and backend stalls, and that kind of thing might help shed some
> light on it, as well as checking the source for the collision resolution
> mechanism and what the actual sizes of the allocated tables would be at
> 100M items.
>
> Separating out the misses from the hits into two separate runs may also
> be informative, since the hash table design can make tradeoffs between
> the two.
>
> Would you publish the entire benchmark code so that we can reproduce the
> results?
>
>
> On Tue, Mar 18, 2014 at 12:29 PM, Fredrik Widlund <
> fredrik.widlund@xxxxxxxxx> wrote:
>
>> If this is more than a casual off-hand remark kind of discussion I'll
>> be happy to take part in it.
>>
>> The point of this benchmark is to measure the cost of a random lookup
>> in a hash table of different sizes, with integers, and a hit
>> probability of 50%.
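>>
>> As a rough sketch (not the actual benchmark source, which is not
>> included in this thread; N and M here are my names for the table size
>> and the number of lookups), the measured loop looks conceptually like
>> this on the Lua side:
>>
>>   math.randomseed(42)
>>   local t = {}
>>   for i = 1, N do t[2 * i] = i end       -- keys: even integers 2..2N
>>   local random = math.random
>>   local hits = 0
>>   local t0 = os.clock()
>>   for i = 1, M do
>>     local x = t[random(2 * N)]           -- uniform over 1..2N: 50% hits
>>     if x then hits = hits + 1 end
>>   end
>>   print(os.clock() - t0, hits / M)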
>>
>> Will the measurement include how different environments handle
>> a) memory access patterns - Yes
>> b) instruction pipelining patterns - Yes
>>
>> Is this realistic? Yes.
>>
>> As far as I understand, the reasoning here is around one of two things.
>>
>> 1) The use case. You shouldn't benchmark integers. You shouldn't have a
>> hit probability of 50%. These are very subjective, somewhat arbitrary
>> arguments. The use cases of a hash table when doing numerical
>> calculations, or handling large volumes of traffic in a router, are
>> very different from mapping headers in an HTTP request. Unfortunately I
>> don't have time to cover a very large range of use cases and this is
>> enough for me to begin with, but looking at inserts/deletes, memory
>> usage, different hit ratios and key types is of course relevant as
>> well, given some time. It is very common to have random integers as a
>> test case when benchmarking hash tables, though, one reason being that
>> you could have an equally long, or longer, discussion around what an
>> appropriate set of 100M string keys would be, and which hash functions
>> in turn are suitable for that set of keys.
>>
>> 2) This is the wrong way to measure (1). Given the use case in (1), how
>> would you accurately measure the cost of a random lookup operation?
>> Doing a large number of these operations and measuring the time is a
>> very common approach. If you have an approach to suggest that works in
>> fairly similar ways for the different implementations I would be very
>> happy to switch to using it.
>>
>> From the benchmark so far it seems Google DenseHash (C++) and khash (C)
>> are on the order of 2x-3x faster compared to doing lookups in Lua using
>> LuaJIT. Is this fair? Yes, probably. A serious attempt to show
>> otherwise would be interesting.
>>
>> Is it interesting to look at inserts/deletes, memory usage, 5% or 95%
>> hit ratios, strings, worst cases, and so forth? Yes. Feel free to pitch
>> in, or give constructive suggestions.
>>
>>
>> On Tue, Mar 18, 2014 at 3:38 PM, Dan Eloff <dan.eloff@xxxxxxxxx> wrote:
>>
>>> I agree with Mike, the dominant costs here are an unpredictable branch
>>> (~17 or more cycles for a branch misprediction, 50% of the time) and
>>> the cache miss (~100-400 cycles). Typical integer hashing (some
>>> bitwise ops and maybe a multiply, with a bitwise and for a power-of-2
>>> sized table) will be less than the branch misprediction cost, and a
>>> rounding error against the cache miss.
>>>
>>> In fact I suspect that if you can enable huge pages for the Java heap
>>> and make sure you have enough huge pages allocated on your test
>>> machine, then Java should take the benchmark, because on the large
>>> tables you have not just a cache miss but also a TLB miss, which means
>>> navigating the 4-level page table, which should be roughly 3 cache
>>> misses on average (1-4). But the benchmark may not show that, because
>>> all it's doing is hash lookups and it's possible nearly the entire
>>> page table is in cache. In a real program that won't be the case.
>>>
>>> I think if you really want to show hash table implementation
>>> differences you need to eliminate the other effects. Ditch the branch
>>> misprediction, and use a lookup set small enough that the elements you
>>> want to look up will all fit within the L2 cache and the TLB (e.g.
>>> look up at most 512 random elements. Look them up once before the
>>> timed loop, and then time it.)
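>>>
>>> As a sketch of what I mean (hypothetical code, keeping the integer-key
>>> setup from the benchmark; M is just my name for a repeat count):
>>>
>>>   math.randomseed(42)
>>>   local t, keys = {}, {}
>>>   for i = 1, 512 do
>>>     local k = math.random(2 ^ 30)   -- sparse keys, kept in the hash part
>>>     keys[i] = k
>>>     t[k] = i                        -- every key present: the hit branch
>>>   end                               -- always goes the same way
>>>   for i = 1, 512 do local _ = t[keys[i]] end  -- warm pass: caches/TLB hot
>>>   local sum = 0
>>>   local t0 = os.clock()
>>>   for j = 1, M do
>>>     for i = 1, 512 do sum = sum + t[keys[i]] end
>>>   end
>>>   print(os.clock() - t0, sum)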
>>>
>>>
>>> On Mon, Mar 17, 2014 at 5:30 AM, Fredrik Widlund <
>>> fredrik.widlund@xxxxxxxxx> wrote:
>>>
>>>>> Sigh, the Lua program doesn't even use local variables in the
>>>>> inner loop ...
>>>>
>>>> Updated the benchmark to declare x local inside the inner loop
>>>> instead of just before it, which made for some 10% improvement in
>>>> small/medium-sized hashes.
>>>>
>>>>> Oh, and before anyone thinks this really measures hash table
>>>>> implementation quality: it's just measuring the overhead of branch
>>>>> mispredictions and cache misses for an atypical use case of hash
>>>>> tables.
>>>>
>>>> Currently it's measuring different hash algorithms' integer lookup
>>>> performance with a hit rate of 50%. It is a naive benchmark for now
>>>> because this is currently sufficient to be able to make a point
>>>> later on.
>>>>
>>>> If I have some extra time later on I will test more variables:
>>>> different hit rates and key types. If you have a reference to a
>>>> source where statistics around hash table use cases are investigated
>>>> I'll use that as a basis.
>>>>
>>>>
>>>> On Sun, Mar 16, 2014 at 11:57 PM, Mike Pall <mike-1403@xxxxxxxxxx>
>>>> wrote:
>>>>
>>>>> Fredrik Widlund wrote:
>>>>> > Benchmark of LuaJIT tables vs Java HashMap vs C++ unordered_map
>>>>>
>>>>> Sigh, the Lua program doesn't even use local variables in the
>>>>> inner loop ...
>>>>>
>>>>> Oh, and before anyone thinks this really measures hash table
>>>>> implementation quality: it's just measuring the overhead of branch
>>>>> mispredictions and cache misses for an atypical use case of hash
>>>>> tables.
>>>>>
>>>>> [Typical hash table accesses use a) string keys with b) a high
>>>>> temporal correlation and c) they usually have a high hit-rate or a
>>>>> high miss-rate, but it's rarely 50/50.]
>>>>>
>>>>> --Mike
>>>>>
>>>>>
>>>>
>>>
>>
>
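For reference, the inner-loop change Fredrik describes above (declaring
x local inside the loop rather than just before it) amounts to something
like the following sketch, reusing the hypothetical names t, random, N,
M and hits from the first sketch in this thread; this is not the actual
benchmark source:

  -- before: x lives in the enclosing scope
  local x
  for i = 1, M do
    x = t[random(2 * N)]
    if x then hits = hits + 1 end
  end

  -- after: x is scoped to each iteration (reported ~10% faster above
  -- for small/medium-sized tables)
  for i = 1, M do
    local x = t[random(2 * N)]
    if x then hits = hits + 1 end
  end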