On a different note, GC or memory management in LuaJIT seems to be less
reliable. Running the benchmark mentioned with 10 iterations works well in
Lua, while it crashes at 2 with LuaJIT.

$ luajit benchmark.lua 10000000 1000000 1
12.79 MOPS
$ luajit benchmark.lua 10000000 1000000 3
PANIC: unprotected error in call to Lua API (not enough memory)
$ lua benchmark.lua 10000000 1000000 10
3.43 MOPS

On Tue, Mar 18, 2014 at 10:02 PM, Fredrik Widlund <fredrik.widlund@xxxxxxxxx> wrote:

> https://code.google.com/p/hash-table-benchmarks/
>
> I believe C++11 uses open addressing, and for integer keys the identity
> hash function. khash and dtype do for sure, using quadratic probing and
> power-of-two capacities. Memory usage is definitely different between
> implementations.
>
> Yes, looking separately at misses and hits makes sense.
>
>
> On Tue, Mar 18, 2014 at 9:09 PM, Dan Eloff <dan.eloff@xxxxxxxxx> wrote:
>
>> Looking at the second page of results you have is fascinating. I'm at a
>> loss to explain them. The dominant cost is cache misses, but looking at
>> just the C/C++ results for a moment to compare apples to apples, how can
>> one explain the 3x performance for dense/khash/dtype at 100M items? No
>> amount of difference instruction-wise should be able to come close to
>> the minimum ~100 cycles per cache miss + ~100 cycles per TLB miss that
>> all implementations must pay for a random lookup.
>>
>> I would love to know the reason for the speed difference with
>> std::unordered_map. Do they use separate chaining or a cache-unfriendly
>> collision algorithm? At a 3x slowdown, they must somehow spend some ~600
>> extra cycles, which is possible if on average they are doing 3
>> additional cache misses (with associated TLB misses) per lookup.
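To put rough numbers on that: a chained lookup dereferences a heap pointer per node visited, each one a potential cache miss (plus a TLB miss), while an open-addressed probe walks a flat array. A minimal sketch of the two layouts (hypothetical structures and helper names, not the actual std::unordered_map, khash, or dtype internals):

```c
#include <stdint.h>
#include <stddef.h>

/* Separate chaining: every step follows a heap pointer, so each node
 * visited is a potential cache miss (plus a potential TLB miss). */
struct chain_node { uint64_t key, val; struct chain_node *next; };

static int chained_find(struct chain_node **buckets, size_t nbuckets,
                        uint64_t key, uint64_t *out) {
    for (struct chain_node *n = buckets[key & (nbuckets - 1)]; n; n = n->next)
        if (n->key == key) { *out = n->val; return 1; }  /* one hop per miss */
    return 0;
}

/* Open addressing: probes walk a flat array, so after the first miss
 * the following probes often land in the same or a nearby cache line. */
struct slot { uint64_t key, val; uint8_t used; };

static int open_find(const struct slot *slots, size_t nslots,
                     uint64_t key, uint64_t *out) {
    size_t i = (size_t)(key & (nslots - 1));   /* power-of-two capacity */
    for (size_t step = 1; step <= nslots; i = (i + step) & (nslots - 1), step++) {
        if (!slots[i].used) return 0;          /* empty slot ends the probe */
        if (slots[i].key == key) { *out = slots[i].val; return 1; }
    }
    return 0;
}
```

Three extra pointer hops at roughly 200 cycles each (cache miss plus TLB miss) would account for the ~600-cycle gap estimated above.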
>> No reasonable amount of function calls (even virtual function calls),
>> branches, or even an expensive divide instruction and prime-sized hash
>> tables can account for a 600-cycle difference; I think only cache misses
>> can cost that much (even uncontended locking wouldn't be that
>> expensive).
>>
>> None of that, in my mind, explains the speed difference for small
>> numbers of items. Perhaps std::unordered_map does some 4x the cost in
>> instructions, as well as having worse cache behavior?
>>
>> The size chosen may have some effect, depending on the fill ratio of the
>> hash table implementations, e.g. if one table is nearly double the size
>> of the other. The conclusion cannot quite be drawn that
>> std::unordered_map is 3x slower, as it may just end up being a lot less
>> advantageously sized for power-of-10 item counts.
>>
>> I think these benchmarks could be quite illuminating, but some work
>> needs to be done to determine the causes of the speed differences.
>> Simply utilizing the performance counters (see the perf tool) for cache
>> misses, frontend and backend stalls and that kind of thing might help
>> shed some light on it, as would checking the source for the collision
>> resolution mechanism and what the actual sizes of the allocated tables
>> would be at 100M items.
>>
>> Separating out the misses from the hits into two separate runs may also
>> be informative, since the hash table design can make trade-offs between
>> the two.
>>
>> Would you publish the entire benchmark code so that we can reproduce the
>> results?
>>
>>
>> On Tue, Mar 18, 2014 at 12:29 PM, Fredrik Widlund
>> <fredrik.widlund@xxxxxxxxx> wrote:
>>
>>> If this is more than a casual, off-hand-remark kind of discussion,
>>> I'll be happy to take part in it.
>>>
>>> The point of this benchmark is to measure the cost of a random lookup
>>> in a hash table of different sizes, with integers, and a hit
>>> probability of 50%.
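Such a measurement loop might look like the following sketch (an assumed shape, since the actual benchmark code is not shown in the thread; `present[]` merely stands in for whichever hash table is under test):

```c
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Illustrative sketch of the 50%-hit-rate random-lookup benchmark:
 * mark each key of the key space as present with probability 1/2,
 * then time uniform random probes over the whole space. */
#define SPACE  (1u << 21)
#define PROBES (1u << 22)
static uint8_t present[SPACE];

/* Returns MOPS (million operations per second), as in the posted output. */
static double run_probes(uint64_t *hits_out) {
    srand(1);
    for (uint32_t k = 0; k < SPACE; k++)
        present[k] = (uint8_t)(rand() & 1);        /* ~50% of keys exist */
    clock_t t0 = clock();
    uint64_t hits = 0;
    for (uint32_t i = 0; i < PROBES; i++)
        hits += present[(uint32_t)rand() % SPACE]; /* random lookup */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    if (hits_out) *hits_out = hits;
    return secs > 0 ? PROBES / secs / 1e6 : 0.0;
}
```

Because the probed keys are uniform over a space where half the keys exist, the hit/miss branch is unpredictable by construction, which is exactly the property the discussion below turns on.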
>>> Will the measurement include how different environments handle
>>> a) memory access patterns - Yes
>>> b) instruction pipelining patterns - Yes
>>>
>>> Is this realistic? Yes.
>>>
>>> As far as I understand, the reasoning here is around one of two things.
>>>
>>> 1) The use case. You shouldn't benchmark integers. You shouldn't have
>>> a hit probability of 50%. These are very subjective, somewhat
>>> arbitrary arguments. The use cases of a hash table when doing
>>> numerical calculations, or handling large volumes of traffic in a
>>> router, are very different from mapping headers in an HTTP request.
>>> Unfortunately I don't have time to cover a very large range of use
>>> cases, and this is enough for me to begin with, but looking at
>>> inserts/deletes, memory usage, different hit ratios and key types is
>>> of course relevant as well, given some time. It is very common to use
>>> random integers as a test case when benchmarking hash tables, though,
>>> one reason being that you could have an equally long, or longer,
>>> discussion around what an appropriate set of 100M string keys would
>>> be, and which hash functions in turn are suitable for that set of
>>> keys.
>>>
>>> 2) This is the wrong way to measure (1). Given the use case in (1),
>>> how would you accurately measure the cost of a random lookup
>>> operation? Doing a large number of these operations and measuring the
>>> time is a very common approach. If you have an approach to suggest
>>> that works in fairly similar ways for the different implementations,
>>> I would be very happy to switch to using it.
>>>
>>> From the benchmark so far it seems Google DenseHash (C++) and khash
>>> (C) are on the order of 2x-3x faster than doing lookups in Lua using
>>> LuaJIT. Is this fair? Yes, probably. A serious attempt to show
>>> otherwise would be interesting.
>>>
>>> Is it interesting to look at inserts/deletes, memory usage, 5% or 95%
>>> hit ratios, strings, worst cases, and so forth? Yes.
>>> Feel free to pitch in, or give constructive suggestions.
>>>
>>>
>>> On Tue, Mar 18, 2014 at 3:38 PM, Dan Eloff <dan.eloff@xxxxxxxxx> wrote:
>>>
>>>> I agree with Mike, the dominant costs here are an unpredictable
>>>> branch (~17 or more cycles for a branch misprediction, 50% of the
>>>> time) and the cache miss (~100-400 cycles). Typical integer hashing
>>>> (some bitwise ops and maybe a multiply, with a bitwise AND for a
>>>> power-of-2 sized table) will cost less than the branch misprediction,
>>>> and is a rounding error against the cache miss.
>>>>
>>>> In fact I suspect that if you can enable huge pages for the Java heap
>>>> and make sure you have enough huge pages allocated on your test
>>>> machine, then Java should take the benchmark, because on the large
>>>> tables you have not just a cache miss but also a TLB miss, which
>>>> means navigating the 4-level page table, which should be roughly 3
>>>> cache misses on average (1-4). But the benchmark may not show that,
>>>> because all it's doing is hash lookups and it's possible nearly the
>>>> entire page table is in cache. In a real program that won't be the
>>>> case.
>>>>
>>>> I think if you really want to show hash table implementation
>>>> differences you need to eliminate the other effects. Ditch the branch
>>>> misprediction, and use a lookup set small enough that the elements
>>>> you want to look up will all fit within the L2 cache and the TLB
>>>> (e.g. look up up to 512 random elements; look them up once before the
>>>> timed loop, and then time it).
>>>>
>>>>
>>>> On Mon, Mar 17, 2014 at 5:30 AM, Fredrik Widlund
>>>> <fredrik.widlund@xxxxxxxxx> wrote:
>>>>
>>>>>> Sigh, the Lua program doesn't even use local variables in the
>>>>>> inner loop ...
>>>>>
>>>>>
>>>>> Updated the benchmark to declare x local inside the inner loop
>>>>> instead of just before it, which made for some 10% improvement in
>>>>> small/medium-sized hashes.
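Dan's warm-cache suggestion above could be sketched like this (illustrative names; the flat array stands in for any real table's lookup): probe the same 512 keys once, untimed, to pull them into L2 and the TLB, then time repeated passes so that cache misses, TLB misses, and the hit/miss branch drop out and implementation overhead remains.

```c
#include <stdint.h>
#include <stddef.h>
#include <time.h>

/* Warm-cache timing sketch: a small fixed key set that fits in L2,
 * looked up once before timing, then timed over many passes. */
enum { KEYS = 512, PASSES = 10000 };

/* Any table's lookup goes here; a flat array keeps the sketch runnable. */
static uint64_t table[1024];
static uint64_t lookup(uint64_t k) { return table[k & 1023]; }

static double time_warm_lookups(const uint64_t *keys, uint64_t *sink_out) {
    uint64_t sink = 0;
    for (int i = 0; i < KEYS; i++)          /* warm-up pass, untimed */
        sink += lookup(keys[i]);
    clock_t t0 = clock();
    for (int p = 0; p < PASSES; p++)
        for (int i = 0; i < KEYS; i++)
            sink += lookup(keys[i]);        /* timed; all accesses in-cache */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    if (sink_out) *sink_out = sink;         /* keep the loop from being optimized away */
    return secs;
}
```

With the memory-system effects removed, any remaining gap between implementations reflects instruction counts and hashing cost rather than cache behavior.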
>>>>>
>>>>>> Oh, and before anyone thinks this really measures hash table
>>>>>> implementation quality: it's just measuring the overhead of branch
>>>>>> mispredictions and cache misses for an atypical use case of hash
>>>>>> tables.
>>>>>
>>>>>
>>>>> Currently it's measuring different hash implementations' integer
>>>>> lookup performance with a hit rate of 50%. It is a naive benchmark
>>>>> for now, because this is currently sufficient to be able to make a
>>>>> point later on.
>>>>>
>>>>> If I have some extra time I will test different hit rates and key
>>>>> types later on. If you have a reference to a source where statistics
>>>>> around hash table use cases are investigated, I'll use that as a
>>>>> basis.
>>>>>
>>>>>
>>>>> On Sun, Mar 16, 2014 at 11:57 PM, Mike Pall <mike-1403@xxxxxxxxxx> wrote:
>>>>>
>>>>>> Fredrik Widlund wrote:
>>>>>> > Benchmark of LuaJIT tables vs Java HashMap vs C++ unordered_map
>>>>>>
>>>>>> Sigh, the Lua program doesn't even use local variables in the
>>>>>> inner loop ...
>>>>>>
>>>>>> Oh, and before anyone thinks this really measures hash table
>>>>>> implementation quality: it's just measuring the overhead of branch
>>>>>> mispredictions and cache misses for an atypical use case of hash
>>>>>> tables.
>>>>>>
>>>>>> [Typical hash table accesses use a) string keys with b) a high
>>>>>> temporal correlation, and c) they usually have a high hit rate or a
>>>>>> high miss rate, but it's rarely 50/50.]
>>>>>>
>>>>>> --Mike