Re: LuaJIT benchmark

  • From: Fredrik Widlund <fredrik.widlund@xxxxxxxxx>
  • To: luajit@xxxxxxxxxxxxx
  • Date: Tue, 18 Mar 2014 22:02:11 +0100

https://code.google.com/p/hash-table-benchmarks/

I believe C++11 uses open addressing, and for integer keys the identity
hash function. khash and dtype do for sure; they use quadratic probing and
power-of-two capacities. Memory usage definitely differs between
implementations.

Yes, looking separately at misses and hits makes sense.


On Tue, Mar 18, 2014 at 9:09 PM, Dan Eloff <dan.eloff@xxxxxxxxx> wrote:

> Looking at your second page of results is fascinating. I'm at a loss to
> explain them. The dominant cost is cache misses, but looking at just the
> C/C++ results for a moment to compare apples to apples, how can one
> explain the 3x performance difference for dense/khash/dtype at 100M
> items? No difference in instruction count should be able to come close to
> the minimum ~100 cycles per cache miss + ~100 cycles per TLB miss that
> all implementations must pay for a random lookup.
>
> I would love to know the reason for the speed difference with
> std::unordered_map. Do they use separate chaining or some other
> cache-unfriendly collision algorithm? At 3x the speed, they must somehow
> be spending ~600 extra cycles, which is possible if on average they do 3
> additional cache misses (with associated TLB misses) per lookup. No
> reasonable number of function calls (even virtual function calls),
> branches, or even an expensive divide instruction for prime-sized
> hashtables can account for a 600-cycle difference; I think only cache
> misses can cost that much (even uncontended locking wouldn't be that
> expensive.)
>
> None of that, in my mind, explains the speed difference for small numbers
> of items. Perhaps std::unordered_map costs some 4x more in instructions,
> as well as having worse cache behavior?
>
> The size chosen may have some effect, depending on the fill ratio of the
> hashtable implementations, e.g. if one table ends up nearly double the
> size of the other. One cannot quite conclude that std::unordered_map is
> 3x slower, as it may just end up sized much less advantageously for
> power-of-10 item counts.
>
> I think these benchmarks could be quite illuminating, but some work needs
> to be done to determine the causes of the speed differences. Simply
> reading the performance counters (see the perf tool) for cache misses,
> frontend and backend stalls, and the like might help shed some light on
> it, as would checking the source for the collision resolution mechanism
> and the actual sizes of the allocated tables at 100M items.
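For example (a sketch; `./hash_bench` is a placeholder for the actual benchmark binary), Linux perf can report those counters directly:

```shell
# Count cache, TLB and pipeline-stall events for one benchmark run.
perf stat -e cycles,instructions,cache-references,cache-misses \
          -e stalled-cycles-frontend,stalled-cycles-backend \
          -e dTLB-load-misses \
          ./hash_bench 100000000
```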
>
> Separating out the misses from the hits into two separate runs may also be
> informative since the hashtable design can make tradeoffs between the two.
>
> Would you publish the entire benchmark code so that we can reproduce the
> results?
>
>
>
> On Tue, Mar 18, 2014 at 12:29 PM, Fredrik Widlund <
> fredrik.widlund@xxxxxxxxx> wrote:
>
>> If this is more than a casual off-hand remark kind of discussion I'll be
>> happy to take part in it.
>>
>> The point of this benchmark is to measure the cost of a random lookup in
>> a hash table of different sizes, with integers, and a hit probability of
>> 50%.
>>
>> Will the measurement include how different environments handle:
>> a) Memory access patterns - Yes
>> b) Instruction pipelining patterns - Yes
>>
>> Is this realistic? Yes.
>>
>> As far as I understand, the reasoning here revolves around one of two things.
>>
>> 1) The use case. You shouldn't benchmark integers. You shouldn't have a
>> hit probability of 50%. These are subjective, somewhat arbitrary
>> arguments. The use cases of a hash table when doing numerical
>> calculations, or handling large volumes of traffic in a router, are very
>> different from mapping headers in an HTTP request. Unfortunately I don't
>> have time to cover a very large range of use cases, and this is enough
>> for me to begin with, but looking at inserts/deletes, memory usage, and
>> different hit ratios and key types is of course relevant as well, given
>> some time. Random integers are a very common test case when benchmarking
>> hash tables, though, one reason being that you could have an equally
>> long, or longer, discussion around what an appropriate set of 100M
>> string keys would be, and which hash functions are in turn suitable for
>> that set of keys.
>>
>> 2) This is the wrong way to measure (1). Given the use case in (1), how
>> would you accurately measure the cost of a random lookup operation?
>> Doing a large number of these operations and measuring the time is a
>> very common approach. If you have an approach to suggest that works in
>> fairly similar ways across the different implementations, I would be
>> very happy to switch to using it.
>>
>> From the benchmark so far it seems Google DenseHash (C++) and khash (C)
>> are on the order of 2x-3x faster than doing lookups in Lua using LuaJIT.
>> Is this fair? Yes, probably. A serious attempt to show otherwise would
>> be interesting.
>>
>> Is it interesting to look at inserts/deletes, memory usage, 5% or 95%
>> hit ratios, strings, worst cases, and so forth? Yes. Feel free to pitch
>> in, or give constructive suggestions.
>>
>>
>> On Tue, Mar 18, 2014 at 3:38 PM, Dan Eloff <dan.eloff@xxxxxxxxx> wrote:
>>
>>> I agree with Mike, the dominant costs here are an unpredictable branch
>>> (~17 or more cycles for a branch misprediction, 50% of the time) and
>>> the cache miss (~100-400 cycles). Typical integer hashing (some bitwise
>>> ops and maybe a multiply, with a bitwise AND for a power-of-2 sized
>>> table) will cost less than the branch misprediction, and is rounding
>>> error against the cache miss.
>>>
>>> In fact I suspect that if you can enable huge pages for the Java heap
>>> and make sure you have enough huge pages allocated on your test
>>> machine, then Java should take the benchmark, because on the large
>>> tables you pay not just a cache miss but also a TLB miss, which means
>>> walking the 4-level page table, roughly 3 cache misses on average
>>> (1-4). But the benchmark may not show that, because all it's doing is
>>> hash lookups, and it's possible nearly the entire page table stays in
>>> cache. In a real program that won't be the case.
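For reference, setting that up on Linux with a HotSpot JVM looks roughly like this (a sketch: the page count is arbitrary, `HashBench` is a placeholder class name, and `-XX:+UseLargePages` is the HotSpot flag for large-page heaps):

```shell
# Reserve 2 MiB huge pages (here ~2 GiB worth); requires root.
echo 1024 | sudo tee /proc/sys/vm/nr_hugepages

# Run the benchmark with the JVM heap backed by large pages.
java -XX:+UseLargePages -Xms2g -Xmx2g HashBench
```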
>>>
>>> I think if you really want to show hash table implementation
>>> differences you need to eliminate the other effects. Ditch the branch
>>> misprediction, and use a lookup set small enough that the elements you
>>> want to look up all fit within the L2 cache and the TLB (e.g. look up
>>> at most 512 random elements; look them up once before the timed loop,
>>> and then time it.)
>>>
>>>
>>> On Mon, Mar 17, 2014 at 5:30 AM, Fredrik Widlund <
>>> fredrik.widlund@xxxxxxxxx> wrote:
>>>
>>>> Sigh, the Lua program doesn't even use local variables in the
>>>>> inner loop ...
>>>>
>>>>
>>>> Updated the benchmark to declare x local inside the inner loop
>>>> instead of just before it, which gave some 10% improvement on
>>>> small/medium-sized hashes.
>>>>
>>>> Oh, and before anyone thinks this really measures hash table
>>>>> implementation quality: it's just measuring the overhead of branch
>>>>> mispredictions and cache misses for an atypical use case of hash
>>>>> tables.
>>>>
>>>>
>>>> Currently it's measuring different hash algorithms' integer lookup
>>>> performance with a hit rate of 50%. It is a naive benchmark for now,
>>>> because this is currently sufficient to make a point later on.
>>>>
>>>> If I have some extra time I'll test more variables later on: different
>>>> hit rates and key types. If you have a reference to a source where
>>>> statistics on hash table use cases are investigated, I'll use that as
>>>> a basis.
>>>>
>>>>
>>>>
>>>> On Sun, Mar 16, 2014 at 11:57 PM, Mike Pall <mike-1403@xxxxxxxxxx> wrote:
>>>>
>>>>> Fredrik Widlund wrote:
>>>>> > Benchmark of LuaJIT tables vs Java HashMap vs C++ unordered_map
>>>>>
>>>>> Sigh, the Lua program doesn't even use local variables in the
>>>>> inner loop ...
>>>>>
>>>>> Oh, and before anyone thinks this really measures hash table
>>>>> implementation quality: it's just measuring the overhead of branch
>>>>> mispredictions and cache misses for an atypical use case of hash
>>>>> tables.
>>>>>
>>>>> [Typical hash table accesses use a) string keys with b) a high
>>>>> temporal correlation and c) they usually have a high hit-rate or a
>>>>> high miss-rate, but it's rarely 50/50.]
>>>>>
>>>>> --Mike
>>>>>
>>>>>
>>>>
>>>
>>
>
