Re: LuaJIT benchmark

  • From: Fredrik Widlund <fredrik.widlund@xxxxxxxxx>
  • To: luajit@xxxxxxxxxxxxx
  • Date: Thu, 20 Mar 2014 11:27:17 +0100

Personally, I'm really surprised by the comments that test results like
these are "only about X".

Factors involved include, for example:
- using chaining, open addressing, or variants thereof
- hash functions
- max table loads
- table sizes
- probing strategy
- data representation
- memory usage
- memory access patterns
- cache locality
- instruction pipelining patterns
- inner loop optimization
- function call inlining
- ...

Each one of these can in itself strongly affect performance. Combined, the
difference can easily be an order of magnitude. That Google DenseHash, for
example, can be several times faster than C++11 unordered_map has been
documented many times.
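
To make a couple of these concrete: an open-addressing table probes
consecutive slots in a flat array on a collision, which is cache-friendly,
while a chained table follows a pointer to a separately allocated node for
each collision. A minimal Lua sketch of the open-addressing side
(illustrative only, not taken from any of the benchmarked implementations):

local bit = require("bit")

-- slots is a flat array with `size` entries; each entry is either false
-- (empty) or a { key, value } pair; size must be a power of two
-- (assumes the table is never completely full, as real implementations
-- cap the load factor)
local function lookup(slots, size, hash, k)
  local i = bit.band(hash(k), size - 1)
  while true do
    local slot = slots[i + 1]
    if not slot then return nil end          -- empty slot: the key is absent
    if slot[1] == k then return slot[2] end  -- key matches: hit
    -- linear probing: the next slot is adjacent in memory, so it often
    -- sits in the same cache line as the one we just touched
    i = bit.band(i + 1, size - 1)
  end
end

-- e.g. with an identity "hash" and an 8-slot table holding key 2:
local slots = { false, false, { 2, "two" }, false, false, false, false, false }
print(lookup(slots, 8, function(k) return k end, 2))   --> two
print(lookup(slots, 8, function(k) return k end, 5))   --> nil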

The statement that this "is just measuring the overhead of branch
mispredictions and cache misses" is perhaps a bit unfortunate and does not
make that much sense here.



On Tue, Mar 18, 2014 at 9:09 PM, Dan Eloff <dan.eloff@xxxxxxxxx> wrote:

> Looking at the second page of results you have is fascinating. I'm at a
> loss to explain them. The dominant cost is cache misses, but looking at
> just the C/C++ results for a moment, to compare apples to apples, how can
> one explain the 3x performance advantage for dense/khash/dtype at 100M
> items? No amount of difference instruction-wise should be able to come
> close to the minimum ~100 cycles per cache miss + ~100 cycles per TLB miss
> that all implementations must pay for a random lookup.
>
> I would love to know the reason for the speed difference with
> std::unordered_map. Do they use separate chaining or another
> cache-unfriendly collision algorithm? At 3x the speed, they must somehow
> spend some ~600 extra cycles, which is possible if on average they're
> doing 3 additional cache misses (with associated TLB misses) per lookup.
> No reasonable amount of function calls (even virtual function calls),
> branches, or even an expensive divide instruction for prime-sized
> hashtables can account for a 600-cycle difference; I think only cache
> misses can cost that much (even uncontended locking wouldn't be that
> expensive.)
>
> None of that, in my mind, explains the speed difference for small numbers
> of items. Perhaps std::unordered_map costs some 4x as much in instructions,
> as well as having worse cache behavior?
>
> The size chosen may have some effect, depending on the fill ratio of the
> hashtable implementations, e.g. if one table is nearly double the size of
> the other. The conclusion cannot quite be drawn that std::unordered_map is
> 3x slower, as it may just end up being a lot less advantageously sized for
> power-of-10 item counts.
>
> I think these benchmarks could be quite illuminating, but some work needs
> to be done to determine the causes of the speed differences. Simply
> utilizing the performance counters (see the perf tool) for cache misses,
> frontend and backend stalls, and that kind of thing might help shed some
> light on it, as well as checking the source for the collision resolution
> mechanism and what the actual sizes of the allocated tables would be at
> 100M items.
>
> Separating out the misses from the hits into two separate runs may also be
> informative since the hashtable design can make tradeoffs between the two.
>
> Would you publish the entire benchmark code so that we can reproduce the
> results?
>
>
>
> On Tue, Mar 18, 2014 at 12:29 PM, Fredrik Widlund <
> fredrik.widlund@xxxxxxxxx> wrote:
>
>> If this is more than a casual off-hand remark kind of discussion I'll be
>> happy to take part in it.
>>
>> The point of this benchmark is to measure the cost of a random lookup in
>> hash tables of different sizes, with integer keys and a hit probability
>> of 50%.
>>
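>> The core of the measurement in Lua looks roughly like this (sizes and
>> names are illustrative, not the actual benchmark source):
>>
>> local size = 2^20
>> local t = {}
>> -- store even keys only, so a uniformly random lookup over [1, 2*size]
>> -- hits with probability ~50%
>> for i = 1, size do t[2 * i] = i end
>>
>> math.randomseed(42)
>> local hits = 0
>> local start = os.clock()
>> for i = 1, 1e7 do
>>   local x = t[math.random(2 * size)]
>>   if x ~= nil then hits = hits + 1 end
>> end
>> print(("%.0f lookups/s (%d hits)"):format(1e7 / (os.clock() - start), hits))
>>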
>> Will the measurement include how different environments handle
>> a) Memory access patterns - Yes
>> b) Instruction pipelining patterns - Yes
>>
>> Is this realistic? Yes.
>>
>> As far as I understand, the reasoning here is around one of two things.
>>
>> 1) The use case. You shouldn't benchmark integers. You shouldn't have a
>> hit probability of 50%. These are very subjective, somewhat arbitrary
>> arguments. The use cases of a hash table when doing numerical
>> calculations, or handling large volumes of traffic in a router, are very
>> different from mapping headers in an HTTP request. Unfortunately I don't
>> have time to cover a very large range of use cases, and this is enough for
>> me to begin with, but looking at inserts/deletes, memory usage, different
>> hit ratios and key types is of course relevant as well, given some time.
>> It is very common to use random integers as a test case when benchmarking
>> hash tables though, one reason being that you could have an equally long,
>> or longer, discussion around what an appropriate set of 100M string keys
>> would be, and which hash functions would in turn be suitable for that set
>> of keys.
>>
>> 2) This is the wrong way to measure (1). Given the use case in (1), how
>> would you accurately measure the cost of a random lookup operation? Doing
>> a large number of these operations and measuring the time is a very common
>> approach. If you have an approach to suggest that works in fairly similar
>> ways for the different implementations, I would be very happy to switch to
>> using it.
>>
>> From the benchmark so far it seems Google DenseHash (C++) and khash (C)
>> are on the order of 2x-3x faster than doing lookups in Lua using LuaJIT.
>> Is this fair? Yes, probably. A serious attempt to show otherwise would be
>> interesting.
>>
>> Is it interesting to look at inserts/deletes, memory usage, 5% or 95% hit
>> ratios, strings, worst cases, and so forth? Yes. Feel free to pitch in, or
>> give constructive suggestions.
>>
>>
>> On Tue, Mar 18, 2014 at 3:38 PM, Dan Eloff <dan.eloff@xxxxxxxxx> wrote:
>>
>>> I agree with Mike, the dominant costs here are an unpredictable branch
>>> (~17 or more cycles for a branch misprediction, 50% of the time) and the
>>> cache miss (~100-400 cycles). Typical integer hashing (some bitwise ops
>>> and maybe a multiply, with a bitwise AND for a power-of-2 sized table)
>>> will cost less than the branch misprediction, and is a rounding error
>>> against the cache miss.
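>>>
>>> For scale, a cheap integer hash is only a handful of ALU ops. A sketch
>>> using LuaJIT's bit library (the mixing steps are illustrative, not any
>>> particular implementation's hash function):
>>>
>>> local bit = require("bit")
>>>
>>> -- map an integer key to a slot index; size must be a power of two
>>> local function slot_of(k, size)
>>>   local h = bit.tobit(k)
>>>   -- a few shifts and xors to mix the bits (a real implementation might
>>>   -- also multiply by a constant; doing that exactly in plain Lua needs
>>>   -- care, since numbers are doubles)
>>>   h = bit.bxor(h, bit.rshift(h, 16))
>>>   h = bit.bxor(h, bit.lshift(h, 5))
>>>   h = bit.bxor(h, bit.rshift(h, 13))
>>>   -- bitwise AND instead of a divide, thanks to the power-of-two size
>>>   return bit.band(h, size - 1)
>>> end
>>>
>>> print(slot_of(123456789, 1024))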
>>>
>>> In fact I suspect that if you can enable huge pages for the Java heap
>>> and make sure you have enough huge pages allocated on your test machine,
>>> then Java should take the benchmark, because on the large tables you pay
>>> not just a cache miss but also a TLB miss, which means navigating the
>>> 4-level page table, which should be roughly 3 cache misses on average
>>> (1-4). But the benchmark may not show that, because all it's doing is
>>> hash lookups, and it's possible nearly the entire page table is in cache.
>>> In a real program that won't be the case.
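>>>
>>> (On HotSpot that would mean running with something like
>>> -XX:+UseLargePages, assuming the machine actually has huge pages
>>> reserved, e.g. via /proc/sys/vm/nr_hugepages.)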
>>>
>>> I think if you really want to show hash table implementation differences
>>> you need to eliminate the other effects. Ditch the branch misprediction,
>>> and use a lookup set small enough that the elements you want to look up
>>> will all fit within the L2 cache and the TLB (e.g. look up 512 random
>>> elements; look them up once before the timed loop, and then time it.)
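>>>
>>> Roughly like this (sizes and names are illustrative):
>>>
>>> local t, keys = {}, {}
>>> for i = 1, 512 do
>>>   local k = math.random(2^30)
>>>   t[k] = i
>>>   keys[i] = k
>>> end
>>>
>>> local sink = 0
>>> -- warm-up pass: afterwards the 512 entries (and the page table entries
>>> -- behind them) should be cache- and TLB-resident
>>> for i = 1, 512 do sink = sink + t[keys[i]] end
>>>
>>> local start = os.clock()
>>> for rep = 1, 1e5 do
>>>   -- every lookup hits, so the hit/miss branch is perfectly predictable
>>>   for i = 1, 512 do sink = sink + t[keys[i]] end
>>> end
>>> print(os.clock() - start, sink)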
>>>
>>>
>>> On Mon, Mar 17, 2014 at 5:30 AM, Fredrik Widlund <
>>> fredrik.widlund@xxxxxxxxx> wrote:
>>>
>>>> Sigh, the Lua program doesn't even use local variables in the
>>>>> inner loop ...
>>>>
>>>>
>>>> Updated the benchmark to declare x local inside the inner loop instead
>>>> of just before it, which made for some 10% improvement in small/medium
>>>> sized hashes.
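>>>>
>>>> That is, the change was essentially this (names are illustrative, not
>>>> the actual benchmark source):
>>>>
>>>> local t, keys, n, hits = {}, {}, 1e6, 0
>>>> for i = 1, n do keys[i] = i; if i % 2 == 0 then t[i] = true end end
>>>>
>>>> -- before: x declared once, just outside the inner loop
>>>> local x
>>>> for i = 1, n do
>>>>   x = t[keys[i]]
>>>>   if x ~= nil then hits = hits + 1 end
>>>> end
>>>>
>>>> -- after: x declared inside the loop body, so its lifetime is a single
>>>> -- iteration and the JIT compiler has less state to keep alive
>>>> for i = 1, n do
>>>>   local x = t[keys[i]]
>>>>   if x ~= nil then hits = hits + 1 end
>>>> end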
>>>>
>>>> Oh, and before anyone thinks this really measures hash table
>>>>> implementation quality: it's just measuring the overhead of branch
>>>>> mispredictions and cache misses for an atypical use case of hash
>>>>> tables.
>>>>
>>>>
>>>> Currently it's measuring different hash algorithms' integer lookup
>>>> performance with a hit rate of 50%. It is a naive benchmark for now,
>>>> because this is currently sufficient to be able to make a point later
>>>> on.
>>>>
>>>> If I have some extra time later on I will test different hit rates and
>>>> key types. If you have a reference to a source where statistics around
>>>> hash table use cases are investigated, I'll use that as a basis.
>>>>
>>>>
>>>>
>>>> On Sun, Mar 16, 2014 at 11:57 PM, Mike Pall <mike-1403@xxxxxxxxxx> wrote:
>>>>
>>>>> Fredrik Widlund wrote:
>>>>> > Benchmark of LuaJIT tables vs Java HashMap vs C++ unordered_map
>>>>>
>>>>> Sigh, the Lua program doesn't even use local variables in the
>>>>> inner loop ...
>>>>>
>>>>> Oh, and before anyone thinks this really measures hash table
>>>>> implementation quality: it's just measuring the overhead of branch
>>>>> mispredictions and cache misses for an atypical use case of hash
>>>>> tables.
>>>>>
>>>>> [Typical hash table accesses use a) string keys with b) a high
>>>>> temporal correlation and c) they usually have a high hit-rate or a
>>>>> high miss-rate, but it's rarely 50/50.]
>>>>>
>>>>> --Mike
>>>>>
>>>>>
>>>>
>>>
>>
>
