Re: LuaJIT benchmark

  • From: Fredrik Widlund <fredrik.widlund@xxxxxxxxx>
  • To: luajit@xxxxxxxxxxxxx
  • Date: Tue, 18 Mar 2014 22:39:25 +0100

On a different note, GC or memory management in LuaJIT seems to be less
reliable.

Running the benchmark mentioned above for 10 iterations works fine in Lua,
while LuaJIT crashes on the second iteration.

$ luajit benchmark.lua 10000000 1000000 1

12.79 MOPS

$ luajit benchmark.lua 10000000 1000000 3

PANIC: unprotected error in call to Lua API (not enough memory)

$ lua benchmark.lua 10000000 1000000 10

3.43 MOPS
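
For reference, a minimal sketch of what I mean by iterations (this is not
the actual benchmark.lua; the build and lookup details are simplified).
Forcing a full collect between builds would show whether the panic is just
the old table not being reclaimed before the next one is allocated:

local size, iterations = 10000000, 3
for i = 1, iterations do
  local t = {}
  for k = 1, size do t[k] = k end   -- rebuild ~10M entries each iteration
  -- (timed lookup phase omitted here)
  t = nil                           -- drop the only reference to the table
  collectgarbage("collect")         -- force a full GC cycle before the next build
  print("iteration", i, "heap KB:", collectgarbage("count"))
end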



On Tue, Mar 18, 2014 at 10:02 PM, Fredrik Widlund <fredrik.widlund@xxxxxxxxx
> wrote:

> https://code.google.com/p/hash-table-benchmarks/
>
> I believe C++11 uses open addressing, and for integer keys the identity
> hash function. khash and dtype certainly do; they use quadratic probing
> and power-of-two capacities. Memory usage definitely differs between
> implementations.
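>
> As a rough illustration (a sketch, not khash's actual code, with resizing
> omitted), that scheme amounts to the following:
>
> local bit = require("bit")           -- LuaJIT bit library
>
> local CAP = 16                       -- power-of-two capacity
> local MASK = CAP - 1
> local slots = {}                     -- open-addressed slot array, indices 0..CAP-1
>
> local function find_slot(key)
>   local i = bit.band(key, MASK)      -- identity hash, masked to the capacity
>   local step = 0
>   while slots[i] ~= nil and slots[i] ~= key do
>     step = step + 1                  -- quadratic probing via triangular numbers;
>     i = bit.band(i + step, MASK)     -- offsets 1, 3, 6, 10, ... visit every slot
>   end
>   return i                           -- either key's slot or the first empty one
> end
>
> local function insert(key) slots[find_slot(key)] = key end
> local function contains(key) return slots[find_slot(key)] == key end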
>
> Yes, looking separately at misses and hits makes sense.
>
>
> On Tue, Mar 18, 2014 at 9:09 PM, Dan Eloff <dan.eloff@xxxxxxxxx> wrote:
>
>> The second page of results you have is fascinating. I'm at a loss to
>> explain them. The dominant cost is cache misses, but looking at just the
>> C/C++ results for a moment to compare apples to apples, how can one
>> explain the 3x performance for dense/khash/dtype at 100M items? No
>> difference in instruction counts should come close to the minimum ~100
>> cycles per cache miss + ~100 cycles per TLB miss that all implementations
>> must pay for a random lookup.
>>
>> I would love to know the reason for the speed difference with
>> std::unordered_map. Does it use separate chaining or some other
>> cache-unfriendly collision algorithm? Running at a third of the speed, it
>> must somehow incur ~600 extra cycles, which is plausible if on average it
>> is doing 3 additional cache misses (with associated TLB misses) per
>> lookup. No reasonable number of function calls (even virtual calls),
>> branches, or even an expensive divide instruction for prime-sized hash
>> tables can account for a 600-cycle difference; I think only cache misses
>> can cost that much (even uncontended locking wouldn't be that expensive).
>>
>> None of that, in my mind, explains the speed difference for small numbers
>> of items. Perhaps std::unordered_map costs some 4x as much in
>> instructions, as well as having worse cache behavior?
>>
>> The size chosen may have some effect, depending on the fill ratio of the
>> hash table implementations, e.g. if one table is nearly double the size of
>> the other. The conclusion cannot quite be drawn that std::unordered_map is
>> 3x slower, as it may just end up sized much less advantageously for
>> power-of-10 item counts.
>>
>> I think these benchmarks could be quite illuminating, but some work needs
>> to be done to determine the causes of the speed differences. Simply using
>> the performance counters (see the perf tool) for cache misses, frontend
>> and backend stalls, and the like might help shed some light on it, as
>> would checking the source for the collision-resolution mechanism and the
>> actual sizes of the allocated tables at 100M items.
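>>
>> For example, something along these lines (exact event names vary by
>> kernel and CPU, and ./benchmark is a placeholder for each implementation):
>>
>> $ perf stat -e cache-references,cache-misses,dTLB-load-misses,branch-misses \
>>     ./benchmark 100000000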
>>
>> Separating the misses from the hits into two separate runs may also be
>> informative, since a hash table design can trade off between the two.
>>
>> Would you publish the entire benchmark code so that we can reproduce the
>> results?
>>
>>
>>
>> On Tue, Mar 18, 2014 at 12:29 PM, Fredrik Widlund <
>> fredrik.widlund@xxxxxxxxx> wrote:
>>
>>> If this is more than a casual, off-hand-remark kind of discussion, I'll
>>> be happy to take part in it.
>>>
>>> The point of this benchmark is to measure the cost of a random lookup in
>>> hash tables of different sizes, with integer keys and a hit probability
>>> of 50%.
>>>
>>> Will the measurement include how different environments handle
>>> a) memory access patterns? Yes.
>>> b) instruction pipelining patterns? Yes.
>>>
>>> Is this realistic? Yes.
>>>
>>> As far as I understand, the reasoning here is around one of two things.
>>>
>>> 1) The use case. You shouldn't benchmark integers. You shouldn't have a
>>> hit probability of 50%. These are very subjective, somewhat arbitrary
>>> arguments. The use cases for a hash table when doing numerical
>>> calculations, or when handling large volumes of traffic in a router, are
>>> very different from mapping headers in an HTTP request. Unfortunately I
>>> don't have time to cover a very large range of use cases, and this is
>>> enough for me to begin with; looking at inserts/deletes, memory usage,
>>> and different hit ratios and key types is of course relevant as well,
>>> given some time. Random integers are a very common test case when
>>> benchmarking hash tables though, one reason being that you could have an
>>> equally long, or longer, discussion about what an appropriate set of
>>> 100M string keys would be, and which hash functions are in turn suitable
>>> for that set of keys.
>>>
>>> 2) This is the wrong way to measure (1). Given the use case in (1), how
>>> would you accurately measure the cost of a random lookup operation?
>>> Performing a large number of these operations and measuring the time is
>>> a very common approach. If you can suggest an approach that works in
>>> fairly similar ways across the different implementations, I would be
>>> very happy to switch to it.
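>>>
>>> In outline, the measurement amounts to the following (a sketch of the
>>> approach in Lua, not the published code):
>>>
>>> local size    = tonumber(arg[1])       -- table entries
>>> local lookups = tonumber(arg[2])       -- lookups to time
>>>
>>> local t = {}
>>> for k = 1, size do t[k] = k end
>>>
>>> local hits = 0
>>> local start = os.clock()
>>> for n = 1, lookups do
>>>   local x = t[math.random(size * 2)]   -- keys in [1, 2*size] => ~50% hit rate
>>>   if x then hits = hits + 1 end
>>> end
>>> local elapsed = os.clock() - start
>>> print(string.format("%.2f MOPS (%d hits)", lookups / elapsed / 1e6, hits))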
>>>
>>> From the benchmark so far it seems Google DenseHash (C++) and khash (C)
>>> are on the order of 2x-3x faster than doing lookups in Lua using LuaJIT.
>>> Is this fair? Yes, probably. A serious attempt to show otherwise would
>>> be interesting.
>>>
>>> Is it interesting to look at inserts/deletes, memory usage, 5% or 95%
>>> hit ratios, strings, worst cases, and so forth? Yes. Feel free to pitch
>>> in, or give constructive suggestions.
>>>
>>>
>>> On Tue, Mar 18, 2014 at 3:38 PM, Dan Eloff <dan.eloff@xxxxxxxxx> wrote:
>>>
>>>> I agree with Mike: the dominant costs here are an unpredictable branch
>>>> (~17 or more cycles for a branch misprediction, 50% of the time) and
>>>> the cache miss (~100-400 cycles). Typical integer hashing (some bitwise
>>>> ops and maybe a multiply, with a bitwise AND for a power-of-2 sized
>>>> table) will cost less than the branch misprediction, and is rounding
>>>> error next to the cache miss.
>>>>
>>>> In fact I suspect that if you can enable huge pages for the Java heap
>>>> and make sure you have enough huge pages allocated on your test
>>>> machine, then Java should win the benchmark, because on the large
>>>> tables you take not just a cache miss but also a TLB miss, which means
>>>> walking the 4-level page table, roughly 3 cache misses on average
>>>> (1-4). The benchmark may not show that, though, because all it does is
>>>> hash lookups, so it's possible nearly the entire page table stays in
>>>> cache. In a real program that won't be the case.
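>>>>
>>>> For HotSpot that would be something like the following, assuming enough
>>>> huge pages have been reserved via /proc/sys/vm/nr_hugepages (Benchmark
>>>> standing in for whatever the Java harness is called):
>>>>
>>>> $ java -XX:+UseLargePages Benchmark 100000000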
>>>>
>>>> I think if you really want to show hash table implementation
>>>> differences you need to eliminate the other effects. Ditch the branch
>>>> misprediction, and use a lookup set small enough that the elements you
>>>> want to look up all fit within the L2 cache and the TLB (e.g. look up
>>>> at most 512 random elements; touch them once before the timed loop,
>>>> then time it).
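>>>>
>>>> In Lua terms, something like this sketch (sizes and names are
>>>> illustrative, and the summation is just there to avoid a hit/miss
>>>> branch):
>>>>
>>>> local bit = require("bit")
>>>>
>>>> local t = {}
>>>> for k = 1, 10000000 do t[k] = k end         -- large table, as before
>>>>
>>>> local keys = {}
>>>> for j = 1, 512 do keys[j] = math.random(10000000) end
>>>> for j = 1, 512 do local _ = t[keys[j]] end  -- warm-up pass, not timed
>>>>
>>>> local sum = 0
>>>> local start = os.clock()
>>>> for n = 1, 10000000 do
>>>>   sum = sum + t[keys[bit.band(n, 511) + 1]] -- all hits, cycling 512 hot keys
>>>> end
>>>> print(os.clock() - start, sum)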
>>>>
>>>>
>>>> On Mon, Mar 17, 2014 at 5:30 AM, Fredrik Widlund <
>>>> fredrik.widlund@xxxxxxxxx> wrote:
>>>>
>>>>> Sigh, the Lua program doesn't even use local variables in the
>>>>>> inner loop ...
>>>>>
>>>>>
>>>>> Updated the benchmark to declare x local inside the inner loop instead
>>>>> of just before it, which gave roughly 10% improvement on small/medium
>>>>> sized hashes.
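>>>>>
>>>>> Concretely, the change was along these lines (a simplified sketch of
>>>>> the inner loop, not the full benchmark):
>>>>>
>>>>> local t = {}
>>>>> for k = 1, 1000000 do t[k] = k end
>>>>> local random = math.random       -- cache the global lookup in a local
>>>>>
>>>>> -- before: x declared once, just before the loop
>>>>> local x
>>>>> for n = 1, 1000000 do
>>>>>   x = t[random(2000000)]
>>>>> end
>>>>>
>>>>> -- after: x is a fresh local per iteration, so its lifetime is
>>>>> -- confined to the loop body, which is easier for the JIT to optimize
>>>>> for n = 1, 1000000 do
>>>>>   local x = t[random(2000000)]
>>>>> end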
>>>>>
>>>>> Oh, and before anyone thinks this really measures hash table
>>>>>> implementation quality: it's just measuring the overhead of branch
>>>>>> mispredictions and cache misses for an atypical use case of hash
>>>>>> tables.
>>>>>
>>>>>
>>>>> Currently it's measuring different hash algorithms' integer lookup
>>>>> performance with a hit rate of 50%. It is a naive benchmark for now,
>>>>> because that is sufficient for the point I want to make later on.
>>>>>
>>>>> If I have some extra time later on I will test different hit rates and
>>>>> key types. If you have a reference to a source where statistics on
>>>>> hash table use cases are investigated, I'll use that as a basis.
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Mar 16, 2014 at 11:57 PM, Mike Pall <mike-1403@xxxxxxxxxx> wrote:
>>>>>
>>>>>> Fredrik Widlund wrote:
>>>>>> > Benchmark of LuaJIT tables vs Java HashMap vs C++ unordered_map
>>>>>>
>>>>>> Sigh, the Lua program doesn't even use local variables in the
>>>>>> inner loop ...
>>>>>>
>>>>>> Oh, and before anyone thinks this really measures hash table
>>>>>> implementation quality: it's just measuring the overhead of branch
>>>>>> mispredictions and cache misses for an atypical use case of hash
>>>>>> tables.
>>>>>>
>>>>>> [Typical hash table accesses use a) string keys with b) a high
>>>>>> temporal correlation and c) they usually have a high hit-rate or a
>>>>>> high miss-rate, but it's rarely 50/50.]
>>>>>>
>>>>>> --Mike
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
