Re: FFI hashtable?

  • From: Reini Urban <reini@xxxxxxxxxx>
  • To: luajit@xxxxxxxxxxxxx
  • Date: Sun, 15 Mar 2015 13:39:21 +0100

On 03/15/2015 04:38 AM, Javier Guerra Giraldez wrote:
> On Sat, Mar 14, 2015 at 8:16 AM, Reini Urban <reini@xxxxxxxxxx> wrote:
>> But why is still nobody using the faster and better iSCSI CRC32-C hash
>> function which is builtin into hardware on modern intel processors for a
>> decade or now also armv7. You cannot beat 5 cycles and the much better
>> characteristics of CRC32-C vs. fnv, murmur or city.
> 
> 
> this seems to be a highly controversial issue, with lots of people
> insulting each other's work, so it might be better not to jump to
> conclusions.  as always, it's best to test each option with the
> intended use and remember there's no one "best for everything"
> 
> Personally, i've traditionally found CRCs to be OK for error detection
> but not for uniqueness nor for hashtables.  I've just checked the -C
> variant and seems to be much better than others, so it might be worth
> a new try.  on the other hand, it's revealing that all the uses
> mentioned in wikipedia (iSCSI, SCTP, G.hn payload, Btrfs, ext4, Ceph)
> are for error detection.
> 
> for hashtables, modern hashes like murmur3 or city are quite good, but
> when hashing network-input data, you're quite open to DoS attacks, and
> you should at least consider a crypto-based hash.  In my tests with
> small bytestrings input (in the range of 5-30 bytes), I've found
> SIPhash to take between 1.5 and 1.7x the time of murmur3, so it's
> usable on any but the most performance-constraint cases.  At the same
> time, it presents much less collisions on all the real-world datasets
> i've thrown it, so in the end it tends to perform better than either
> murmur or city.

You shalt not mixup hash functions with hash tables on security concerns.
All hash functions are by definition insecure, they need to be fast
and not secure.
Security by a stronger hash function is completely misguided.
Security starts with 256 bits, hash tables use 32 or 64 bit, and
typically use 5-8 bits on small hash tables.

That said, CRC32-C is like all those CRC functions one of the easiest to
break. Just add the calculated CRC value to the end of the key, and you
will always get 0, thus you can easily DOS it, when you don't use a
seed. But all other fast hash functions are also crackable within
milliseconds.

So when you use a non-random seed, you need to think of better ways to
protect against malicious complexity attacks. Double hashing is clearly
superior over the usual simple linked list implementations, so khash and
its variants or glibc hsearch are fine, but ordered hash tables (sorted
buckets, typical max buckets size in real world hash tables is 8, so you
can stay with insertion sort), universal hashing (as I linked to) or
even a simple sleep on >10 collisions as in djb's dns server should be
good enough.

Using the glibc hsearch would be the best idea. It is available almost
everywhere, and it also uses double hashing similar to khash:
http://code.metager.de/source/xref/gnu/glibc/misc/hsearch_r.c


Attachment: signature.asc
Description: OpenPGP digital signature

Other related posts: