Re: why can't we just change the hash function

  • From: soumith <soumith@xxxxxxxxx>
  • To: luajit <luajit@xxxxxxxxxxxxx>
  • Date: Mon, 31 Aug 2015 09:24:52 -0700

FWIW, in Torch we've modified the hash function to use TommyDS's hasher and
we are quite happy with it (no decrease in performance reported by our
roughly 1k users).
Loading large text files in which each line is a file path no longer
results in hash collisions every time, and on such operations we see a
>100x speedup, closer to C speeds.

The two commits are:
https://github.com/torch/luajit-rocks/commit/cafad2e668764467d7846719ee964172c4e74b0b
https://github.com/torch/luajit-rocks/commit/2a68100afa22095ff3bc29fcbbf08d69758c5768

On Mon, Aug 31, 2015 at 6:38 AM, William Adams <william_a_adams@xxxxxxxxxxx>
wrote:


----------------------------------------
Date: Mon, 31 Aug 2015 11:26:52 +0300
From: kostja@xxxxxxxxxxxxx
To: luajit@xxxxxxxxxxxxx
Subject: Re: why can't we just change the hash function

* Mike Pall <mikelj-1508@xxxxxxx> [15/08/31 10:52]:
The main issue is that selecting a suitable hash function depends
on many, many factors. This is not just a matter of performance
vs. hash quality or sparse vs. dense hashing, but also a matter of
practicality. You have to consider effects on branch prediction,
register pressure, I-cache usage, portability, and so on ...

This rules out many of those ever-helpful "Oh, just use FOO-hash,
I heard good things about it on reddit" suggestions. Especially if
FOO is an unportable machine instruction that only runs on cutting
edge CPUs. Or if FOO is a C++ library whose dependencies are ten
times larger than LuaJIT itself. Or if FOO is a high-quality hash,
but takes forever and kills the I-cache as collateral damage, too.

Mike, whatever the obstacles to choosing a different hash
function, it is better to pick an imperfect one, give it
a try, and maybe try another one down the road, than to
stay stuck with the existing function.

Unfortunately, as your project grows, the formula of "being very
good for 99.9% of users" and screwing over the unlucky 0.1% stops
working, since those unlucky people end up being the loudest.


As we have a budding new community of maintainers, I assume this means
you'll try out a new variant of murmur hash and then present the
results for all to see. If the alternative makes good overall
tradeoffs, then it has a chance of making it into the source, right? I'm
not sure there's any point debating it when you can just
change the code and let others 'vote' on it.

I take Mike's comments as a guide for what test cases I might want to
create to ensure that any new hash function is satisfactory across a broad
spectrum of measures.
