Re: Data-dependent slowdown in loop involving io.lines()

  • From: soumith <soumith@xxxxxxxxx>
  • To: luajit <luajit@xxxxxxxxxxxxx>
  • Date: Tue, 11 Nov 2014 17:23:11 -0500

There are a few other people who have run into this issue as well:
https://www.freelists.org/post/luajit/Trouble-with-LuaJIT-being-slower-than-Lua
https://www.freelists.org/post/luajit/sometimes-match-or-gmatch-are-very-slow-under-luajit-compared-to-raw-lua
https://www.freelists.org/post/luajit/Poor-performance-test-case

But more importantly, the example given was not contrived to break the
LuaJIT string hasher: the same issue applies to reading any set of long
file paths that share a common folder prefix.

Treating this as a non-issue is quite irresponsible, especially when the
case is so general and would be encountered by quite a few people.
The slowness was only noticed at scale (not many people read in the
paths of 1 million+ files every day), but regardless it sucks, and the
sparse hasher shipped with LuaJIT is too simplistic.

We patched it internally with a hash that is almost as fast but has
better guarantees against breaking down in such trivial cases.
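
To make "better guarantees" concrete, here is a minimal C sketch of a
dense (full-string) hash in the FNV-1a style. This is illustration
only, not the hash we actually shipped:

    #include <stdint.h>
    #include <stddef.h>

    /* FNV-1a over every byte.  Unlike a sparse hash that samples only
       a few positions, this mixes in the whole string, so long paths
       that differ only near the end cannot all land in one chain. */
    static uint32_t dense_hash(const char *s, size_t len)
    {
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < len; i++) {
            h ^= (uint8_t)s[i];
            h *= 16777619u;
        }
        return h;
    }

Hashing every byte is O(len) where a sparse hash is O(1); that cost is
the trade-off the sparse hasher was trying to avoid, and it is why a
full hash of this kind is only almost as fast.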
Hope Mike sees this in a slightly different light rather than brushing
it off.
--S

On Sat, Nov 8, 2014 at 6:14 PM, Peter Cawley <corsix@xxxxxxxxxx> wrote:

> A potential idea: for strings longer than 12 bytes, we define both a
> sparse hash and a dense hash. In lj_str_new, we start with the sparse
> hash, but after finding N existing interned strings with the same
> length and same hash value, we start again with the dense hash (N=2
> seems reasonable).
>
>
>
> On Sat, Nov 8, 2014 at 10:51 PM, Coda Highland <chighland@xxxxxxxxx>
> wrote:
> > On Sat, Nov 8, 2014 at 2:29 PM, Florian Weimer <fw@xxxxxxxxxxxxx> wrote:
> >> * Tudor Bosman:
> >>
> >>> Couldn't LuaJIT, for example, do full (dense) hashing for strings
> >>> of length of less than 100 or so bytes, and assume that all longer
> >>> strings are data and therefore likely to be unique (and intern
> >>> them without any hashing / comparison check)? Is there an
> >>> assumption that two strings are equal if *and only if* their
> >>> interned ids are the same?
> >>
> >> Once you have uninterned strings anywhere, you cannot use address
> >> comparisons to check for string equality anymore, slowing everything
> >> down.
> >>
> >> Maybe it is possible to use a different collision resolution strategy
> >> (e.g., a crit-bit tree per hash bucket).  That would only affect
> >> string interning, and would not create changes throughout the VM.  I
> >> think its performance impact would be mostly restricted to strings
> >> whose hashes collide (where the benefits would greatly outweigh the
> >> costs), although there will be the occasional hash bucket collision
> >> between strings of different hash value.  All this seems to involve
> >> very difficult trade-offs.
> >
> > I think the conclusion to be made here is that you should drag in
> > fopen and fread through the FFI and treat the content as char arrays
> > instead of using lines() if you need that kind of performance.
> >
> > /s/ Adam
> >
>
>
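
A minimal, self-contained C sketch of the hybrid scheme Peter Cawley
suggests above: intern under the cheap sparse hash first and, once a
chain already holds N same-length, same-hash strings, redo the lookup
under a dense hash. All of the names (Str, intern, sparse_hash) and the
sparse hash itself are made-up stand-ins, not LuaJIT's actual
lj_str_new internals:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    enum { TABSIZE = 1 << 16, HYBRID_N = 2 };

    typedef struct Str {
        struct Str *next;     /* hash chain */
        uint32_t hash;        /* hash the string was interned under */
        size_t len;
        char data[];          /* string bytes follow */
    } Str;

    static Str *tab[TABSIZE]; /* toy intern table, power-of-two sized */

    /* Stand-in sparse hash: samples three positions plus the length,
       much as LuaJIT's real hasher samples a few words.  Assumes
       len > 0 (the scheme targets strings longer than 12 bytes). */
    static uint32_t sparse_hash(const char *s, size_t len)
    {
        uint32_t h = (uint32_t)len;
        h ^= (uint8_t)s[0] ^ ((uint32_t)(uint8_t)s[len / 2] << 8)
                           ^ ((uint32_t)(uint8_t)s[len - 1] << 16);
        return h * 2654435761u;
    }

    /* Dense hash: FNV-1a over every byte. */
    static uint32_t dense_hash(const char *s, size_t len)
    {
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < len; i++) {
            h ^= (uint8_t)s[i];
            h *= 16777619u;
        }
        return h;
    }

    /* Two-pass lookup: pass 0 walks the sparse-hash chain; if it
       meets HYBRID_N interned strings of the same length and hash but
       different contents, pass 1 redoes the walk under the dense
       hash. */
    static Str *intern(const char *s, size_t len)
    {
        uint32_t h = sparse_hash(s, len);
        int ncoll = 0;
        for (int pass = 0; pass < 2; pass++) {
            Str *o = tab[h & (TABSIZE - 1)];
            while (o != NULL) {
                if (o->hash == h && o->len == len) {
                    if (memcmp(o->data, s, len) == 0)
                        return o;              /* already interned */
                    if (pass == 0 && ++ncoll >= HYBRID_N)
                        break;                 /* crowded: go dense */
                }
                o = o->next;
            }
            if (o == NULL)
                break;                         /* no crowding: intern */
            h = dense_hash(s, len);            /* retry, dense hash */
        }
        Str *ns = malloc(sizeof(Str) + len);   /* sketch: no OOM check */
        ns->hash = h;
        ns->len = len;
        memcpy(ns->data, s, len);
        ns->next = tab[h & (TABSIZE - 1)];
        tab[h & (TABSIZE - 1)] = ns;
        return ns;
    }

    int main(void)
    {
        Str *a = intern("/data/images/0000001.jpg", 24);
        Str *b = intern("/data/images/0000001.jpg", 24);
        return a == b ? 0 : 1;  /* same string interns to same object */
    }

A real patch would also have to cover garbage collection and
intern-table resizing; this only shows the two-pass lookup that caps
chain growth for colliding inputs.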
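
And along the lines of the fopen/fread suggestion above, the C shape of
that workaround looks like the following: read the file in bulk and
scan the lines as raw char ranges, so nothing is interned per line.
From LuaJIT one would ffi.cdef the same calls and run the same scan
over an FFI buffer; this standalone program is only a hypothetical
illustration:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Reads the whole file into one buffer and walks the lines as raw
       char ranges, so no per-line string object is ever created or
       interned.  Not a drop-in replacement for io.lines(). */
    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;
        FILE *f = fopen(argv[1], "rb");
        if (f == NULL) return 1;
        fseek(f, 0, SEEK_END);
        long size = ftell(f);
        fseek(f, 0, SEEK_SET);
        char *buf = malloc((size_t)size + 1);
        if (buf == NULL || fread(buf, 1, (size_t)size, f) != (size_t)size)
            return 1;
        fclose(f);
        buf[size] = '\n';              /* sentinel ends the last line */

        size_t nlines = 0, nbytes = 0;
        char *end = buf + size;
        for (char *p = buf; p < end; ) {
            char *nl = memchr(p, '\n', (size_t)(end - p) + 1);
            /* [p, nl) is one line; process it in place as char data */
            nbytes += (size_t)(nl - p);
            nlines++;
            p = nl + 1;
        }
        printf("%zu lines, %zu bytes\n", nlines, nbytes);
        free(buf);
        return 0;
    }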
