Re: Data-dependent slowdown in loop involving io.lines()

  • From: Peter Cawley <corsix@xxxxxxxxxx>
  • To: luajit@xxxxxxxxxxxxx
  • Date: Sat, 8 Nov 2014 23:14:34 +0000

A potential idea: for strings longer than 12 bytes, we define both a
sparse hash and a dense hash. In lj_str_new, we start with the sparse
hash, but after finding N existing interned strings with the same
length and same hash value, we start again with the dense hash (N=2
seems reasonable).



On Sat, Nov 8, 2014 at 10:51 PM, Coda Highland <chighland@xxxxxxxxx> wrote:
> On Sat, Nov 8, 2014 at 2:29 PM, Florian Weimer <fw@xxxxxxxxxxxxx> wrote:
>> * Tudor Bosman:
>>
>>> Couldn't LuaJIT, for example, do full (dense) hashing for strings of length
>>> of less than 100 or so bytes, and assume that all longer strings are data
>>> and therefore likely to be unique (and intern them without any hashing /
>>> comparison check)? Is there an assumption that two strings are equal if
>>> *and only if* their interned ids are the same?
>>
>> Once you have uninterned strings anywhere, you cannot use address
>> comparisons to check for string equality anymore, slowing everything
>> down.
>>
>> Maybe it is possible to use a different collision resolution strategy
>> (e.g., a crit-bit tree per hash bucket).  That would only affect
>> string interning, and would not create changes throughout the VM.  I
>> think that in terms of performance, its performance impact would be
>> mostly restricted to strings whose hashes collide (where the benefits
>> would greatly outweight the costs), although there will be the
>> occasional hash bucket collision between strings of different hash
>> value.  All this seems to involve very difficult trade-offs.
>
> I think the conclusion to be made here is that you should drag in
> fopen and fread through the FFI and treat the content as char arrays
> instead of using lines() if you need that kind of performance.
>
> /s/ Adam
>

Other related posts: