A potential idea: for strings longer than 12 bytes, we define both a sparse hash and a dense hash. In lj_str_new, we start with the sparse hash, but after finding N existing interned strings with the same length and same hash value, we start again with the dense hash (N=2 seems reasonable). On Sat, Nov 8, 2014 at 10:51 PM, Coda Highland <chighland@xxxxxxxxx> wrote: > On Sat, Nov 8, 2014 at 2:29 PM, Florian Weimer <fw@xxxxxxxxxxxxx> wrote: >> * Tudor Bosman: >> >>> Couldn't LuaJIT, for example, do full (dense) hashing for strings of length >>> of less than 100 or so bytes, and assume that all longer strings are data >>> and therefore likely to be unique (and intern them without any hashing / >>> comparison check)? Is there an assumption that two strings are equal if >>> *and only if* their interned ids are the same? >> >> Once you have uninterned strings anywhere, you cannot use address >> comparisons to check for string equality anymore, slowing everything >> down. >> >> Maybe it is possible to use a different collision resolution strategy >> (e.g., a crit-bit tree per hash bucket). That would only affect >> string interning, and would not create changes throughout the VM. I >> think that in terms of performance, its performance impact would be >> mostly restricted to strings whose hashes collide (where the benefits >> would greatly outweight the costs), although there will be the >> occasional hash bucket collision between strings of different hash >> value. All this seems to involve very difficult trade-offs. > > I think the conclusion to be made here is that you should drag in > fopen and fread through the FFI and treat the content as char arrays > instead of using lines() if you need that kind of performance. > > /s/ Adam >