Data-dependent slowdown in loop involving io.lines()

  • From: Tudor Bosman <tudorb@xxxxxxxxx>
  • To: "luajit@xxxxxxxxxxxxx" <luajit@xxxxxxxxxxxxx>
  • Date: Fri, 07 Nov 2014 01:24:10 +0000

This is a reduced test case from production code; we noticed that looping
over a large list of filenames was taking a long time, so we decided to dig
deeper. I tested this with LuaJIT 2.0.2 and 2.1, on Linux x86_64.

I'm attaching two Lua files.

gen.lua generates a 56MB file with 1 million lines. It can generate the
file in one of two formats that differ only in the last few characters on
each line (corresponding lines are of the same length in both formats). Run
as:

  luajit gen.lua 1 > /tmp/file1
  luajit gen.lua 2 > /tmp/file2
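
Since the attachment is not reproduced here, the following is only a rough
sketch of what a generator of this shape might look like. The path prefix,
the zero-padded counter, and the name make_line are assumptions chosen so
that each line is 56 bytes including the newline; the real gen.lua may lay
out its lines differently, and this sketch is not known to reproduce the
slowdown.

```lua
-- Hypothetical stand-in for gen.lua; the line layout is an assumption.
-- The two formats differ only in the last six characters of each line.
local SUFFIX = { [1] = "JPEG.1", [2] = "1.JPEG" }

-- Return line i in the given format (1 or 2), without the trailing newline.
-- 12-byte prefix + 36 digits + "." + 6-byte suffix = 55 bytes.
local function make_line(i, format)
  return string.format("/tmp/images/%036d.%s", i, SUFFIX[format])
end

-- Generate the file only when a valid format number is passed, e.g.
--   luajit gen.lua 1 > /tmp/file1
local format = tonumber(arg and arg[1])
if SUFFIX[format] then
  for i = 1, 1000000 do
    io.write(make_line(i, format), "\n")
  end
end
```

Both suffixes have the same length, so corresponding lines in the two
outputs have the same length, as described above.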

wc.lua counts lines in stdin, similarly to running the Unix command
"wc -l". Run as:

  luajit wc.lua < /tmp/file1
  luajit wc.lua < /tmp/file2
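
A minimal sketch of such a line counter, assuming the loop simply iterates
over io.lines() and increments a counter (the actual wc.lua attachment is
not shown here; count_lines and the script-name check are assumptions made
for illustration):

```lua
-- Hypothetical stand-in for wc.lua; the original attachment is not available.
-- Count the items produced by a line iterator such as io.lines().
local function count_lines(line_iter)
  local n = 0
  for _ in line_iter do
    n = n + 1
  end
  return n
end

-- Read stdin only when this file is invoked as wc.lua, e.g.
--   luajit wc.lua < /tmp/file1
if arg and arg[0] and arg[0]:find("wc%.lua$") then
  print(count_lines(io.lines()))
end
```

Each call to the io.lines() iterator returns the next line as a new Lua
string, so the loop creates one string per line of input.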

Running wc.lua on the first file (where lines end in JPEG.1) takes 0.4
seconds on my machine. Running it on the second file (where lines end in
1.JPEG) takes over 4 seconds: 4.2 seconds with LuaJIT 2.0.2 and 5.3
seconds with LuaJIT 2.1.

Moreover, if we allocate a lot of memory on the heap before the loop (a
table with 10 million integer keys and integer values), the first file
takes longer (3.2 seconds), whereas the second file... well, I interrupted
it after 30 seconds.
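
For reference, the heap-allocation variant could be sketched roughly as
follows. The script name wc_ballast.lua, the function make_ballast, and
the command-line argument are assumptions made for illustration, not taken
from the original test case:

```lua
-- Hypothetical variant of the line-counting loop with a large table
-- allocated first; names and the command-line argument are assumptions.

-- Build an integer-keyed, integer-valued table with n entries.
local function make_ballast(n)
  local t = {}
  for i = 1, n do
    t[i] = i
  end
  return t
end

-- Run only when an entry count is given, e.g.
--   luajit wc_ballast.lua 10000000 < /tmp/file2
local n = tonumber(arg and arg[1])
if n then
  local ballast = make_ballast(n)  -- allocated before the loop
  local count = 0
  for _ in io.lines() do
    count = count + 1
  end
  -- Use the table after the loop so it stays reachable while the loop runs.
  print(count, #ballast)
end
```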

I suspect this has something to do with the way strings are hashed, but I
haven't dug any deeper.

Thanks,
-Tudor.

Attachment: gen.lua
Description: Binary data

Attachment: wc.lua
Description: Binary data