demetri wrote:
> Thanks Dimiter, that gets us down to 5.5x on our test machine;

First, the C and the LuaJIT files are not doing the same thing (x128 etc. isn't used at all). Also, LuaJIT doesn't need any warm-up time, so you can omit the first loop. And casts to scalar number types are rarely helpful -- use the semantics of bit.* to constrain numbers to integers.

Fixed LuaJIT benchmark attached. Runs at about the same speed as the C code.

BTW: Please use locals *everywhere* and *every time*. Make it a habit. I mean ... you're writing to 8 globals in that short piece of code ... and it's supposed to be a benchmark ...

I cringe every time I see this:

  ffi = require("ffi") -- YUCK!

The FFI module doesn't set a global on purpose. And then you store the result of require in a global, where it's happily overwritten by the next user of the FFI module ... wheeee. Or someone doesn't explicitly require it, but uses it and you'll never notice.

The only acceptable way to require the FFI module is this:

  local ffi = require("ffi")

Which is incidentally explained right at the top of:

  http://luajit.org/ext_ffi_tutorial.html

--Mike
local ffi = require "ffi"
local bit = require "bit"
local tobit, shr = bit.tobit, bit.rshift

ffi.cdef [[
typedef uint16_t Dt;
]]

local function rcadd(r, x, y, n)
  local c = 0
  for i=0,n-1 do
    c = tobit(shr(c, 16) + x[i] + y[i])
    r[i] = c
  end
end

local x128 = ffi.new("Dt[128]")
local y128 = ffi.new("Dt[128]")
local r128 = ffi.new("Dt[128]")

for i=1,1e7 do
  rcadd(r128, x128, y128, 128)
end
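
The benchmark only ever adds zero-filled arrays, so it may help to see the carry chain produce a visible result. The following is a minimal sketch, assuming LuaJIT (for the ffi and bit modules). The two-limb inputs and the asserts are illustrative and not part of the attached file. It repeats rcadd so that it runs on its own, and uses the predefined uint16_t type directly instead of the Dt typedef.

```lua
-- Illustrative sanity check for the ripple-carry add; requires LuaJIT.
local ffi = require("ffi")
local bit = require("bit")
local tobit, shr = bit.tobit, bit.rshift

-- Same loop as in the benchmark: add two arrays of 16-bit limbs,
-- least significant limb first, propagating the carry.
local function rcadd(r, x, y, n)
  local c = 0
  for i = 0, n-1 do
    -- shr(c, 16) is the carry out of the previous limb (0 or 1).
    c = tobit(shr(c, 16) + x[i] + y[i])
    -- Storing into a uint16_t element keeps only the low 16 bits.
    r[i] = c
  end
end

-- Hypothetical test values: 0x0000FFFF + 0x00000001 = 0x00010000.
local x = ffi.new("uint16_t[2]", {0xFFFF, 0x0000})
local y = ffi.new("uint16_t[2]", {0x0001, 0x0000})
local r = ffi.new("uint16_t[2]")

rcadd(r, x, y, 2)

assert(r[0] == 0x0000)  -- low limb wrapped around
assert(r[1] == 0x0001)  -- carry landed in the next limb
```

No explicit cast is needed anywhere: tobit() keeps the running sum an integer, and the store into the uint16_t array does the narrowing.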