Re: FFI array performance

  • From: Simon Cooke <sjcfwd@xxxxxxxxx>
  • To: luajit@xxxxxxxxxxxxx
  • Date: Wed, 30 May 2012 13:19:14 -0400

On Fri, May 25, 2012 at 4:12 PM, Mike Pall <mike-1205@xxxxxxxxxx> wrote:
> Simon Cooke wrote:
>> I've been trying out the FFI library recently, and have tested the
>> variable-length array feature with mixed performance results. For
>> native types (e.g. float, double) it works very efficiently, but for
>> simple structs the performance drops dramatically, by ~ 50x.
>
> http://luajit.org/ext_ffi_semantics.html#status
>
>  [...]
>  The following operations are currently not compiled and may
>  exhibit suboptimal performance, especially when used in inner
>  loops:
>
>  * Array/struct copies and bulk initializations.
>  [...]
>

Thanks for the pointers. My actual use case is for arrays of
fixed-length vectors {x, y, z}. I managed to find a workaround for now
by adding metatables and performing the copy manually:

-----------------------------------------------------------------
local ffi = require("ffi")

ffi.cdef[[ typedef struct { float x; } boxed; ]]

local boxed = ffi.metatype( 'boxed', {
    __index = { copy_to = function(self,p) p.x = self.x end },
    __tostring = function(self) return '('..self.x..')' end,
    })

local array = ffi.metatype([[ struct { boxed p[?]; } ]], {
    __newindex  = function(self,i,v) v:copy_to(self.p+i) end,
    __index     = function(self,i)   return self.p[i] end,
    })

local function test(s,N,a,c)
    local t0 = os.clock()
    for i = 0,N-1 do a[i] = c end
    print(s..' : '..os.clock()-t0 ..'s   '..(os.clock()-t0)/N*1e9 ..' ns/element')
end

local N = 2^25
test('array(N)',N, array(N), boxed(10))
test('float[N]',N, ffi.new('float[?]',N), ffi.new('float',10) )
test('boxed[N]',N, ffi.new('boxed[?]',N), ffi.new('boxed',10) )
-----------------------------------------------------------------
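
As an aside, the same field-by-field copy could be extended to the {x, y, z} case mentioned earlier. The following is only a minimal sketch: the names vec3 and vec3_array are illustrative assumptions, not part of the benchmark, and this variant was not timed.

```lua
local ffi = require("ffi")

-- Hypothetical three-component vector; the name vec3 is illustrative only.
ffi.cdef[[ typedef struct { float x, y, z; } vec3; ]]

local vec3 = ffi.metatype('vec3', {
    __index = {
        -- Copy field by field, so only scalar loads and stores are involved
        -- instead of a whole-struct copy.
        copy_to = function(self, p)
            p.x, p.y, p.z = self.x, self.y, self.z
        end,
    },
    __tostring = function(self)
        return '('..self.x..', '..self.y..', '..self.z..')'
    end,
})

-- Wrapper around a variable-length array of vec3, mirroring 'array' above.
local vec3_array = ffi.metatype([[ struct { vec3 p[?]; } ]], {
    __newindex = function(self, i, v) v:copy_to(self.p + i) end,
    __index    = function(self, i)    return self.p[i] end,
})

-- Usage: assignment goes through __newindex and therefore through copy_to.
local a = vec3_array(4)
a[2] = vec3(1, 2, 3)
print(a[2])
```

Whether this variant is compiled as well as the single-field version would need to be checked with -jv.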

The first test uses the new array, which gives performance equal to
the native float array:

array(N) : 0.029s   0.86426734924316 ns/element
float[N] : 0.029s   0.86426734924316 ns/element
boxed[N] : 2.661s   79.303979873657 ns/element

However, I find that when I reorder the tests I get very different results:

float[N] : 0.029s   0.86426734924316 ns/element
boxed[N] : 2.596s   77.366828918457 ns/element
array(N) : 2.864s   85.353851318359 ns/element

Running with -jv, I get the following for the first ordering:

[TRACE   1 ffi_test3.lua:17 loop]
array(N) : 0.03s   0.89406967163086 ns/element
[TRACE   2 (1/0) ffi_test3.lua:17 loop]
float[N] : 0.03s   0.89406967163086 ns/element
[TRACE --- (2/0) ffi_test3.lua:17 -- NYI: unsupported C type conversion]
[TRACE --- (2/0) ffi_test3.lua:17 -- NYI: unsupported C type conversion]
[TRACE --- (2/0) ffi_test3.lua:17 -- NYI: unsupported C type conversion]
[TRACE --- (2/0) ffi_test3.lua:17 -- NYI: unsupported C type conversion]
[TRACE   3 (2/0) ffi_test3.lua:17 -- fallback to interpreter]
boxed[N] : 2.83s   84.340572357178 ns/element

This is as expected. For the second ordering, however, I get:

[TRACE   1 ffi_test3.lua:17 loop]
float[N] : 0.03s   0.89406967163086 ns/element
[TRACE --- (1/0) ffi_test3.lua:17 -- NYI: unsupported C type conversion]
[TRACE --- (1/0) ffi_test3.lua:17 -- NYI: unsupported C type conversion]
[TRACE --- (1/0) ffi_test3.lua:17 -- NYI: unsupported C type conversion]
[TRACE --- (1/0) ffi_test3.lua:17 -- NYI: unsupported C type conversion]
[TRACE   2 (1/0) ffi_test3.lua:17 -- fallback to interpreter]
boxed[N] : 2.667s   79.482793807983 ns/element
[TRACE   3 ffi_test3.lua:11 return]
array(N) : 2.699s   80.43646812439 ns/element

What could be causing the slower performance here?

Simon
