Thanks for a perfect explanation, Mike! I especially value the code
dumps. I'm writing this library for the Raspberry Pi and was somewhat
performance-conscious (I do hope LuaJIT works on that older ARM).
Mainly I just wanted to know the best practice for wrapping FFI calls
in this way.

On Wed, Jun 6, 2012 at 10:00 AM, Mike Pall <mike-1206@xxxxxxxxxx> wrote:
> Tim Caswell wrote:
> > But I want to wrap some functions that are hairy to deal with
> > (outargs, strings, structs -> tables, etc..) Is the best way to
> > accomplish this to wrap in a lua function and use the closure to the
> > gles object?
> >
> > local function glViewport(x, y, width, height)
> >   gles.glViewport(x, y, width, height)
> > end
>
> If the wrapper actually performs something (e.g. translating
> outargs into extra results), then the above pattern is the best
> way to do it.
>
> But if the wrapper function really does nothing, then it's kind of
> pointless. You might as well pass the C function pointer. This is
> trading off specialization to a Lua function and a namespace vs.
> an indirect C call.
>
> Here's a simple example:
>
> local ffi=require("ffi")
> ffi.cdef[[int getpid(void);]]
> local C = ffi.C
> local function wrap() return C.getpid() end
> local nowrap = C.getpid
> for i=1,100 do C.getpid() end -- Optimal, but not for your question.
> for i=1,100 do wrap() end
> for i=1,100 do nowrap() end
>
> Have a look at -jdump=m on x86:
>
> ---- TRACE 1 start a.lua:6
> ---- TRACE 1 mcode 73
> f7534fa9  mov dword [0xf76da2bc], 0x1
> f7534fb3  cvtsd2si edi, [edx+0x20]
> f7534fb8  cmp dword [edx+0xc], -0x0d
> f7534fbc  jnz 0xf752d008 ->0
> f7534fc2  cmp dword [edx+0x8], 0xf76e4460
> f7534fc9  jnz 0xf752d008 ->0
> f7534fcf  call 0xf75d0a70
> f7534fd4  add edi, +0x01
> f7534fd7  cmp edi, +0x64
> f7534fda  jg 0xf752d00c ->1
> ->LOOP:
> f7534fe0  call 0xf75d0a70      <-- direct call
> f7534fe5  add edi, +0x01
> f7534fe8  cmp edi, +0x64
> f7534feb  jle 0xf7534fe0 ->LOOP
> f7534fed  jmp 0xf752d014 ->3
> ---- TRACE 1 stop -> loop
>
> ---- TRACE 2 start a.lua:7
> ---- TRACE 2 mcode 96
> f7534f42  mov dword [0xf76da2bc], 0x2
> f7534f4c  cvtsd2si edi, [edx+0x20]
> f7534f51  cmp dword [edx+0x14], -0x09
> f7534f55  jnz 0xf752d008 ->0
> f7534f5b  cmp dword [edx+0x10], 0xf76e8338
> f7534f62  jnz 0xf752d008 ->0
> f7534f68  cmp dword [edx+0xc], -0x0d
> f7534f6c  jnz 0xf752d008 ->0
> f7534f72  cmp dword [edx+0x8], 0xf76e4460
> f7534f79  jnz 0xf752d008 ->0
> f7534f7f  call 0xf75d0a70
> f7534f84  add edi, +0x01
> f7534f87  cmp edi, +0x64
> f7534f8a  jg 0xf752d00c ->1
> ->LOOP:
> f7534f90  call 0xf75d0a70      <-- direct call
> f7534f95  add edi, +0x01
> f7534f98  cmp edi, +0x64
> f7534f9b  jle 0xf7534f90 ->LOOP
> f7534f9d  jmp 0xf752d014 ->3
> ---- TRACE 2 stop -> loop
>
> ---- TRACE 3 start a.lua:8
> ---- TRACE 3 mcode 71
> f7534ef8  mov dword [0xf76da2bc], 0x3
> f7534f02  cvtsd2si edi, [edx+0x20]
> f7534f07  cmp dword [edx+0x1c], -0x0b
> f7534f0b  jnz 0xf752d008 ->0
> f7534f11  mov ebp, [edx+0x18]
> f7534f14  cmp word [ebp+0x6], +0x5f
> f7534f19  jnz 0xf752d008 ->0
> f7534f1f  mov ebx, [ebp+0x8]
> f7534f22  call ebx
> f7534f24  add edi, +0x01
> f7534f27  cmp edi, +0x64
> f7534f2a  jg 0xf752d00c ->1
> ->LOOP:
> f7534f30  call ebx             <-- indirect call
> f7534f32  add edi, +0x01
> f7534f35  cmp edi, +0x64
> f7534f38  jle 0xf7534f30 ->LOOP
> f7534f3a  jmp 0xf752d014 ->3
> ---- TRACE 3 stop -> loop
>
> The first loop is definitely the best way, but it's not applicable
> to your question, because you don't want to return the namespace
> directly.
>
> The initial setup overhead in the wrapped case is higher, but the
> loop gets to use a direct C call. In the non-wrapped case you
> always get an indirect call inside the loop.
>
> All modern x86/x64 CPUs predict indirect calls like any other
> branch, so this doesn't matter much. But on other architectures
> you may get a costly pipeline stall for every indirect call.
>
> Not that you should worry too much about this. Except if you
> really, really care about the performance of that particular call.
>
> --Mike
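
For reference, here is a minimal sketch of the case Mike mentions where the wrapper does real work: turning a C out-argument into an extra Lua result. The choice of frexp() from the C math library is only an illustration and does not come from the thread. The sketch assumes LuaJIT on a platform where ffi.C can resolve frexp (the default namespace normally includes libm on POSIX systems). The local name frexp_wrapped is hypothetical.

```lua
local ffi = require("ffi")

-- Assumption: frexp() is used purely as a stand-in for a C function
-- with an out-argument; it is not part of the original discussion.
ffi.cdef[[
double frexp(double x, int *exp);
]]

local C = ffi.C

-- Wrapper in the style of the glViewport example: a Lua function that
-- closes over the namespace object (C) and calls through it. Here the
-- wrapper adds value by hiding the out-argument.
local function frexp_wrapped(x)
  local exp = ffi.new("int[1]")     -- one-element array acts as the int* out-arg
  local mantissa = C.frexp(x, exp)  -- call goes through the namespace
  return mantissa, exp[0]           -- out-arg becomes a second result
end

local m, e = frexp_wrapped(8)
print(m, e)  -- 8 == 0.5 * 2^4, so this prints 0.5 and 4
```

Because the call is written as C.frexp(...) inside the wrapper rather than through a saved function pointer, it follows the wrapped pattern from the traces above, not the nowrap one.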