Re: Cost of wrapping ffi functions

  • From: Tim Caswell <tim@xxxxxxxxxxxxxx>
  • To: luajit@xxxxxxxxxxxxx
  • Date: Wed, 6 Jun 2012 10:14:46 -0500

Thanks for a perfect explanation, Mike!  I especially value the code dumps.
 I'm writing this library for the Raspberry Pi and am somewhat performance
conscious (I do hope LuaJIT works on that older ARM).  Mainly I just
wanted to know the best practice for wrapping FFI functions in this way.

On Wed, Jun 6, 2012 at 10:00 AM, Mike Pall <mike-1206@xxxxxxxxxx> wrote:

> Tim Caswell wrote:
> > But I want to wrap some functions that are hairy to deal with
> > (outargs, strings, structs -> tables, etc..)  Is the best way to
> > accomplish this to wrap in a lua function and use the closure to the
> > gles object?
> >
> >   local function glViewport(x, y, width, height)
> >     gles.glViewport(x, y, width, height)
> >   end
>
> If the wrapper actually performs something (e.g. translating
> outargs into extra results), then the above pattern is the best
> way to do it.
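>
> A sketch of such a wrapper, turning an outarg into an extra
> result (glGetIntegerv and the gles namespace here are just
> illustrations from your GLES context):
>
>  local function glGetInteger(pname)
>    local out = ffi.new("int[1]")   -- outarg buffer
>    gles.glGetIntegerv(pname, out)
>    return out[0]                   -- returned instead of the outarg
>  end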
>
> But if the wrapper function really does nothing, then it's kind of
> pointless. You might as well pass the C function pointer. This is
> trading off specialization to a Lua function and a namespace vs.
> an indirect C call.
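>
> For example (again assuming the gles namespace from your code):
>
>  local glViewport = gles.glViewport  -- re-export the cdata function
>  glViewport(0, 0, 640, 480)          -- compiles to an indirect C call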
>
> Here's a simple example:
>
>  local ffi=require("ffi")
>  ffi.cdef[[int getpid(void);]]
>  local C = ffi.C
>  local function wrap() return C.getpid() end
>  local nowrap = C.getpid
>  for i=1,100 do C.getpid() end -- Optimal, but not for your question.
>  for i=1,100 do wrap() end
>  for i=1,100 do nowrap() end
>
> Have a look at -jdump=m on x86:
>
> ---- TRACE 1 start a.lua:6
> ---- TRACE 1 mcode 73
> f7534fa9  mov dword [0xf76da2bc], 0x1
> f7534fb3  cvtsd2si edi, [edx+0x20]
> f7534fb8  cmp dword [edx+0xc], -0x0d
> f7534fbc  jnz 0xf752d008        ->0
> f7534fc2  cmp dword [edx+0x8], 0xf76e4460
> f7534fc9  jnz 0xf752d008        ->0
> f7534fcf  call 0xf75d0a70
> f7534fd4  add edi, +0x01
> f7534fd7  cmp edi, +0x64
> f7534fda  jg 0xf752d00c ->1
> ->LOOP:
> f7534fe0  call 0xf75d0a70       <-- direct call
> f7534fe5  add edi, +0x01
> f7534fe8  cmp edi, +0x64
> f7534feb  jle 0xf7534fe0        ->LOOP
> f7534fed  jmp 0xf752d014        ->3
> ---- TRACE 1 stop -> loop
>
> ---- TRACE 2 start a.lua:7
> ---- TRACE 2 mcode 96
> f7534f42  mov dword [0xf76da2bc], 0x2
> f7534f4c  cvtsd2si edi, [edx+0x20]
> f7534f51  cmp dword [edx+0x14], -0x09
> f7534f55  jnz 0xf752d008        ->0
> f7534f5b  cmp dword [edx+0x10], 0xf76e8338
> f7534f62  jnz 0xf752d008        ->0
> f7534f68  cmp dword [edx+0xc], -0x0d
> f7534f6c  jnz 0xf752d008        ->0
> f7534f72  cmp dword [edx+0x8], 0xf76e4460
> f7534f79  jnz 0xf752d008        ->0
> f7534f7f  call 0xf75d0a70
> f7534f84  add edi, +0x01
> f7534f87  cmp edi, +0x64
> f7534f8a  jg 0xf752d00c ->1
> ->LOOP:
> f7534f90  call 0xf75d0a70       <-- direct call
> f7534f95  add edi, +0x01
> f7534f98  cmp edi, +0x64
> f7534f9b  jle 0xf7534f90        ->LOOP
> f7534f9d  jmp 0xf752d014        ->3
> ---- TRACE 2 stop -> loop
>
> ---- TRACE 3 start a.lua:8
> ---- TRACE 3 mcode 71
> f7534ef8  mov dword [0xf76da2bc], 0x3
> f7534f02  cvtsd2si edi, [edx+0x20]
> f7534f07  cmp dword [edx+0x1c], -0x0b
> f7534f0b  jnz 0xf752d008        ->0
> f7534f11  mov ebp, [edx+0x18]
> f7534f14  cmp word [ebp+0x6], +0x5f
> f7534f19  jnz 0xf752d008        ->0
> f7534f1f  mov ebx, [ebp+0x8]
> f7534f22  call ebx
> f7534f24  add edi, +0x01
> f7534f27  cmp edi, +0x64
> f7534f2a  jg 0xf752d00c ->1
> ->LOOP:
> f7534f30  call ebx              <-- indirect call
> f7534f32  add edi, +0x01
> f7534f35  cmp edi, +0x64
> f7534f38  jle 0xf7534f30        ->LOOP
> f7534f3a  jmp 0xf752d014        ->3
> ---- TRACE 3 stop -> loop
>
> The first loop is definitely the best way, but it's not applicable
> to your question, because you don't want to return the namespace
> directly.
>
> The initial setup overhead in the wrapped case is higher, but the
> loop gets to use a direct C call. In the non-wrapped case you
> always get an indirect call inside the loop.
>
> All modern x86/x64 CPUs predict indirect calls like any other
> branch, so this doesn't matter much. But on other architectures
> you may get a costly pipeline stall for every indirect call.
>
> Not that you should worry too much about this. Except if you
> really, really care about the performance of that particular call.
>
> --Mike
>
>
