Re: Performance of ffi.new v/s malloc() arrays

From: David Given <dg@xxxxxxxxxxx>
To: luajit@xxxxxxxxxxxxx
Date: Tue, 21 Feb 2017 00:06:02 +0100

On at least *some* OSX systems there's pathological behaviour when using
calloc() and certain sizes of allocations:

http://www.pybloggers.com/2016/12/debugging-your-operating-system-a-lesson-in-memory-allocation/

The article puts the problematic range between 127kB and 125MB, though,
and you seem to be allocating 400MB per array, so that's probably not the
problem here. It may still be worth investigating.
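
One cheap thing to check is whether the cost of touching the memory for the
first time ends up inside your timed loop. ffi.new() zero-fills the array
when it creates it, whereas calloc() may hand back pages that have not been
touched yet, so the two cases may not start from the same state. Below is a
rough sketch that runs the same copy loop twice over calloc'd arrays and
prints both times. The helper names callocDoubles and timeCopy are my own
and not from your code, and I am only guessing that first-touch cost is
relevant here.

```lua
local ffi = require "ffi"

ffi.cdef [[
  void* calloc(size_t nitems, size_t size);
  void free(void *ptr);
]]

-- Hypothetical helper: allocate n zeroed doubles with calloc().
-- The ffi.gc finalizer calls free() when the pointer is collected.
local function callocDoubles(n)
   local p = ffi.C.calloc(n, ffi.sizeof("double"))
   assert(p ~= nil, "calloc failed")
   return ffi.gc(ffi.cast("double*", p), ffi.C.free)
end

-- Hypothetical helper: time one full copy pass, in CPU seconds.
local function timeCopy(dst, src, n)
   local tStart = os.clock()
   for i = 0, n-1 do
      dst[i] = src[i]
   end
   return os.clock() - tStart
end

local nelem = 50000000  -- same element count as in the original example
local fieldInp = callocDoubles(nelem)
local fieldOut = callocDoubles(nelem)

-- First pass: includes whatever it costs to touch each page the first time.
local tFirst = timeCopy(fieldOut, fieldInp, nelem)
-- Second pass: the same loop over memory that has already been touched.
local tSecond = timeCopy(fieldOut, fieldInp, nelem)
print("first pass:", tFirst, "second pass:", tSecond)
```

If the second pass is much faster than the first, the gap you measured is
probably dominated by first-touch cost rather than by the copy loop itself.
If both passes are equally slow, something else is going on.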

On 20 February 2017 at 03:13, Ammar Hakim <a.hakim777@xxxxxxxxx> wrote:

Hi All,

I have been working on a computational physics code that mostly uses
LuaJIT, with a few pieces written in C. The code performs very well; in
fact, for the solution of some equations it is actually 3x faster than
the corresponding Fortran code (not written by me).

Anyway, I have found a strange issue on my Mac. Basically, we deal with
huge arrays and hence have to use malloc/calloc to manage the memory
ourselves. However, it seems that accessing fields allocated with
ffi.new() performs differently from accessing fields allocated with
malloc/calloc. I don't mean the allocator efficiency, which is not a big
deal as most fields we allocate live for the lifetime of the application.

I managed to boil the problem down to the example pasted below. If I run
it with the "useFFIAlloc" flag set to "true", the code runs about 3x
faster on my Mac! This is with LJ 2.1 beta2. The difference seems to be
smaller on a Linux box, but I have not tested on Linux extensively (I do
most of my dev work on a Mac). If anyone has any ideas, or can point to
something I am not doing properly, that would be great.

local ffi = require "ffi"
local os = require "os"

useFFIAlloc = false

ffi.cdef [[
  void* calloc(size_t nitems, size_t size);
  void free(void *ptr);
]]

nelem = 50000000

if useFFIAlloc then
   fieldInp = ffi.new("double [?]", nelem)
   fieldOut = ffi.new("double [?]", nelem)
else
   fieldInp = ffi.gc(ffi.cast("double*",
      ffi.C.calloc(nelem, ffi.sizeof("double"))), ffi.C.free)
   fieldOut = ffi.gc(ffi.cast("double*",
      ffi.C.calloc(nelem, ffi.sizeof("double"))), ffi.C.free)
end

local tStart = os.clock()
for i = 0, nelem-1 do
   fieldOut[i] = fieldInp[i]
end
local tEnd = os.clock()
print(tEnd - tStart)

--
┌─── ｄｇ＠ｃｏｗｌａｒｋ．ｃｏｍ ───── http://www.cowlark.com ─────
│ "I have always wished for my computer to be as easy to use as my
│ telephone; my wish has come true because I can no longer figure out
│ how to use my telephone." --- Bjarne Stroustrup

