On at least *some* OSX systems there's pathological behaviour when using
calloc() and certain sizes of allocations:
http://www.pybloggers.com/2016/12/debugging-your-operating-system-a-lesson-in-memory-allocation/
The article claims that the problematic range is between 127kB and 125MB,
though, and you seem to be allocating 400MB per array, so that's probably
not the problem here, but it may be worth investigating.
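If you want to check, one thing to try is to run the same copy loop twice over the calloc'd arrays and time each pass separately. The following is only a rough sketch, assuming LuaJIT and the same array size as in your example; callocDoubles and timeCopy are names I made up for illustration. If the second pass turns out much faster than the first, that would suggest the cost is in the first touch of the freshly allocated memory rather than in the loop itself.

```lua
-- Sketch only: requires LuaJIT (uses the ffi library).
local ffi = require "ffi"

ffi.cdef [[
  void *calloc(size_t nitems, size_t size);
  void free(void *ptr);
]]

-- Hypothetical helper: allocate n zeroed doubles with calloc,
-- released through ffi.C.free when the cdata is collected.
local function callocDoubles(n)
  local p = ffi.C.calloc(n, ffi.sizeof("double"))
  assert(p ~= nil, "calloc failed")
  return ffi.gc(ffi.cast("double*", p), ffi.C.free)
end

-- Hypothetical helper: time one element-by-element copy of n doubles.
local function timeCopy(dst, src, n)
  local t0 = os.clock()
  for i = 0, n - 1 do
    dst[i] = src[i]
  end
  return os.clock() - t0
end

local nelem = 50000000  -- same size as in the original example
print(string.format("bytes per array: %d", nelem * ffi.sizeof("double")))

local fieldInp = callocDoubles(nelem)
local fieldOut = callocDoubles(nelem)

-- Same loop, run twice; only the first pass touches untouched memory.
print("first pass: ", timeCopy(fieldOut, fieldInp, nelem))
print("second pass:", timeCopy(fieldOut, fieldInp, nelem))
```

Swapping callocDoubles for ffi.new("double [?]", nelem) and comparing the two pairs of timings should show whether the gap is confined to the first pass.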
On 20 February 2017 at 03:13, Ammar Hakim <a.hakim777@xxxxxxxxx> wrote:
Hi All,
I have been working on a computational physics code that mostly uses
LuaJIT, with a few pieces written in C. The code performs very well; in
fact, for the solution of some equations it is actually 3x faster than the
corresponding Fortran code (not written by me).
Anyway, I have found a strange issue on my Mac. Basically, we deal with
huge arrays and hence have to use malloc/calloc to manage the memory
ourselves. However, it seems that fields allocated with ffi.new() perform
differently from fields allocated with malloc/calloc. I don't mean the
allocator efficiency, which is not a big deal as most fields we allocate
live for the lifetime of the application.
I managed to boil the problem down to the example pasted below. If I run
it with the "useFFIAlloc" flag set to "true" the code runs about 3x faster
on my Mac! This is with LJ 2.1 beta2. The difference seems to be much
smaller on a Linux box, but I have not tested on Linux extensively (I do
most of my dev work on a Mac). If anyone has any ideas, or can point to
something I am not doing properly, that would be great.
local ffi = require "ffi"
local os = require "os"
-- set to true to allocate with ffi.new() instead of calloc()
useFFIAlloc = false
ffi.cdef [[
  void* calloc(size_t nitems, size_t size);
  void free(void *ptr);
]]
nelem = 50000000
if useFFIAlloc then
  fieldInp = ffi.new("double [?]", nelem)
  fieldOut = ffi.new("double [?]", nelem)
else
  -- calloc'd memory is released by the GC via ffi.C.free
  fieldInp = ffi.gc(ffi.cast("double*", ffi.C.calloc(nelem, ffi.sizeof("double"))), ffi.C.free)
  fieldOut = ffi.gc(ffi.cast("double*", ffi.C.calloc(nelem, ffi.sizeof("double"))), ffi.C.free)
end
-- time a straight element-by-element copy
local tStart = os.clock()
for i = 0, nelem-1 do
  fieldOut[i] = fieldInp[i]
end
local tEnd = os.clock()
print(tEnd - tStart)