Re: Array performance with 2.0.0-beta10 versus git HEAD

  • From: Peter Colberg <peter@xxxxxxxxxxx>
  • To: luajit@xxxxxxxxxxxxx
  • Date: Tue, 28 Aug 2012 18:26:17 -0400

On Tue, Aug 28, 2012 at 11:37:37PM +0200, Mike Pall wrote:
> Peter Colberg wrote:
> > Further the array can be passed directly to
> > C functions, and the malloc'ed pointer is kept alive as an upvalue
> > of the function passed to ffi.gc().
> 
> The finalizer function receives the object to be finalized as a
> parameter (i.e. 'array'). So simply run ffi.C.free(array.data).
> That avoids creating a new closure for every allocation.

Thanks! Of course, there is no need to keep "data" alive separately.
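
For the archives, this is roughly what the allocation looks like now
(the vec3 layout is only an example, and the constructor name is my
own):

local ffi = require("ffi")

ffi.cdef[[
typedef struct { double x, y, z; } vec3;
typedef struct vec3_array { vec3 *data; int size; } vec3_array;
void *malloc(size_t size);
void free(void *ptr);
]]

-- Defined once at module level: the finalizer receives the array
-- itself, so no closure is created per allocation.
local function free_vec3_array(array)
  ffi.C.free(array.data)
end

local function new_vec3_array(size)
  local array = ffi.new("vec3_array")
  array.data = ffi.cast("vec3 *", ffi.C.malloc(size * ffi.sizeof("vec3")))
  array.size = size
  return ffi.gc(array, free_vec3_array)
end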

> > The plan is to implement core algorithms of a molecular simulation
> > using OpenCL (on the GPU) or C (on the host) code, therefore the
> > convenience of bound checking and 1-based indexing in the LuaJIT
> > part should outweigh the performance penalty.
> 
> 1-based indexing as a feature? How quaint!

Well, I do not want to appear quaint…

I will use 0-based indexing then. I was worried about possible
confusion between 0-based FFI arrays and 1-based Lua tables, but
since the program will contain C and OpenCL modules, both with
0-based indexing, mixing 1-based arrays into those would cause far
worse confusion. Furthermore, the output is stored in HDF5 and
analysed with numpy scripts, which again use 0-based indexing.
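
For completeness, bound checking with 0-based indexing can then be
attached to the example type above via ffi.metatype; an untested
sketch:

ffi.metatype("vec3_array", {
  -- __index/__newindex are only consulted for keys that do not
  -- resolve to struct members, i.e. for numeric element access
  __index = function(array, i)
    if i < 0 or i >= array.size then
      error("index out of bounds", 2)
    end
    return array.data[i]
  end,
  __newindex = function(array, i, value)
    if i < 0 or i >= array.size then
      error("index out of bounds", 2)
    end
    array.data[i] = value
  end,
})

Unchecked access in hot loops remains available via array.data[i].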

> 
> > typedef struct vec3_array { vec3 *data; size_t size; } vec3_array;
> 
> Avoid size_t, unless you need it for interfacing to existing code.
> Better use int.

Is the reason for preferring int over an unsigned type that a
possible overflow (wraparound) of an unsigned loop variable could
hinder the compiler in optimising the loop?

> Also, you could use a VLS to inline the allocation, that avoids
> all of that ffi.gc() stuff:
> 
> typedef struct vec3_array { int size; vec3 data[?]; } vec3_array;
> ...
> local array = vec3_array(size, size)

But then the program would be subject to the 2GB limit, no?

In a molecular dynamics simulation, each array is much smaller than
2GB, since larger systems would need to be split across multiple
processors anyway. However, to compute correlation functions over
multiple orders of magnitude in time, many copies (~100) of the
arrays at different points in time are kept in memory, which in
total would exceed the 2GB limit.
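
For comparison, a runnable version of the VLS variant, renamed here
to avoid clashing with the definition above; no finalizer is needed,
since the whole object lives inline on the GC-managed heap, which is
exactly why it counts towards that limit:

ffi.cdef[[
typedef struct vec3_array_vls { int size; vec3 data[?]; } vec3_array_vls;
]]

local function new_vec3_array_vls(size)
  -- the first argument gives the number of VLA elements,
  -- the second initializes the size member
  return ffi.new("vec3_array_vls", size, size)
end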

Thanks again for all the hints.

Peter
