Re: Brainstorming on views / ranges

  • From: Stefano <phd.st.p@xxxxxxxxx>
  • To: luajit@xxxxxxxxxxxxx
  • Date: Sun, 6 Sep 2015 16:06:20 +0100

On 3 September 2015 at 22:27, Laurent Deniau <Laurent.Deniau@xxxxxxx> wrote:

...
My experience is that sub-matrices are more important than strides and
more efficient too. Strides add a computational step that is not so cheap
as they prevents vectorization (if supported), and most of the time they
are not used.


I don't understand, if you have to compute the trace of a matrix you have
to compute the trace of a matrix. And the diagonal view is a strided array.
Similarly, sometimes you have to operate on matrices now only row-by-row by
also column-by-column. Again, one of the two is necessarily a strided view.


On the other hand, sub-matrix view is very interesting in mathematics and
physics as it represents subspace and block wise decomposition (many
algorithms use divide and conquer strategy). For sub-matrices it is almost
free, add another column-like field that is the number of columns of the
view matrix, i.e. how much you have to add to your index to move by one row
in memory. I have written many variant of such framework for linear algebra
in C, C++, Haskell, my language, etc, and strides were always a mistake
while sub-matrix views are just essential. It is even easier to come with
an efficient implementation with LuaJIT.


Are you thinking of generic sub-matrix views of 2 dimensional matrices, or
just sub-matrices that still correspond to contiguous memory?
I am keen on keeping the simplicity and efficiency related to being able to
do element-wise arithmetic operations on matrices in a single for loop.
All specialised algorithms (from matrix multiplication downward) are done
in BLAS and LAPACK anyway.


Think of sorting part of a vector or computing the row-wise maximum of a
matrix.
As of now, I cannot do this without resorting to convoluted code as:

local subx = x:copy(5, 15) -- A newly allocated copy of the desired
range.
table.sort(subx) -- Pure Lua algorithm at some point.
for i=5,15 do
x[i] = subx[i - 5 + 1]
end

When what I want is:
table.sort(x:sub(5, 15)) -- Intention is clear.

* Things that I have tried so far:


You should try accessors, i.e. sub returns a small table with matrix like
behaviour as a view of your matrix with proper fields set (mentioned
above). LuaJIT optimises very well such small objects and I get impressive
performance with this approach (thanks to the sink optimisation), even for
very short-lived accessors.

+ implementing views via FFI cdata structures: without a
reference-counting __gc it is unsafe and with it I get more than a x10
performance penalty (not due to the GC, I get "error thrown or hook
called" -jv output)

You pointed the problems.

+ implementing views via tables: it is safe but the indexing is not
fast (a non-empty hash slows the indexing but proxying via weak tables
hits the GC hard, and storing object-specific data in meta-tables is
unadvised)

Do you have some sample code? It works pretty well for me, even for some
more complex objects like for spreadsheet-like tables with row accessors
while being columns oriented, that is rows don't exists in memory and need
to be emulated.


Yes,here is a simplified micro-benchmark:
http://pastebin.com/AHBmcVBk

This is a use case I actually have: the differential evolution algorithms
has particles stored as rows in a population matrix because most of the
time you work particle-wise.
But the stopping criteria is dimension-wise, so you have to iterate the
columns.

I report the results (1x baseline is the double for loop):
row (table-based row-view): 10x slower
row,ffi (FFI-based view, *not safe*): 4x slower
row,ffi,gc (FFI-based view, safe): 25x slower
row,copy (not a view, newly allocated vector): 7x slower

Depending on the dimensions (here it's a 25-elements matrix), at some point
the table view will be competitive agains the FFI copy.

The table view suffers from __len not being respected though (unless I
build LuaJIT with partial 5.2 compatibility, but this introduces other
issues).

Template-unrolling the sum reduce all cases (aside from gc) to the same
faster performance.
But it is not a good idea to do so: while it works in this micro-benchmark
this approach is too heavy for real-world numerical simulations (tested
already for expressions).
When I discussed this with Mike in the pass, he explained that the problem
is that cross-trace optimisations are limited (and the first trace starts
with the inner sum loop).

In practice, to have FFI views efficiently and safely implemented 2
features are needed:
+ additional cross-trace optimisations
+ efficient handling of __gc finalisers for FFI cdata

+ pass additional arguments to functions as in "sum(x, first, last)",
"sum(x, range)": the problem (aside from the explosion in the number
of arguments and resulting confusion) is that when dealing with
vector-like objects, Lua people think something like:

local function sum(x)
local v = 0
for i=1,#x do v = v + x[i] end
return v
end

Not flexible, error prone. Lua as significant functional abilities so it's
better to write algorithms like foreach, map, map2, foldl, foldr and use
closures. Then most of your sum-like functions will be one-liners. Works
very well with sub-matrices.


I am not keen on solving the efficiency issues this way: people who do not
have a computer science background do not think in functional language
terms.
While I can always implement the features I offer this way, I also have to
offer a way so that users can code their algorithms both efficiently and in
a way they find intuitive.

Optimisation, inference, Monte Carlo and simulation algorithms are always
described (in books and papers) in terms of iterations and of vector/matrix
operations, and that's how people from engineering, statistics,
econometrics, time series analysis think (me included).

There is a reason why MATLAB is popular with the scientific community, why
Julia stresses the facts that functional/vectorised programming is not
required to get good performance and for why people hate R (OK, there is a
lot to dislike here) that forces them to do so if they don't want to wait
two days for the simulation to finish.

Libraries should map closely to the problem domain, not the other way
around.

As an aside, closures are not compiled yet (indeed I am not implementing
vectorised random number generations yet).

Bests,
Stefano

Other related posts: