Re: Brainstorming on views / ranges

  • From: Laurent Deniau <Laurent.Deniau@xxxxxxx>
  • To: "<luajit@xxxxxxxxxxxxx>" <luajit@xxxxxxxxxxxxx>
  • Date: Mon, 7 Sep 2015 18:38:09 +0000


On Sep 6, 2015, at 5:06 PM, Stefano <phd.st.p@xxxxxxxxx> wrote:



On 3 September 2015 at 22:27, Laurent Deniau <Laurent.Deniau@xxxxxxx> wrote:
...
My experience is that sub-matrices are more important than strides, and more
efficient too. Strides add a computational step that is not so cheap, as it
prevents vectorization (if supported), and most of the time strides are not used.

I don't understand: if you have to compute the trace of a matrix, you have to
compute the trace of a matrix. And the diagonal view is a strided array.

-- matrix:subview(row0,col0,nrow,ncol[,lda])

function trace(A)
  return A:subview(1,1, A:rows(),1, A:cols()+1)
          :foldl(0, function(a,b) return a+b end)
end

is as fast as the trivial loop, thanks to the good job done by LuaJIT, because
subview and foldl are very small functions.

Similarly, sometimes you have to operate on matrices not only row-by-row but
also column-by-column. Again, one of the two is necessarily a strided view.

On the other hand, sub-matrix views are very interesting in mathematics and
physics as they represent subspaces and block-wise decomposition (many algorithms
use a divide-and-conquer strategy). For sub-matrices it is almost free: add
another column-like field that is the number of columns of the viewed matrix,
i.e. how much you have to add to your index to move by one row in memory. I
have written many variants of such a framework for linear algebra in C, C++,
Haskell, my own language, etc., and strides were always a mistake while sub-matrix
views are just essential. It is even easier to come up with an efficient
implementation with LuaJIT.
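
Concretely, here is a minimal sketch of such a view (the field and function
names are mine, just for illustration):

local ffi = require("ffi")

-- A view shares the parent's storage; it only records an offset plus the
-- leading dimension lda (the column count of the underlying storage),
-- i.e. how much to add to the index to move down by one row.
local function matrix(nrow, ncol)
  return { data = ffi.new("double[?]", nrow*ncol),
           nrow = nrow, ncol = ncol, lda = ncol, off = 0 }
end

local function subview(A, row0, col0, nrow, ncol)
  return { data = A.data, lda = A.lda, nrow = nrow, ncol = ncol,
           off  = A.off + (row0-1)*A.lda + (col0-1) }
end

local function get(A, i, j)  -- element (i,j), 1-based
  return A.data[A.off + (i-1)*A.lda + (j-1)]
end

Creating the view costs a couple of additions and one multiplication, and
element access uses exactly the same index computation as for the parent matrix.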

Are you thinking of generic sub-matrix views of 2-dimensional matrices, or just
sub-matrices that still correspond to contiguous memory?
I am keen on keeping the simplicity and efficiency that come from being able to
do element-wise arithmetic operations on matrices in a single for loop.

Except for the multiplication, map and map2 do the job, no loop needed.
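
For example, element-wise addition becomes a one-liner (a sketch assuming a
map2 that applies its function pairwise and writes into the optional result r_):

function M.add (x, y, r_) return x:map2(y, function(a,b) return a+b end, r_) end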

All specialised algorithms (from matrix multiplication downward) are done in
BLAS and LAPACK anyway.

They are slower for small matrices (< 50x50). Intel is developing a specialised
library for these small-matrix operations, but the idea has been around for 10
years (since SSE2 became widely available). I mentioned the technique in a paper
in 2012 as a side result of speeding up particle tracking, thinking that it was
not important. Now Intel is/will be advertising this extension of the MKL.

Think of sorting part of a vector or computing the row-wise maximum of a
matrix.
As of now, I cannot do this without resorting to convoluted code such as:

local subx = x:copy(5, 15) -- A newly allocated copy of the desired range.
table.sort(subx)           -- Pure Lua algorithm at some point.
for i=5,15 do
  x[i] = subx[i - 5 + 1]
end

When what I want is:
table.sort(x:sub(5, 15)) -- Intention is clear.

* Things that I have tried so far:


You should try accessors, i.e. sub returns a small table with matrix-like
behaviour as a view of your matrix, with the proper fields set (mentioned above).
LuaJIT optimises such small objects very well and I get impressive performance
with this approach (thanks to the sink optimisation), even for very short-lived
accessors.
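
Something like this minimal sketch (the names are mine, only for illustration):

-- sub returns a small table that forwards indexing into the parent with an
-- offset; the parent's own indexing does the rest.
local view_mt = {
  __index    = function(v, i)    return v.x[v.off + i] end,
  __newindex = function(v, i, e) v.x[v.off + i] = e    end,
}

local function sub(x, first, last)
  -- n is kept for length queries
  return setmetatable({ x = x, off = first - 1, n = last - first + 1 },
                      view_mt)
end

When the view is created and consumed inside the same trace, the sink
optimisation removes the table allocation altogether, which is why even
short-lived accessors stay cheap.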

+ implementing views via FFI cdata structures: without a
reference-counting __gc it is unsafe, and with it I get more than a 10x
performance penalty (not due to the GC; I get "error thrown or hook
called" in the -jv output)

You pointed out the problems.

+ implementing views via tables: it is safe but the indexing is not
fast (a non-empty hash part slows down indexing, proxying via weak tables
hits the GC hard, and storing object-specific data in metatables is
inadvisable)

Do you have some sample code? It works pretty well for me, even for more
complex objects like spreadsheet-like tables with row accessors while being
column-oriented, that is, rows don't exist in memory and need to be emulated.

Yes, here is a simplified micro-benchmark:
http://pastebin.com/AHBmcVBk

This is a use case I actually have: the differential evolution algorithm has
particles stored as rows in a population matrix, because most of the time you
work particle-wise.
But the stopping criterion is dimension-wise, so you have to iterate over the columns.

I report the results (1x baseline is the double for loop):
row (table-based row-view): 10x slower
row,ffi (FFI-based view, *not safe*): 4x slower
row,ffi,gc (FFI-based view, safe): 25x slower
row,copy (not a view, newly allocated vector): 7x slower

Depending on the dimensions (here it's a 25-element matrix), at some point the
table view becomes competitive against the FFI copy.

The table view suffers from __len not being respected though (unless I build
LuaJIT with partial 5.2 compatibility, but this introduces other issues).

Template-unrolling the sum reduces all cases (aside from gc) to the same,
faster performance.
But it is not a good idea to do so: while it works in this micro-benchmark, this
approach is too heavy for real-world numerical simulations (already tested for
expressions).
When I discussed this with Mike in the past, he explained that the problem is
that cross-trace optimisations are limited (and the first trace starts with the
inner sum loop).

In practice, to have FFI views implemented efficiently and safely, two features
are needed:
+ additional cross-trace optimisations
+ efficient handling of __gc finalisers for FFI data

What about a table holding a reference to the matrix (cdata) and the view
(cdata)? The GC will collect the table and the view (or sink them) while leaving
the matrix alive if it is still in use.
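
For example (a sketch; the ctype and field names are just illustrative):

local ffi = require("ffi")
local view_ct = ffi.typeof("struct { double *data; int n; }")

-- The table keeps the owning matrix reachable, so the raw pointer inside
-- the cdata view cannot dangle; no __gc finaliser is needed.
local function sub(x, first, last)
  return { mat  = x,  -- anchors the owner for the GC
           view = view_ct(x.data + (first-1), last - first + 1) }
end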

+ pass additional arguments to functions, as in "sum(x, first, last)"
or "sum(x, range)": the problem (aside from the explosion in the number
of arguments and the resulting confusion) is that when dealing with
vector-like objects, Lua people think of something like:

local function sum(x)
  local v = 0
  for i=1,#x do v = v + x[i] end
  return v
end

Not flexible, and error-prone. Lua has significant functional abilities, so it's
better to write algorithms like foreach, map, map2, foldl, foldr and use
closures. Then most of your sum-like functions will be one-liners. This works
very well with sub-matrices.
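
For instance (one-liner sketches, using the same foldl as in the trace example
above):

function M.sum (x) return x:foldl(0, function(a,b) return a+b end) end
function M.max (x) return x:foldl(-math.huge, math.max) end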

I am not keen on solving the efficiency issues this way: people who do not have
a computer science background do not think in functional language terms.

You can use it for your implementation; the user is free to use it or not… Here
are some examples from my small module:

function M.fill (x, e ) return x:map (function() return e end, x) end
function M.copy (x, r_) return x:map (ident , r_) end
function M.real (x, r_) return x:map (real , r_) end
function M.imag (x, r_) return x:map (imag , r_) end
function M.conj (x, r_) return x:map (conj , r_) end
function M.unm (x, r_) return x:map (unm , r_) end
function M.abs (x, r_) return x:map (abs , r_) end
function M.arg (x, r_) return x:map (arg , r_) end
function M.exp (x, r_) return x:map (exp , r_) end
function M.log (x, r_) return x:map (log , r_) end
function M.sqrt (x, r_) return x:map (sqrt , r_) end
function M.proj (x, r_) return x:map (proj , r_) end
function M.sin (x, r_) return x:map (sin , r_) end
function M.cos (x, r_) return x:map (cos , r_) end
function M.tan (x, r_) return x:map (tan , r_) end
function M.sinh (x, r_) return x:map (sinh , r_) end
function M.cosh (x, r_) return x:map (cosh , r_) end
function M.tanh (x, r_) return x:map (tanh , r_) end
function M.asin (x, r_) return x:map (asin , r_) end
function M.acos (x, r_) return x:map (acos , r_) end
function M.atan (x, r_) return x:map (atan , r_) end
function M.asinh (x, r_) return x:map (asinh , r_) end
function M.acosh (x, r_) return x:map (acosh , r_) end
function M.atanh (x, r_) return x:map (atanh , r_) end
function M.mod (x, y, r_) return x:maps(y, mod, r_) end
function M.pow (x, y, r_) return x:maps(y, pow, r_) end

While I can always implement the features I offer this way, I also have to
offer a way for users to code their algorithms both efficiently and in a way
they find intuitive.

Optimisation, inference, Monte Carlo and simulation algorithms are always
described (in books and papers) in terms of iterations and of vector/matrix
operations, and that's how people from engineering, statistics, econometrics,
time series analysis think (me included).

Fine, then duplicate your code ad infinitum...

There is a reason why MATLAB is popular with the scientific community

Not so popular...

, why Julia stresses the fact that functional/vectorised programming is not
required to get good performance, and why people hate R (OK, there is a lot
to dislike there), which forces them to do so if they don't want to wait two days
for the simulation to finish.

Libraries should map closely to the problem domain, not the other way around.

I think we are not talking about the same thing. LuaJIT does a pretty good job
with functional programming. I don't say that everything should be programmed
in a functional way, but it has its advantages. And fold/map is a general
computational model (fold can emulate map and vice versa).
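
For instance, here is a sketch of map written in terms of foldl:

local function map_via_fold(x, f)
  -- the accumulator is the result table, built element by element
  return x:foldl({}, function(r, e) r[#r+1] = f(e) ; return r end)
end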


As an aside, closures are not compiled yet (indeed, I am not implementing
vectorised random number generation yet).

The closure used by M.fill above is compiled by LuaJIT (and the others too), or
at least that is what it reports, and the measured speed is compatible with it.
As long as you do not return them, closures are efficient.

Best,
Laurent.
