Adam Strzelecki wrote:
> Unfortunately it seems to bring absolutely no boost for
> multiplication of 4x4 matrices (struct of 16 elements), are we
> running short on registers here?

It was running short of PHIs -- I've bumped the limit now (it was
probably too low, anyway).

> However newer GCC like 4.7 or Intel C++ C 12.x have really
> decent auto parallelisation that can make your app run 3-5x
> faster than scalar code depending on emitted instruction set -
> SSE vs AVX.

I guess you mean auto-vectorization: automatically turning scalar
code into SIMD code, i.e. without manual insertion of builtins.
Parallelization is something very different: running code in
parallel on different CPU cores.

However, auto-vectorization is a very, very hard problem. A couple
of years ago, even the best compilers could only auto-vectorize
simple textbook examples, but failed on anything remotely
interesting. This has changed a bit, of course. But it would take
an extraordinary amount of research and lots of code to become
competitive in this area.

> If we want to match recent C++ compilers in this matter we would
> need to bring parallelisation to LuaJIT as well, at least for 2
> or more assignments in the row that come from same expression,
> i.e.:
>
>   __mul = function(a, b)
>     return vec4(a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w)
>   end

The logical first step is to add SIMD builtins and hand-vectorize
the code. The SIMD builtins etc. are one of the future extensions
mentioned in the roadmap.

--Mike