2009/6/16 Christian Packmann <Christian.Packmann@xxxxxx>:
> First, the C code is not mine, but Stephan's. :-)

Oh. Makes sense, given the subject of this thread... Stupid me :P

> And damn good code at that, considering how little I can speed it up.

Yay for Stippy :)

> Compiling on a 64-bit platform would indeed be interesting. But I think I'll
> rather release the source than do all that myself. :-) As there's interest
> in Linux variants I'll try to adapt the Benchy environment to Linux soon,
> doing a 64-bit compile from there should be easy.

Please, do. :)

> I think that x64 should give nice speedups, as the C code has too many
> variables to be held in x86 registers. However auto-vectorization on the
> current code shouldn't work, the code is not properly laid out for that. And
> I think (hope) that my hand-written assembly should still beat any
> auto-vectorized code,

For the next 4 years of open-source compiler tech you could bet on that.
Unless Apple has even more interesting stuff for clang/LLVM under wraps.
(And I suspect they do. And I suspect we're doing the Amiga all over,
except that the GPU is the array of DSPs now :))

> unless you use very aggressive unrolling - but this
> would raise other problems, because highly unrolled code takes more code
> space. Not a problem for a single routine or specialized apps, but for a
> small-footprint OS like Haiku, this will have to be considered once more
> routines are optimized.

The other option is a virtual machine with the same set of opcodes as
an x86, but with a saner encoding into --hm, let's call them that, though
not exactly, since they're more akin to machine language proper--
bytecodes. This would run on a tracing JIT that would do the unrolling
work for hot paths itself. Should be *interesting* to do with LLVM :D

(Yeah, I'm kind of longing for consumer-level IA64. Or other VLIW/EPIC
architectures with performance as a target, instead of Transmeta's
goal of low power consumption. All the buzz about tracing compilers for
JavaScript, and this tech dates back to the late 70's!)

> SSE3 was about floating-point operations and also some new MOV*
> instructions, but I don't have a need for the latter (so far).

Hmm.

> What is interesting about SSSE3 is the addition of the PSHUFB operation,
> which allows byte-granular shuffling of values across a 128-bit register.
> Still no AltiVec permute operation, but close. :-) Use of this instruction
> can eliminate quite a few intermediate operations required for data
> unpacking/distribution. But it only seems to be useful when the hardware
> implementation is fast, as can be seen in the results from early and late
> Core2s; only the latter have a real runtime advantage of code using this
> instruction.

No free lunch, I guess :)

Cheers,
A.

-- 
One last piece of advice: "ice".
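
For anyone following along who hasn't met PSHUFB: below is a rough scalar
model in plain C of what the 128-bit form does, plus one unpacking use of the
kind Christian mentions. This is only a sketch to show the semantics. The
names pshufb_model and widen_first4 are made up for illustration and are not
from Stephan's or Christian's code; real code would use the instruction
itself (for example via the _mm_shuffle_epi8 intrinsic from <tmmintrin.h>).

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of 128-bit PSHUFB (hypothetical helper, for illustration).
 * For each output byte i:
 *   - if bit 7 of ctrl[i] is set, the result byte is 0;
 *   - otherwise the result byte is src[ctrl[i] & 0x0F].
 * A temporary is used so that dst may alias src. */
static void pshufb_model(uint8_t dst[16], const uint8_t src[16],
                         const uint8_t ctrl[16])
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; i++)
        tmp[i] = (ctrl[i] & 0x80) ? 0 : src[ctrl[i] & 0x0F];
    memcpy(dst, tmp, 16);
}

/* Example use: zero-extend the first four source bytes into four
 * little-endian 32-bit lanes with a single shuffle. Without a byte
 * shuffle this typically takes two unpack steps against a zero register. */
static void widen_first4(uint8_t dst[16], const uint8_t src[16])
{
    static const uint8_t ctrl[16] = {
        0, 0x80, 0x80, 0x80,   /* lane 0: src[0], then three zero bytes */
        1, 0x80, 0x80, 0x80,   /* lane 1: src[1] */
        2, 0x80, 0x80, 0x80,   /* lane 2: src[2] */
        3, 0x80, 0x80, 0x80    /* lane 3: src[3] */
    };
    pshufb_model(dst, src, ctrl);
}
```

The control vector is just a constant table, so one shuffle can stand in for
a chain of unpack/shift/mask steps. Whether that is a win depends on how fast
the hardware executes the shuffle, which fits the early vs. late Core2
results quoted above.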