[haiku-development] Re: Optimizing Painter::_DrawBitmapBilinearCopy32

  • From: André Braga <meianoite@xxxxxxxxx>
  • To: haiku-development@xxxxxxxxxxxxx
  • Date: Tue, 16 Jun 2009 12:24:13 -0300

2009/6/16 Christian Packmann <Christian.Packmann@xxxxxx>:
> First, the C code is not mine, but Stephan's. :-)

Oh. Makes sense, given the subject of this thread... Stupid me :P

> And damn good code at that, considering how little I can speed it up.

Yay for Stippy :)

> Compiling on a 64-bit platform would indeed be interesting. But I think I'll
> rather release the source than do all that myself. :-) As there's interest
> in Linux variants I'll try to adapt the Benchy environment to Linux soon,
> doing a 64-bit compile from there should be easy.

Please, do. :)

> I think that x64 should give nice speedups, as the C code has too many
> variables to be held in x86 registers. However auto-vectorization on the
> current code shouldn't work, the code is not properly laid out for that. And
> I think (hope) that my hand-written assembly should still beat any
> auto-vectorized code,

For the next 4 years of open-source compiler tech you could bet on
that. Unless Apple has even more interesting stuff for clang/LLVM
under wraps.
(And I suspect they do. And I suspect we're doing the Amiga all over,
except that the GPU is the array of DSPs now :))

> unless you use very aggressive unrolling - but this
> would raise other problems, because highly unrolled code takes more code
> space. Not a problem for a single routine or specialized apps, but for a
> small-footprint OS like Haiku, this will have to be considered once more
> routines are optimized.

The other option is a virtual machine with the same set of opcodes
as x86, but with a saner encoding into bytecodes (hm, let's call them
that, though not exactly, since they're more akin to machine language
proper). This would run on a tracing JIT that would do the unrolling
work for hot paths itself.

Should be *interesting* to do with LLVM :D
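
To make the unrolling trade-off above concrete, here is a minimal
sketch in C. The function names (sum_rolled, sum_unrolled4) are made up
for illustration and have nothing to do with the actual Painter code;
the point is only that the unrolled variant needs noticeably more
instructions for the same result.

```c
#include <stddef.h>
#include <stdint.h>

/* Straightforward loop: smallest code, one add per iteration. */
uint32_t sum_rolled(const uint8_t *p, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    return sum;
}

/* Hypothetical 4x unrolled variant: four independent accumulators give
 * the CPU more work per loop iteration, at the price of a larger loop
 * body plus a tail loop for the leftover elements. */
uint32_t sum_unrolled4(const uint8_t *p, size_t n)
{
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += p[i];
        s1 += p[i + 1];
        s2 += p[i + 2];
        s3 += p[i + 3];
    }
    /* Tail: handle the remaining 0..3 elements. */
    for (; i < n; i++)
        s0 += p[i];

    return s0 + s1 + s2 + s3;
}
```

Both functions return the same sum; whether the unrolled one is
actually faster depends on the compiler and the CPU, and a tracing JIT
as described above would only pay the code-size cost for paths that
turn out to be hot.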

(Yeah, I'm kind of longing for consumer-level IA64. Or other
VLIW/EPIC architectures with performance as a target, instead of
Transmeta's goal of low power consumption. All the buzz about tracing
compilers for JavaScript, and this tech dates back to the late '70s!)

> SSE3 was about floating-point operations and also some new MOV*
> instructions, but I don't have a need for the latter (so far).

Hmm.

> What is interesting about SSSE3 is the addition of the PSHUFB operation,
> which allows byte-granular shuffling of values across a 128-bit register.
> Still no AltiVec permute operation, but close. :-) Use of this instruction
> can eliminate quite a few intermediate operations required for data
> unpacking/distribution. But it only seems to be useful when the hardware
> implementation is fast, as can be seen in the results from early and late
> Core2s; only the latter have a real runtime-advantage of code using this
> instruction.

No free lunch, I guess :)
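
For anyone following along who hasn't met PSHUFB: below is a small
scalar model of what the instruction does, written in plain C so it
compiles anywhere. It is a sketch of the documented semantics, not
Christian's assembly, and the helper names (pshufb_model,
unpack_first_pixel) are made up for illustration. The second function
shows the kind of unpacking it can replace: spreading the four bytes
of one 32-bit pixel into four zero-extended 32-bit lanes in a single
step.

```c
#include <stdint.h>

/* Scalar model of the SSSE3 PSHUFB byte shuffle on a 128-bit register.
 * For each output byte i: if bit 7 of mask[i] is set, the result is 0;
 * otherwise the result is src[mask[i] & 0x0F]. */
void pshufb_model(uint8_t dst[16], const uint8_t src[16],
                  const uint8_t mask[16])
{
    uint8_t tmp[16]; /* temporary, so dst may alias src */

    for (int i = 0; i < 16; i++)
        tmp[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
    for (int i = 0; i < 16; i++)
        dst[i] = tmp[i];
}

/* Hypothetical helper: take the first pixel (bytes 0..3 of src) and
 * place each channel in the low byte of its own 32-bit lane, with the
 * upper three bytes of every lane zeroed. */
void unpack_first_pixel(uint8_t dst[16], const uint8_t src[16])
{
    static const uint8_t mask[16] = {
        0x00, 0x80, 0x80, 0x80,   /* lane 0 <- byte 0 */
        0x01, 0x80, 0x80, 0x80,   /* lane 1 <- byte 1 */
        0x02, 0x80, 0x80, 0x80,   /* lane 2 <- byte 2 */
        0x03, 0x80, 0x80, 0x80    /* lane 3 <- byte 3 */
    };
    pshufb_model(dst, src, mask);
}
```

On real hardware the whole shuffle is one instruction (the
_mm_shuffle_epi8 intrinsic), which is why its latency on a given CPU
decides whether it is a win over a chain of unpack operations.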


Cheers,
A.



-- 
One last piece of advice: "ice".
