[haiku-development] Re: Optimizing Painter::_DrawBitmapBilinearCopy32

  • From: Christian Packmann <Christian.Packmann@xxxxxx>
  • To: haiku-development@xxxxxxxxxxxxx
  • Date: Tue, 16 Jun 2009 18:32:37 +0200

André Braga - 2009-06-16 17:24:
> 2009/6/16 Christian Packmann <Christian.Packmann@xxxxxx>:
>> Compiling on a 64-bit platform would indeed be interesting. But I think I'll
>> rather release the source than do all that myself. :-) As there's interest
>> in Linux variants I'll try to adapt the Benchy environment to Linux soon,
>> doing a 64-bit compile from there should be easy.
>
> Please, do. :)

I'll look into that after integration of the MMX/SSE code into app_server.

>> I think that x64 should give nice speedups, as the C code has too many
>> variables to be held in x86 registers. However, auto-vectorization of the
>> current code shouldn't work, as the code is not properly laid out for it. And
>> I think (hope) that my hand-written assembly should still beat any
>> auto-vectorized code,
>
> For the next 4 years of open-source compiler tech you could bet on
> that. Unless Apple has even more interesting stuff for clang/LLVM
> under wraps.
> (And I suspect they do.)

Hm, it doesn't strike me as that interesting for high-performance code. More a general solution for platform-independent JIT-code (but a very elegant one at that). The instruction set has too many omissions though, and at least on x86 it wouldn't be easy to do proper compilation for efficient SSE3+ code, especially when doing JIT.

OpenCL is the solution for high-performance computing, and I'm looking forward to it being implemented widely. Using graphics cards as very wide SIMD units is really the most efficient way of achieving high performance at acceptable power consumption for the current hardware. And as OpenCL code can run on anything from graphics cards to normal CPUs to Larrabee, OpenCL code will likely be the best solution for doing high-performance code, as you can run it nearly anywhere.
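To make the "very wide SIMD" point concrete, a kernel along these lines is how such code tends to look in OpenCL C (the kernel name and buffer layout here are invented for illustration, not taken from any real driver): each work-item processes one pixel, and the runtime decides how to spread the work-items across GPU SIMD lanes or CPU cores.

```c
// Illustrative OpenCL C kernel fragment -- names and layout are made up.
// One work-item per pixel; the same source runs on GPUs, CPUs or Larrabee,
// with the runtime doing the mapping onto the hardware's vector units.
__kernel void scale_pixels(__global const uchar4 *src,
                           __global uchar4 *dst,
                           const uint factor)   /* 8.8 fixed-point scale */
{
    size_t i = get_global_id(0);
    uint4 p = convert_uint4(src[i]);            /* widen to avoid overflow */
    dst[i] = convert_uchar4((p * factor) >> 8); /* scale all 4 channels */
}
```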

> And I suspect we're doing the Amiga all over,
> except that the GPU is the array of DSPs now :))

Hehe, too true. I'm actually hoping that either Sony or Microsoft will use Larrabee for their future consoles; this would be the closest to the Amiga architecture since the Hombre project died with Commodore. Doing all calculations, graphics and sound processing in a unified memory space should open up interesting perspectives for programmers. I probably would have to buy such a machine just to look at the demos. :-)

>> unless you use very aggressive unrolling - but this
>> would raise other problems, because highly unrolled code takes more code
>> space. Not a problem for a single routine or specialized apps, but for a
>> small-footprint OS like Haiku, this will have to be considered once more
>> routines are optimized.

> The other option is a virtual machine with the same set of opcodes
> as x86, but with saner encoding to -- hm, let's call it like that,
> but not exactly, since it's more akin to machine language proper --
> bytecodes. This would run on a tracing JIT that would do the unrolling
> work for hot paths itself.
>
> Should be *interesting* to do with LLVM :D

As stated, it doesn't match the x86 SSE instructions too well and thus would lose performance compared to native code. And doing JIT for performance-critical code is a nice idea, but so far it has never seemed to work out too well. :-) It all depends on compiler technology, which always seems to lag a bit behind hardware abilities.

> (Yeah, I'm kind of longing for consumer-level IA64. Or other
> VLIW/EPIC architectures with performance as a target, instead of
> Transmeta's goals for power consumption. All the buzz is about tracing
> compilers for Javascript, and this tech dates back to the late 70's!)

Hm, I'm not sure that VLIW would actually be that useful. It just puts all responsibility for good performance on the compiler; as such it can only work on the desktop if you use JIT compilation, as static compilation will prevent any drastic changes to your CPU architecture. And on the desktop, you want the ability to distribute binaries without having to recompile on every machine. Okay, let's ignore Linux. ;-)

And VLIW fared well from neither a performance nor an efficiency POV. Itanium never beat the other architectures decisively on all scores (on some benchmarks yes, but not universally), and Transmeta's CPUs turned out to be no better from a power/performance perspective than Intel's CPUs once Intel started to optimize for power.

It seems that once you optimize a CPU for performance, your microarchitecture has to implement certain hardware characteristics, and these seem to be ISA-agnostic. While some architectures like x86 have to pay a price for their obscure history, it doesn't really matter at the high end; there are some CPUs which are faster than x86 on some workloads, but the total system cost is usually so high as to be prohibitive for general use. And as x86 development costs can be shared across mobile, desktop and server CPUs, it just has the most R&D money available which puts it at an advantage over other CPU architectures.

Using "dedicated" hardware like GPUs for high-width vector processing is the better solution here IMO. For vectorizable algorithms they can deliver 2-4x the MIPS/Watt of CPUs. And as most really compute-intensive workloads happen to be vector-friendly, that is a basically perfect solution. Now we just need an OpenCL port and drivers for Haiku. :-)

Christian
