After being sidetracked by several other things, I'm currently working on the SIMD optimizations for BilinearCopy again.
Benchmark binaries for Haiku/GCC2 and Windows/Cygwin can be downloaded here: http://www.elenthara.de/Haiku/Benchmarks/AppserverBilinCopyBench_v1.0.zip (The Windows version *requires* a basic Cygwin environment to be installed; I'll have to set up a MinGW environment to get rid of the Cygwin dependency, but don't have time for this right now.)
The benchmark should automatically detect the SIMD instruction sets supported by the installed CPU and skip any routines using instructions the CPU doesn't support, but I've only tested the code on my Core2 system so far; if the program crashes, please let me know on what CPU that happened so I can fix the problem. The code only supports SIMD capability detection for AMD/Intel/VIA so far; Transmeta CPUs should work, but MMX/SSE will not be detected on them, so only the C integer benchmarks will be run.
Results on my system:

-------------------------------------------------------------
Benchmark: Haiku app_server bilinear copy
Compile date: Jun 14 2009 14:01:24
GCC version: 2.95.3-haiku-081024
CPU vendor ID: GenuineIntel
CPU: Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
SIMD instructions: MMX SSE SSE-Integer SSE2 SSE3 SSSE3 SSE4.1
Can't lock process to CPU on this platform.
Estimated CPUID/RDTSC overhead: 229 clock cycles.
10 runs per benchmark.

-- Results --
     Minimum  Average  Maximum
# 1:  358080   385616   629213 - 'C, original'
# 2:  334680   334895   335147 - 'C, precise'
# 3:  349384   349902   351399 - 'C, precise DIV'
# 4:  186227   186317   186584 - 'MMX/SSE'
# 5:  176231   176319   176435 - 'MMX/SSE optim-test'
# 6:  178441   178488   178611 - 'SSE2'
# 7:  155890   155958   156086 - 'SSSE3'
-------------------------------------------------------------

Notes: The C versions are identical except for the final conversion: the original uses

  d[0] = t0 >> 16;

the "precise" variant

  d[0] = (t0*129) >> 23;

and the "precise DIV" variant

  d[0] = t0 / 65025;

I don't know why the precise routines are faster than the original code; this may be due to anomalies in the GCC2 optimizer, or the loop entry points may happen to fall on a 16-byte boundary for the precise versions - which can't be controlled in GCC, AFAIK. In contrast, the performance behavior with Cygwin/GCC4 is pretty much as expected:
# 1:  323680   325732   342269 - 'C, original'
# 2:  363962   364290   365679 - 'C, precise'
# 3:  393218   394882   399729 - 'C, precise DIV'

Note that GCC2 is actually faster than GCC4 for the precise variants, and that the GCC2<->GCC4 performance difference is rather small - a big divergence from the factor of ~2 observed in the earlier benchmarks.
The MMX/SSE routine needs one 64-bit integer instruction from SSE, so it isn't pure MMX. However, the SSE integer instructions were already present in the first Athlons, before full SSE was introduced by AMD with the Athlon XP, so this routine will work on all AMD/Intel CPUs produced after 1999 or so.
Performance is okay, but not breathtaking. The SSE2/SSSE3 routines are terrible; even though they use fewer instructions than the MMX/SSE code, they are only marginally faster, if at all. I haven't yet determined what the problem is; I think it's either the CPU's decode bandwidth (as the SSE instructions have a high byte count) or the instruction dependencies. If it's the latter, loop unrolling may give an estimated 50% speedup; this is the next thing I'll test.
I could use a few volunteers now to run the benchmark on various systems and post/mail the results. This would help me decide which routines should be aggressively optimized. I'd be especially interested in the following systems (but others would be welcome as well):
* Intel Atom
* Intel Core2 65nm (recognizable by the lack of SSE4.1 support)
* Intel Pentium 4
* Intel Core/Pentium M
* AMD K10 - Phenom/Shanghai
* AMD K8 - Athlon64/Sempron
* AMD K7 - Athlon(XP)/Duron

Christian