[haiku-development] Re: Optimizing Painter::_DrawBitmapBilinearCopy32

  • From: Christian Packmann <Christian.Packmann@xxxxxx>
  • To: haiku-development@xxxxxxxxxxxxx
  • Date: Sun, 14 Jun 2009 16:16:50 +0200

After being sidetracked on several other things, I'm currently working on the SIMD optims for BilinearCopy again.


Benchmark binaries for Haiku/GCC2 and Windows/Cygwin can be downloaded here:
http://www.elenthara.de/Haiku/Benchmarks/AppserverBilinCopyBench_v1.0.zip
(The Windows version *requires* a basic Cygwin environment to be installed; I'll have to set up a MinGW environment to get rid of the Cygwin dependency, but don't have time for this right now.)

The benchmark should automatically detect the SIMD instruction sets supported by the installed CPU and skip any routines that use unsupported instructions. I've only tested the code on my Core2 system so far, so if the program crashes, please let me know on which CPU that happened so I can fix the problem. SIMD capability detection currently only covers AMD/Intel/VIA CPUs. Transmeta CPUs should work, but MMX/SSE will not be detected on them, so only the C integer benchmarks will be run.
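To illustrate the kind of detection described above, here is a minimal sketch using the `<cpuid.h>` helper that ships with newer GCC and Clang. It is not the benchmark's actual code (GCC 2.95 has no such header, so the real program must use its own inline assembly); the `SIMD_*` flag names and `detect_simd()` are hypothetical, and the sketch covers only the standard feature leaf, not the AMD-specific extended leaf that reports the SSE integer subset.

```c
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>   /* GCC/Clang helper header; assumption: not usable with GCC 2.95 */
#endif

/* Hypothetical flag names, for illustration only. */
enum {
    SIMD_MMX   = 1 << 0,
    SIMD_SSE   = 1 << 1,
    SIMD_SSE2  = 1 << 2,
    SIMD_SSE3  = 1 << 3,
    SIMD_SSSE3 = 1 << 4,
    SIMD_SSE41 = 1 << 5
};

/* Query CPUID leaf 1 and translate the feature bits into SIMD_* flags.
 * Returns 0 on non-x86 targets or if leaf 1 is unavailable. */
unsigned detect_simd(void)
{
    unsigned flags = 0;
#if defined(__i386__) || defined(__x86_64__)
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    if (edx & (1u << 23)) flags |= SIMD_MMX;    /* EDX bit 23: MMX    */
    if (edx & (1u << 25)) flags |= SIMD_SSE;    /* EDX bit 25: SSE    */
    if (edx & (1u << 26)) flags |= SIMD_SSE2;   /* EDX bit 26: SSE2   */
    if (ecx & (1u << 0))  flags |= SIMD_SSE3;   /* ECX bit 0:  SSE3   */
    if (ecx & (1u << 9))  flags |= SIMD_SSSE3;  /* ECX bit 9:  SSSE3  */
    if (ecx & (1u << 19)) flags |= SIMD_SSE41;  /* ECX bit 19: SSE4.1 */
#endif
    return flags;
}
```

A benchmark driver could then skip a routine whenever its required flag is missing from the returned mask, e.g. run the SSSE3 routine only if `detect_simd() & SIMD_SSSE3` is non-zero.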


Results on my system:
-------------------------------------------------------------
Benchmark: Haiku app_server bilinear copy
Compile date: Jun 14 2009 14:01:24
GCC version: 2.95.3-haiku-081024

CPU vendor ID: GenuineIntel
CPU: Intel(R) Core(TM)2 Quad CPU    Q9550  @ 2.83GHz
  SIMD instructions: MMX SSE SSE-Integer SSE2 SSE3 SSSE3 SSE4.1

Can't lock process to CPU on this platform.
Estimated CPUID/RDTSC overhead: 229 clock cycles.
10 runs per benchmark.

                    --  Results  --

       Minimum    Average    Maximum
# 1:    358080     385616     629213  - 'C, original'
# 2:    334680     334895     335147  - 'C, precise'
# 3:    349384     349902     351399  - 'C, precise DIV'
# 4:    186227     186317     186584  - 'MMX/SSE'
# 5:    176231     176319     176435  - 'MMX/SSE optim-test'
# 6:    178441     178488     178611  - 'SSE2'
# 7:    155890     155958     156086  - 'SSSE3'
-------------------------------------------------------------
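For readers wondering how the minimum/average/maximum columns might be produced, the following is a rough sketch of such a harness, not the benchmark's real source. It assumes a plain RDTSC read via `__rdtsc()` from `<x86intrin.h>` (newer GCC/Clang only) and omits the serializing CPUID that the reported overhead figure suggests the real program uses; `read_cycles`, `summarize`, `run_benchmark` and `bench_stats` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>              /* __rdtsc(); assumption: newer GCC/Clang */
static uint64_t read_cycles(void) { return __rdtsc(); }
#else
#include <time.h>
/* Fallback so the sketch still builds elsewhere; counts clock() ticks. */
static uint64_t read_cycles(void) { return (uint64_t)clock(); }
#endif

typedef struct {
    uint64_t min, avg, max;
} bench_stats;

/* Reduce raw per-run counts to minimum / average / maximum. */
bench_stats summarize(const uint64_t *samples, int n)
{
    assert(n > 0);
    bench_stats s = { samples[0], 0, samples[0] };
    uint64_t sum = 0;
    for (int i = 0; i < n; i++) {
        if (samples[i] < s.min) s.min = samples[i];
        if (samples[i] > s.max) s.max = samples[i];
        sum += samples[i];
    }
    s.avg = sum / (uint64_t)n;
    return s;
}

/* Time fn() `runs` times and subtract a fixed measurement overhead
 * (clamped at zero) from every sample before summarizing. */
bench_stats run_benchmark(void (*fn)(void), int runs, uint64_t overhead)
{
    uint64_t samples[64];
    assert(runs > 0 && runs <= 64);
    for (int i = 0; i < runs; i++) {
        uint64_t start = read_cycles();
        fn();
        uint64_t elapsed = read_cycles() - start;
        samples[i] = elapsed > overhead ? elapsed - overhead : 0;
    }
    return summarize(samples, runs);
}
```

Reporting the minimum alongside the average is useful because the minimum is the run least disturbed by interrupts and task switches, which matters when the process cannot be locked to one CPU.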

Notes:

The C versions are identical except for the final conversion; the original uses
    d[0] = t0 >> 16;
the "precise" variant
    d[0] = (t0*129) >> 23;
and the "precise DIV"
    d[0] = t0 / 65025;
I don't know why the precise routines are faster than the original code; this may be due either to anomalies in the GCC2 optimizer or to the loop entries happening to fall on a 16-byte boundary for the precise versions, which can't be controlled in GCC AFAIK. In contrast, the performance behavior for Cygwin/GCC4 is pretty much as expected:
# 1:    323680     325732     342269  - 'C, original'
# 2:    363962     364290     365679  - 'C, precise'
# 3:    393218     394882     399729  - 'C, precise DIV'
Note that GCC2 is actually faster than GCC4 for the precise variants, and that the performance difference GCC2<->4 is rather small, which is a big divergence from the factor ~2 observed with the earlier benchmarks.
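The three final conversions from the notes above can also be compared in isolation. The sketch below wraps each one in a small function; the function names are hypothetical, and it assumes (this is not stated in the post) that t0 is a channel value in 0..255 multiplied by two weight pairs that each sum to 255, so that t0 ranges from 0 to 255*255*255 and fits a 32-bit multiply by 129 without overflow.

```c
#include <stdint.h>

/* Original: shifts by 16, i.e. divides by 65536 instead of 255*255 = 65025,
 * so results come out slightly too small. */
uint8_t convert_original(uint32_t t0)
{
    return (uint8_t)(t0 >> 16);
}

/* "precise": 129 / 2^23 is about 1/65027.97, a close approximation
 * of 1/65025 that still needs only a multiply and a shift. */
uint8_t convert_precise(uint32_t t0)
{
    return (uint8_t)((t0 * 129u) >> 23);
}

/* "precise DIV": exact integer division by 255*255. */
uint8_t convert_precise_div(uint32_t t0)
{
    return (uint8_t)(t0 / 65025u);
}
```

Under the stated assumption, a fully saturated input (t0 = 255*255*255 = 16581375) yields 253 with the original shift, 254 with the multiply-and-shift variant, and 255 with the division.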

The MMX/SSE routine needs one 64-bit integer instruction from SSE, so it isn't pure MMX. However, the SSE integer instructions were present in the first Athlons before full SSE was introduced by AMD with the Athlon XP. So this routine will work on all AMD/Intel CPUs produced after 1999 or so.
Performance is okay, but not breathtaking.

The SSE2/SSSE3 routines are terrible; even though they have fewer instructions than the MMX/SSE code, they are either not faster or only marginally faster. I haven't yet determined what the problem is; I think there's either a problem with the CPU's decode bandwidth (as the SSE instructions have a high byte count) or with the instruction dependencies. If it's the latter, loop unrolling may give an estimated 50% speedup; this is the next thing I'll test.
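As a rough picture of what unrolling means here, the sketch below shows a plain C loop and a version unrolled by two, where each iteration handles two values through separate temporaries so neither computation has to wait for the other. This only illustrates the shape of the transformation: the function names are hypothetical, it uses the scalar "precise" conversion rather than the SSE code, and in scalar C the compiler and an out-of-order CPU may already overlap the iterations on their own.

```c
#include <stddef.h>
#include <stdint.h>

/* Straightforward loop: one value per iteration. */
void convert_row(const uint32_t *t, uint8_t *d, size_t n)
{
    for (size_t i = 0; i < n; i++)
        d[i] = (uint8_t)((t[i] * 129u) >> 23);
}

/* Unrolled by two: a and b are independent, so their multiplies and
 * shifts can be interleaved instead of forming one long chain. */
void convert_row_unrolled2(const uint32_t *t, uint8_t *d, size_t n)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        uint32_t a = t[i] * 129u;
        uint32_t b = t[i + 1] * 129u;
        d[i]     = (uint8_t)(a >> 23);
        d[i + 1] = (uint8_t)(b >> 23);
    }
    if (i < n)                       /* odd element left over */
        d[i] = (uint8_t)((t[i] * 129u) >> 23);
}
```

Both functions produce identical output; the unrolled one simply trades code size for more independent work per iteration.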


I could use a few volunteers now to run the benchmark on various systems and post/mail the results. This would help me decide which routines should be aggressively optimized. I'd be especially interested in the following systems (but other systems would be welcome as well):
* Intel Atom
* Intel Core2 65nm (can be recognized by lack of SSE4.1 support)
* Intel Pentium 4
* Intel Core/Pentium M
* AMD K10 - Phenom/Shanghai
* AMD K8 - Athlon64/Sempron
* AMD K7 - Athlon(XP)/Duron


Christian
