After being sidetracked by several other things, I'm currently working on the SIMD optimizations for BilinearCopy again.
Benchmark binaries for Haiku/GCC2 and Windows/Cygwin can be downloaded here: http://www.elenthara.de/Haiku/Benchmarks/AppserverBilinCopyBench_v1.0.zip (The Windows version *requires* a basic Cygwin environment to be installed; I'll have to set up a MinGW environment to get rid of the Cygwin dependency, but don't have time for this right now.)
The benchmark should automatically detect the SIMD instruction sets supported by the installed CPU and skip any routines using instructions the CPU doesn't support, but I've only tested the code on my Core2 system so far; if the program crashes, please let me know on what CPU that happened so I can fix the problem. The code only supports SIMD capability detection for AMD/Intel/VIA so far; Transmeta CPUs should work, but MMX/SSE will not be detected on them, so only the C integer benchmarks will be run.
Results on my system:

-------------------------------------------------------------
Benchmark: Haiku app_server bilinear copy
Compile date: Jun 14 2009 14:01:24
GCC version: 2.95.3-haiku-081024
CPU vendor ID: GenuineIntel
CPU: Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
SIMD instructions: MMX SSE SSE-Integer SSE2 SSE3 SSSE3 SSE4.1
Can't lock process to CPU on this platform.
Estimated CPUID/RDTSC overhead: 229 clock cycles.
10 runs per benchmark.

-- Results --
     Minimum  Average  Maximum
# 1:  358080   385616   629213 - 'C, original'
# 2:  334680   334895   335147 - 'C, precise'
# 3:  349384   349902   351399 - 'C, precise DIV'
# 4:  186227   186317   186584 - 'MMX/SSE'
# 5:  176231   176319   176435 - 'MMX/SSE optim-test'
# 6:  178441   178488   178611 - 'SSE2'
# 7:  155890   155958   156086 - 'SSSE3'
-------------------------------------------------------------

Notes: The C versions are identical except for the final conversion: the original uses

  d[0] = t0 >> 16;

the "precise" variant

  d[0] = (t0*129) >> 23;

and the "precise DIV" variant

  d[0] = t0 / 65025;

I don't know why the precise routines are faster than the original code; this may be due to anomalies in the GCC2 optimizer, or the loop entry points may happen to fall on a 16-byte boundary for the precise versions - which can't be controlled in GCC, AFAIK. In contrast, the performance behavior with Cygwin/GCC4 is pretty much as expected:
# 1:  323680   325732   342269 - 'C, original'
# 2:  363962   364290   365679 - 'C, precise'
# 3:  393218   394882   399729 - 'C, precise DIV'

Note that GCC2 is actually faster than GCC4 for the precise variants, and that the GCC2<->GCC4 performance difference is rather small - a big divergence from the factor of ~2 observed in the earlier benchmarks.
The MMX/SSE routine needs one 64-bit integer instruction from SSE, so it isn't pure MMX. However, the SSE integer instructions were already present in the first Athlons, before full SSE was introduced by AMD with the Athlon XP, so this routine will work on all AMD/Intel CPUs produced after 1999 or so.
Performance is okay, but not breathtaking. The SSE2/SSSE3 routines are terrible; even though they use fewer instructions than the MMX/SSE code, they are only marginally faster, if at all. I haven't yet determined what the problem is; I think it's either the CPU's decode bandwidth (as the SSE instructions have a high byte count) or the instruction dependencies. If it's the latter, loop unrolling may give an estimated 50% speedup; this is the next thing I'll test.
I could use a few volunteers now to run the benchmark on various systems and post/mail the results. This would help me decide which routines should be aggressively optimized. I'd be especially interested in the following systems (but others would be welcome as well):
* Intel Atom
* Intel Core2 65nm (recognizable by the lack of SSE4.1 support)
* Intel Pentium 4
* Intel Core/Pentium M
* AMD K10 - Phenom/Shanghai
* AMD K8 - Athlon64/Sempron
* AMD K7 - Athlon(XP)/Duron

Christian