2009/6/14 Christian Packmann <Christian.Packmann@xxxxxx>:
> After looking through the results, I think that a SSE2 codepath is not
> interesting, as the MMX/SSE code always performs better. A SSSE3 version may
> make sense for modern Core2 and i7/Nehalem systems, and maybe AMDs Bulldozer
> (due 2011). So I'll try to optimize the MMX/SSE routine further, as it is
> the most useful for general use.

My CPU liked the SSE2 code :)
I don't know if this makes a difference for this code, but this is an
AM2 board, not AM2+ (which is the native type for this CPU).

Benchmark: Haiku app_server bilinear copy
Compile date: Jun 14 2009 14:38:02
GCC version: 2.95.3-haiku-081024
CPU vendor ID: AuthenticAMD
CPU: AMD Phenom(tm) 9950 Quad-Core Processor
SIMD instructions: MMX SSE SSE-Integer SSE2 SSE3 MOVU

Can't lock process to CPU on this platform.
Estimated CPUID/RDTSC overhead: 122 clock cycles.
10 runs per benchmark.

-- Results --
        Minimum  Average  Maximum
# 1:     429197   439534   507568 - 'C, original'
# 2:     440918   440982   441223 - 'C, precise'
# 3:     449571   453421   474110 - 'C, precise DIV'
# 4:     198232   200319   218137 - 'MMX/SSE'
# 5:     196354   199110   217796 - 'MMX/SSE optim-test'
# 6:     178408   180968   203687 - 'SSE2'
Skipped 'SSSE3', insufficient SIMD support

sysinfo
Kernel name: kernel_x86 built on: Jun 8 2009 01:28:05 version 0x1
4 AMD Phenom, revision 40f23 running at 2611MHz (ID: 0x00000000 0x00000000)

CPU #0: "AMD Phenom(tm) 9950 Quad-Core Processor"
        Type 0, family 16, model 2, stepping 3, features 0x178bfbff
                FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CFLUSH MMX FXSTR SSE SSE2 HTT
        Extended Intel: 0x00802009
                SSE3 MONITOR CMPXCHG16B
        Extended AMD: type 0, family 16, model 2, stepping 3, features 0xefd3fbff
                SCE NX AMD-MMX FFXSTR RDTSCP 64 3DNow+ 3DNow!
        Power Management Features: TS TTP TM STC

        Inst TLB: 2M/4M-byte pages, 16 entries, fully associative
        Data TLB: 2M/4M-byte pages, 48 entries, fully associative
        Inst TLB: 4K-byte pages, 32 entries, fully associative
        Data TLB: 4K-byte pages, 48 entries, fully associative
        L1 inst cache: 64 KB, 2-way set associative, 1 lines/tag, 64 bytes/line
        L1 data cache: 64 KB, 2-way set associative, 1 lines/tag, 64 bytes/line
        L2 cache: 512 KB, 16-way set associative, 1 lines/tag, 64 bytes/line

(same for CPUs #1, 2 and 3)

2018222080 bytes free (used/max 128081920 / 2146304000)
                      (cached 56213504)
129516 semaphores free (used/max 1556 / 131072)
3967 ports free (used/max 129 / 4096)
3965 threads free (used/max 131 / 4096)
2031 teams free (used/max 17 / 2048)

--
One last piece of advice: "ice".