Stephan Assmus wrote:
> In terms of speed, I have neither noticed nor could I measure any
> difference. I had one particular project, where I suspected a GCC4 build
> would improve speed, but it didn't.

Hm, that's interesting. This is contrary to my findings when comparing code between GCC2 on Haiku and GCC4 on Cygwin, at least when testing code which does heavy calculations. I've measured differences of up to 2x for integer calculations and 3x for floating point. My measurements are done using the CPU's Time Stamp Counter; precision is about +-100 CPU clock cycles. I use GCC 2.95.3 on Haiku and GCC 4.3.0 in Cygwin. Benchmarks in Haiku are of course run natively, not within a VM. The codebase is identical; Jamfile and code are written to support both platforms.

I've tested the core of an inner loop from ShowImage's ScaleBilinear method, which looks like this in the float version:

    volatile uint8_t a[4], b[4], c[4], d[4], destData[4];
    float a0, a1, alpha0, alpha1;

    destData[0] = static_cast<uint8_t>(
        ((int32_t)a[0] * a0 + (int32_t)b[0] * a1) * alpha0 +
        ((int32_t)c[0] * a0 + (int32_t)d[0] * a1) * alpha1);
    destData[1] = static_cast<uint8_t>(
        ((int32_t)a[1] * a0 + (int32_t)b[1] * a1) * alpha0 +
        ((int32_t)c[1] * a0 + (int32_t)d[1] * a1) * alpha1);
    destData[2] = static_cast<uint8_t>(
        ((int32_t)a[2] * a0 + (int32_t)b[2] * a1) * alpha0 +
        ((int32_t)c[2] * a0 + (int32_t)d[2] * a1) * alpha1);
    destData[3] = static_cast<uint8_t>(
        ((int32_t)a[3] * a0 + (int32_t)b[3] * a1) * alpha0 +
        ((int32_t)c[3] * a0 + (int32_t)d[3] * a1) * alpha1);

Source for the original code is in trunk/src/apps/showimage/Filter.cpp. This is just a first test version that does not walk through memory, as I only want to measure raw performance; real code would perform differently due to cache/RAM latencies.

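For readers who want to reproduce this kind of measurement, here is a minimal sketch of a TSC-based timing loop around the float kernel above. It is not the benchmark's actual source: the names read_ticks, TimingStats, measure and float_kernel, the pixel values and the weights are all made up for illustration. It assumes a recent GCC or Clang that provides __rdtsc() in <x86intrin.h>; GCC 2.95 would presumably need inline assembly instead. On non-x86 targets it falls back to std::chrono, so the unit is then nanoseconds rather than clock cycles.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>  // __rdtsc(); assumption: recent GCC or Clang
#endif

// Read a tick counter: the Time Stamp Counter on x86,
// otherwise a steady clock in nanoseconds (fallback, not cycles).
static inline uint64_t read_ticks()
{
#if defined(__i386__) || defined(__x86_64__)
    return __rdtsc();
#else
    return static_cast<uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now().time_since_epoch()).count());
#endif
}

struct TimingStats {
    uint64_t minimum;
    uint64_t average;
    uint64_t maximum;
};

// Hypothetical stand-in for the inner loop: the float blend for all
// four channels, repeated `iterations` times. The volatile arrays keep
// the compiler from removing the loads and stores.
static void float_kernel(int iterations)
{
    volatile uint8_t a[4] = {10, 20, 30, 40}, b[4] = {50, 60, 70, 80};
    volatile uint8_t c[4] = {90, 100, 110, 120}, d[4] = {130, 140, 150, 160};
    volatile uint8_t destData[4];
    float a0 = 0.75f, a1 = 0.25f, alpha0 = 0.5f, alpha1 = 0.5f;  // made-up weights

    for (int i = 0; i < iterations; ++i) {
        for (int ch = 0; ch < 4; ++ch) {
            destData[ch] = static_cast<uint8_t>(
                ((int32_t)a[ch] * a0 + (int32_t)b[ch] * a1) * alpha0 +
                ((int32_t)c[ch] * a0 + (int32_t)d[ch] * a1) * alpha1);
        }
    }
}

// Run `kernel` `runs` times and collect min/average/max elapsed ticks.
template <typename Kernel>
TimingStats measure(Kernel kernel, int runs)
{
    assert(runs > 0);
    TimingStats s = {std::numeric_limits<uint64_t>::max(), 0, 0};
    uint64_t total = 0;
    for (int r = 0; r < runs; ++r) {
        const uint64_t start = read_ticks();
        kernel();
        const uint64_t elapsed = read_ticks() - start;
        if (elapsed < s.minimum) s.minimum = elapsed;
        if (elapsed > s.maximum) s.maximum = elapsed;
        total += elapsed;
    }
    s.average = total / static_cast<uint64_t>(runs);
    return s;
}
```

Calling measure([] { float_kernel(10000); }, 100) returns the minimum, average and maximum tick counts over 100 runs of 10,000 iterations each. Taking the minimum over many runs is a common way to filter out interrupts and scheduling noise.
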
My benchmark also includes:
- the fixed-point version from Filter.cpp
- two variants of the float version, with some experiments on float<->int conversion ("better conv. #n")
- an SSE2 version as a straight adaptation of the floating-point version. GCC2 doesn't like inline SSE code, so I also wrote a normal assembly version for YASM.

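To illustrate what a fixed-point variant of this blend can look like, here is a small sketch next to the float form. This is not the code from Filter.cpp: the function names and the choice of 8-bit fractional weights (0..256) are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Float form of the blend for one channel, as in the inner loop above.
static inline uint8_t blend_float(uint8_t a, uint8_t b, uint8_t c, uint8_t d,
                                  float a0, float a1,
                                  float alpha0, float alpha1)
{
    return static_cast<uint8_t>((a * a0 + b * a1) * alpha0 +
                                (c * a0 + d * a1) * alpha1);
}

// Hypothetical fixed-point form. wx and wy are the weights of the
// right column and bottom row, scaled so that 256 represents 1.0.
static inline uint8_t blend_fixed(uint8_t a, uint8_t b, uint8_t c, uint8_t d,
                                  uint32_t wx, uint32_t wy)
{
    assert(wx <= 256 && wy <= 256);
    // Horizontal blends, each scaled by 256 (at most 255 * 256).
    const uint32_t top    = a * (256 - wx) + b * wx;
    const uint32_t bottom = c * (256 - wx) + d * wx;
    // Vertical blend, now scaled by 65536 (at most 255 * 65536, fits
    // in 32 bits); shift back down to an 8-bit value.
    return static_cast<uint8_t>((top * (256 - wy) + bottom * wy) >> 16);
}
```

With wx = 64 and wy = 128 this corresponds to a0 = 0.75, a1 = 0.25, alpha0 = alpha1 = 0.5 in the float form. The fixed-point path needs only integer multiplies, adds and one shift, and avoids the float<->int conversions entirely.
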
Here are the clock cycle timings for Haiku and Windows/Cygwin on my Intel Q9550, with 10,000 iterations of the core logic shown above:

    Benchmark:    ShowImage Bilinear Scale, inner loop
    Compile date: Mar 2 2009 11:15:10
    GCC version:  2.95.3-haiku-081024

    -- Results --
            Minimum   Average   Maximum
    # 1:    1640033   1641250   1646085 - 'Float math, original'
    # 2:     665007    665615    671008 - 'FixPt math, original'
    # 3:    1640033   1641255   1646102 - 'Float math, better conv. #1'
    # 4:    1680017   1681825   1692011 - 'Float math, better conv. #2'
    # 5:     169125    169282    170111 - 'SSE2'

    Benchmark:    ShowImage Bilinear Scale, inner loop
    Compile date: Mar 2 2009 10:19:35
    GCC version:  4.3.0 20080305 (alpha-testing) 1

    -- Results --
            Minimum   Average   Maximum
    # 1:     563465    565586    569712 - 'Float math, original'
    # 2:     330020    330034    330122 - 'FixPt math, original'
    # 3:     564765    565895    566899 - 'Float math, better conv. #1'
    # 4:     600023    600137    601068 - 'Float math, better conv. #2'
    # 5:     178015    179381    179927 - 'SSE2-inline'
    # 6:     167619    225279    627495 - 'SSE2'

This code is of course very extreme in its computational intensity, so these results overstate the average speedup of GCC4 vs. GCC2; but I have a hard time believing that GCC4 won't perform better than GCC2 on "normal" code. How much of a speedup can be gained depends on many factors, of course. I doubt that normal OS code will profit very much; it often does more calls than computations. But any kind of "media" or real computational code should see significant speedups[1].

Maybe your test cases suffered from some problem which prevented high speedups. I haven't tested Haiku/GCC4 yet (I need more thumbdrives...); maybe GCC4 has some problems on Haiku? I can't imagine what could cause this, but if GCC4 really isn't faster on Haiku while being faster in Cygwin, there has to be some reason for it.
I'll prepare the above benchmark for download and post a link here, so that anyone interested can test for themselves; performance deltas on other CPUs should be interesting. I'm also open to any suggestions for more "normal" test cases to compare compiler performance.
Now I'll wait 'til my Haiku/GCC4 build finishes and test my benchmark with it... you've really made me curious. ;-)
Christian

[1] Note that heavy use of float<->int conversions may result in slow execution, especially on AMD Athlon/64. This is due to the way C handles these conversions. Code making heavy use of such conversions could be performance-limited regardless of the GCC version used.
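
As background to the footnote: C and C++ define a float-to-integer cast as truncation toward zero, while the x87 FPU normally rounds to nearest, so compilers targeting x87 typically have to switch the FPU rounding mode around each cast. The sketch below contrasts the plain cast with std::lrint, which rounds using the current rounding mode. It only illustrates the semantic difference; the function names are made up, this is not necessarily what the "better conv." variants do, and whether it is actually faster depends on compiler, flags and CPU.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Plain cast: the language requires truncation toward zero.
static inline int32_t to_int_truncate(float x)
{
    return static_cast<int32_t>(x);
}

// std::lrint: rounds according to the current rounding mode,
// which is round-to-nearest-even unless the program changed it.
static inline int32_t to_int_round(float x)
{
    return static_cast<int32_t>(std::lrint(x));
}

// Rounding conversion to an 8-bit channel value; the caller is
// expected to pass a value already within [0, 255].
static inline uint8_t to_u8_round(float x)
{
    assert(x >= 0.0f && x <= 255.0f);
    return static_cast<uint8_t>(std::lrint(x));
}
```

Note that the two conversions give different results for the same input (2.7f becomes 2 with the cast and 3 with std::lrint), so swapping one for the other changes the output pixels slightly.
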