[haiku] Re: gcc4 hybrid

  • From: Christian Packmann <Christian.Packmann@xxxxxx>
  • To: haiku@xxxxxxxxxxxxx
  • Date: Mon, 02 Mar 2009 21:18:53 +0100

Stephan Assmus wrote:

> In terms of speed, I have neither noticed nor could I measure any
> difference. I had one particular project, where I suspected a GCC4 build
> would improve speed, but it didn't.

Hm, that's interesting. This is contrary to my findings when comparing code between GCC2 on Haiku and GCC4 on Cygwin, at least when testing code which does heavy calculations. I've measured differences of up to 2x for integer calculations and 3x for floating point. My measurements are done using the CPU's Time Stamp Counter; precision is about +/- 100 CPU clock cycles. I use GCC 2.95.3 on Haiku and GCC 4.3.0 in Cygwin. Benchmarks in Haiku are of course run natively, not within a VM. The codebase is identical; Jamfile and code are written to support both platforms.
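
For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of a Time Stamp Counter harness. It is not the benchmark code itself; the names read_ticks, TickStats and measure are made up for illustration, and the fallback to std::chrono on non-x86 CPUs is an assumption added for portability.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

// Read a tick counter. On x86 this is the Time Stamp Counter (RDTSC);
// elsewhere we fall back to a steady clock (assumption, for portability).
// Note: RDTSC is not a serializing instruction, so very short code
// sequences may be reordered around it.
static inline uint64_t read_ticks() {
#if defined(__i386__) || defined(__x86_64__)
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return (static_cast<uint64_t>(hi) << 32) | lo;
#else
    return static_cast<uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Minimum, average and maximum tick counts over several runs.
struct TickStats {
    uint64_t min;
    uint64_t max;
    double avg;
};

// Call fn() 'runs' times and record the elapsed ticks of each call.
template <typename Fn>
TickStats measure(Fn fn, int runs) {
    assert(runs > 0);
    TickStats s = { std::numeric_limits<uint64_t>::max(), 0, 0.0 };
    uint64_t total = 0;
    for (int i = 0; i < runs; ++i) {
        const uint64_t t0 = read_ticks();
        fn();
        const uint64_t elapsed = read_ticks() - t0;
        if (elapsed < s.min) s.min = elapsed;
        if (elapsed > s.max) s.max = elapsed;
        total += elapsed;
    }
    s.avg = static_cast<double>(total) / runs;
    return s;
}
```

Passing a function that runs the kernel 10,000 times to measure() would give minimum/average/maximum figures in the same shape as the tables further down.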

I've tested the core of an inner loop from ShowImage's ScaleBilinear method, which looks like this in the float version:

volatile uint8_t a[4], b[4], c[4], d[4], destData[4];
float a0, a1, alpha0, alpha1;

destData[0] = static_cast<uint8_t> (
     ((int32_t)a[0] * a0 + (int32_t)b[0] * a1) * alpha0 +
     ((int32_t)c[0] * a0 + (int32_t)d[0] * a1) * alpha1);
destData[1] = static_cast<uint8_t> (
     ((int32_t)a[1] * a0 + (int32_t)b[1] * a1) * alpha0 +
     ((int32_t)c[1] * a0 + (int32_t)d[1] * a1) * alpha1);
destData[2] = static_cast<uint8_t> (
     ((int32_t)a[2] * a0 + (int32_t)b[2] * a1) * alpha0 +
     ((int32_t)c[2] * a0 + (int32_t)d[2] * a1) * alpha1);
destData[3] = static_cast<uint8_t> (
     ((int32_t)a[3] * a0 + (int32_t)b[3] * a1) * alpha0 +
     ((int32_t)c[3] * a0 + (int32_t)d[3] * a1) * alpha1);

Source for the original code is in trunk/src/apps/showimage/Filter.cpp. This is just a first test version that does not walk through memory, as I only want to measure raw performance; real code would perform differently due to cache/RAM latencies.
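
To make the kernel easier to experiment with, here is a sketch that wraps the same arithmetic in a function operating on one four-channel pixel. The function name blend_pixel_float is made up, and the reading of a, b, c, d as the four neighbouring source pixels with a0/a1 and alpha0/alpha1 as horizontal and vertical weights that each sum to 1 is my assumption about how the weights are used.

```cpp
#include <cstdint>

// Blend four neighbouring pixels (4 channels each) into one output pixel.
// Assumption: a0 + a1 == 1 and alpha0 + alpha1 == 1, so every result stays
// within 0..255 and the conversion to uint8_t is well defined.
inline void blend_pixel_float(const uint8_t a[4], const uint8_t b[4],
                              const uint8_t c[4], const uint8_t d[4],
                              float a0, float a1,
                              float alpha0, float alpha1,
                              uint8_t dest[4]) {
    for (int ch = 0; ch < 4; ++ch) {
        // Same expression as the unrolled code above, one channel at a time.
        dest[ch] = static_cast<uint8_t>(
            (static_cast<int32_t>(a[ch]) * a0 +
             static_cast<int32_t>(b[ch]) * a1) * alpha0 +
            (static_cast<int32_t>(c[ch]) * a0 +
             static_cast<int32_t>(d[ch]) * a1) * alpha1);
    }
}
```

Each channel needs four int-to-float conversions and one float-to-int conversion, which is why the conversion cost matters so much in this loop.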

My benchmark also includes:
- the fixed-point version from Filter.cpp
- two variants of the float version, with some experiments on float<->int conversion ("better conv. #n")
- an SSE2 version as a straight adaptation of the floating-point version. GCC2 doesn't like inline SSE code, so I also wrote a normal assembly version for YASM.
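
For readers unfamiliar with the fixed-point approach, the following sketch shows the general idea with 8 fractional bits per weight. It is not the code from Filter.cpp; the names kFixShift, to_fixed and blend_fixed and the choice of 8 fractional bits are assumptions for illustration only.

```cpp
#include <cstdint>

// Weights are stored as integers scaled by 2^8, so 1.0 becomes 256.
const int32_t kFixShift = 8;
const int32_t kFixOne = 1 << kFixShift;

// Convert a weight in 0.0..1.0 to fixed point, rounding to nearest.
// This conversion happens once per weight, outside the per-channel math.
inline int32_t to_fixed(float w) {
    return static_cast<int32_t>(w * kFixOne + 0.5f);
}

// One channel of the bilinear blend using integer math only.
// Assumption: fa0 + fa1 == 256 and falpha0 + falpha1 == 256, so the
// intermediate sum is at most 255 * 2^16 and fits easily into int32_t.
inline uint8_t blend_fixed(uint8_t a, uint8_t b, uint8_t c, uint8_t d,
                           int32_t fa0, int32_t fa1,
                           int32_t falpha0, int32_t falpha1) {
    const int32_t top = a * fa0 + b * fa1;                   // scaled by 2^8
    const int32_t bottom = c * fa0 + d * fa1;                // scaled by 2^8
    const int32_t sum = top * falpha0 + bottom * falpha1;    // scaled by 2^16
    return static_cast<uint8_t>(sum >> (2 * kFixShift));     // drop the scale
}
```

The per-channel work is then only integer multiplies, adds and a shift, with no float<->int conversion inside the loop.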

Here are the clock cycle timings for Haiku and Windows/Cygwin on my Intel Q9550 with 10,000 iterations of the core logic shown above:

Benchmark: ShowImage Bilinear Scale, inner loop
Compile date: Mar  2 2009 11:15:10
GCC version: 2.95.3-haiku-081024
                     --  Results  --

        Minimum    Average    Maximum
# 1:   1640033    1641250    1646085  - 'Float math, original'
# 2:    665007     665615     671008  - 'FixPt math, original'
# 3:   1640033    1641255    1646102  - 'Float math, better conv. #1'
# 4:   1680017    1681825    1692011  - 'Float math, better conv. #2'
# 5:    169125     169282     170111  - 'SSE2'


Benchmark: ShowImage Bilinear Scale, inner loop
Compile date: Mar  2 2009 10:19:35
GCC version: 4.3.0 20080305 (alpha-testing) 1
                     --  Results  --

        Minimum    Average    Maximum
# 1:    563465     565586     569712  - 'Float math, original'
# 2:    330020     330034     330122  - 'FixPt math, original'
# 3:    564765     565895     566899  - 'Float math, better conv. #1'
# 4:    600023     600137     601068  - 'Float math, better conv. #2'
# 5:    178015     179381     179927  - 'SSE2-inline'
# 6:    167619     225279     627495  - 'SSE2'

This code is of course very extreme in its computational intensity, so these results overstate the average speedup of GCC4 vs. GCC2; but I have a hard time believing that GCC4 won't perform better than GCC2 on "normal" code. How much of a speedup can be gained depends on many factors, of course. I doubt that normal OS code will profit very much; it often does more calls than computations. But any kind of "media" or real computational code should see significant speedups[1].

Maybe your test cases suffered from some problem which prevented high speedups. I haven't tested Haiku/GCC4 (I need more thumbdrives...), maybe GCC4 has some problems on Haiku? I can't imagine what could cause this, but if GCC4 really isn't faster on Haiku while being faster in Cygwin, there has to be some reason for it.

I'll prepare the above benchmark for download and post a link here, so that anyone interested can test for themselves; performance deltas on other CPUs should be interesting. I'm also open to any suggestions for more "normal" test cases to compare compiler performance.

Now I'll wait 'til my Haiku/GCC4 build finishes and test my benchmark with it... you've really made me curious. ;-)

Christian


[1] Note that heavy use of float<->int conversions may result in slow execution, especially on AMD Athlon/64. This is due to the way C handles these conversions. Code making heavy use of such conversions could be performance-limited regardless of GCC version used.
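
To illustrate the footnote: a C or C++ cast from float to int must truncate toward zero, while the x87 FPU rounds to nearest by default, so x87 code typically has to switch the FPU rounding mode around each cast. The sketch below contrasts the cast with std::lrint (C++11), which converts using the current rounding mode instead. The function names are made up, and I am not claiming this is what the "better conv." variants do.

```cpp
#include <cmath>
#include <cstdint>

// Plain cast: the language requires truncation toward zero.
inline int32_t convert_truncate(float f) {
    return static_cast<int32_t>(f);
}

// std::lrint: rounds according to the current rounding mode, which is
// round-to-nearest (ties to even) unless the program has changed it.
inline int32_t convert_current_mode(float f) {
    return static_cast<int32_t>(std::lrint(f));
}
```

Note that the two functions return different values for inputs such as 2.7f (2 versus 3), so swapping one for the other changes the output as well as the speed.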

