[haiku] Re: gcc4 hibryd

From: Christian Packmann <Christian.Packmann@xxxxxx>
To: haiku@xxxxxxxxxxxxx
Date: Mon, 02 Mar 2009 21:18:53 +0100

Stephan Assmus wrote:

> In terms of speed, I have neither noticed nor could I measure any
> difference. I had one particular project, where I suspected a GCC4 build
> would improve speed, but it didn't.


Hm, that's interesting. This is contrary to my findings when comparing
code between GCC2 on Haiku and GCC4 on Cygwin, at least when testing code
which does heavy calculations. I've measured differences of up to 2x for
integer calculations, 3x for floating point.

My measurements are done using the CPU's Time Stamp Counter; precision is
about +- 100 CPU clock cycles. I use GCC 2.95.3 on Haiku, GCC 4.3.0 in
Cygwin. Benchmarks in Haiku are of course run natively, not within a VM.
The codebase is identical; Jamfile and code are written to support both
platforms.
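
For anyone who wants to reproduce this kind of measurement, here is a
minimal sketch of how min/average/max timings can be collected from the
Time Stamp Counter. This is not the benchmark's actual harness: the names
read_ticks, Timing and measure are made up for illustration, and the
non-x86 fallback reports nanoseconds rather than clock cycles.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>   // __rdtsc() on GCC/Clang for x86
#endif

// Hypothetical helper: read an increasing tick counter.
static inline uint64_t read_ticks()
{
#if defined(__i386__) || defined(__x86_64__)
    return __rdtsc();    // CPU Time Stamp Counter
#else
    // Assumption: on other targets fall back to a steady clock (ns).
    return static_cast<uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now().time_since_epoch()).count());
#endif
}

struct Timing {
    uint64_t min;
    uint64_t avg;
    uint64_t max;
};

// Run func() `runs` times and record the fastest, average and slowest run.
template <typename Func>
Timing measure(Func func, int runs)
{
    assert(runs > 0);
    Timing t = { ~uint64_t(0), 0, 0 };
    uint64_t sum = 0;
    for (int i = 0; i < runs; ++i) {
        uint64_t start = read_ticks();
        func();
        uint64_t elapsed = read_ticks() - start;
        t.min = std::min(t.min, elapsed);
        t.max = std::max(t.max, elapsed);
        sum += elapsed;
    }
    t.avg = sum / static_cast<uint64_t>(runs);
    return t;
}
```

The minimum is usually the most meaningful figure, since interrupts and
task switches can only add cycles to a run, never remove them.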

I've tested the core of an inner loop from ShowImage's ScaleBilinear
method, which looks like this in the float version:


volatile uint8_t a[4], b[4], c[4], d[4], destData[4];
float a0, a1, alpha0, alpha1;

destData[0] = static_cast<uint8_t> (
     ((int32_t)a[0] * a0 + (int32_t)b[0] * a1) * alpha0 +
     ((int32_t)c[0] * a0 + (int32_t)d[0] * a1) * alpha1);
destData[1] = static_cast<uint8_t> (
     ((int32_t)a[1] * a0 + (int32_t)b[1] * a1) * alpha0 +
     ((int32_t)c[1] * a0 + (int32_t)d[1] * a1) * alpha1);
destData[2] = static_cast<uint8_t> (
     ((int32_t)a[2] * a0 + (int32_t)b[2] * a1) * alpha0 +
     ((int32_t)c[2] * a0 + (int32_t)d[2] * a1) * alpha1);
destData[3] = static_cast<uint8_t> (
     ((int32_t)a[3] * a0 + (int32_t)b[3] * a1) * alpha0 +
     ((int32_t)c[3] * a0 + (int32_t)d[3] * a1) * alpha1);

Source for the original code is in trunk/src/apps/showimage/Filter.cpp.
This is just a first test version without walking memory access as I just
want to measure raw performance; real code would perform differently due
to cache/RAM latencies.


My benchmark also includes:
- the fixed-point version from Filter.cpp
- two variants of the float version, with some experiments on float<->int
  conversion ("better conv. #n")
- an SSE2 version as a straight adaptation of the floating-point version.
  GCC2 doesn't like inline SSE code, so I also wrote a normal assembly
  version for YASM.
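
To make the float vs. fixed-point comparison concrete, here is a rough
sketch of one channel computed both ways. The float function mirrors the
expression shown above; the fixed-point function is only an illustration
and is not the code from Filter.cpp. It assumes weights pre-scaled to 8
fractional bits, with each weight pair summing to 256. The names
blend_float, blend_fixed and to_fixed are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 8 fractional bits, so a weight of 1.0 becomes 256.
static const int kFracBits = 8;
static const int32_t kOne = 1 << kFracBits;

// Convert a weight in [0, 1] to fixed point, rounding to nearest.
static inline int32_t to_fixed(float w)
{
    return static_cast<int32_t>(w * kOne + 0.5f);
}

// One channel of the float kernel, as in the expression shown above.
static inline uint8_t blend_float(uint8_t a, uint8_t b, uint8_t c, uint8_t d,
                                  float a0, float a1,
                                  float alpha0, float alpha1)
{
    return static_cast<uint8_t>(
        ((int32_t)a * a0 + (int32_t)b * a1) * alpha0 +
        ((int32_t)c * a0 + (int32_t)d * a1) * alpha1);
}

// The same blend using integer math only.
static inline uint8_t blend_fixed(uint8_t a, uint8_t b, uint8_t c, uint8_t d,
                                  int32_t a0, int32_t a1,
                                  int32_t alpha0, int32_t alpha1)
{
    int32_t top    = a * a0 + b * a1;   // 8 fractional bits, <= 255 * 256
    int32_t bottom = c * a0 + d * a1;   // 8 fractional bits, <= 255 * 256
    int32_t sum    = top * alpha0 + bottom * alpha1;  // 16 fractional bits
    return static_cast<uint8_t>(sum >> (2 * kFracBits));
}
```

With weight pairs that sum to 256, the intermediate never exceeds
255 * 256 * 256, so 32-bit integers are sufficient and no float<->int
conversion is needed per pixel.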

Here are the clock cycle timings for Haiku and Windows/Cygwin on my Intel
Q9550 with 10,000 iterations of the core logic shown above:


Benchmark: ShowImage Bilinear Scale, inner loop
Compile date: Mar  2 2009 11:15:10
GCC version: 2.95.3-haiku-081024
                     --  Results  --

        Minimum    Average    Maximum
# 1:   1640033    1641250    1646085  - 'Float math, original'
# 2:    665007     665615     671008  - 'FixPt math, original'
# 3:   1640033    1641255    1646102  - 'Float math, better conv. #1'
# 4:   1680017    1681825    1692011  - 'Float math, better conv. #2'
# 5:    169125     169282     170111  - 'SSE2'


Benchmark: ShowImage Bilinear Scale, inner loop
Compile date: Mar  2 2009 10:19:35
GCC version: 4.3.0 20080305 (alpha-testing) 1
                     --  Results  --

        Minimum    Average    Maximum
# 1:    563465     565586     569712  - 'Float math, original'
# 2:    330020     330034     330122  - 'FixPt math, original'
# 3:    564765     565895     566899  - 'Float math, better conv. #1'
# 4:    600023     600137     601068  - 'Float math, better conv. #2'
# 5:    178015     179381     179927  - 'SSE2-inline'
# 6:    167619     225279     627495  - 'SSE2'

This code is of course very extreme in its computational intensity, so
these results overstate the average speedup of GCC4 vs 2; but I have a
hard time believing that GCC4 won't perform better than GCC2 on "normal"
code.

How much of a speedup can be gained depends on many factors, of course. I
doubt that normal OS code will profit very much; it often does more calls
than computations. But any kind of "media" or real computational code
should see significant speedups[1].

Maybe your test cases suffered from some problem which prevented high
speedups. I haven't tested Haiku/GCC4 (I need more thumbdrives...), maybe
GCC4 has some problems on Haiku? I can't imagine what could cause this,
but if GCC4 really isn't faster on Haiku while being faster in Cygwin,
there has to be some reason for it.

I'll prepare the above benchmark for download and post a link here, so
that anyone interested can test for themselves; performance deltas on
other CPUs should be interesting. I'm also open to any suggestions for
more "normal" test cases to compare compiler performance.

Now I'll wait 'til my Haiku/GCC4 build finishes and test my benchmark with
it... you've really made me curious. ;-)


Christian

[1] Note that heavy use of float<->int conversions may result in slow
execution, especially on AMD Athlon/64. This is due to the way C handles
these conversions. Code making heavy use of such conversions could be
performance-limited regardless of GCC version used.
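
As a rough illustration of the conversion issue: a C/C++ cast from float
to int must truncate toward zero, which on the x87 FPU typically means
switching the rounding mode before the store and restoring it afterwards.
Rounding in the FPU's current mode avoids that switch. The sketch below
only demonstrates the difference in semantics; the helper names are made
up and are not taken from the benchmark's "better conv." variants.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// C/C++ cast semantics: the fractional part is discarded
// (rounding toward zero), whatever the current FPU rounding mode is.
static inline int32_t convert_truncate(float f)
{
    return static_cast<int32_t>(f);
}

// std::lrint rounds according to the current rounding mode,
// which is round-to-nearest (ties to even) by default.
static inline int32_t convert_round(float f)
{
    return static_cast<int32_t>(std::lrint(f));
}
```

Note that the two functions return different results for most
non-integral inputs, so swapping one for the other changes the output
image slightly unless the rounding is accounted for elsewhere.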

