Nick - 2009-03-09 05:28 :
You can also change -O2 to -O3 if you think stability is not affected. You can change mtune to a lesser CPU ( =pentium3, pentium-m, athlon-xp, etc. ) but I believe many are getting or using Core2s these days. Maybe mtune=pentium3 or athlon-xp would be the safer choices?
If you want to tune for modern CPUs, you should follow the GCC manual and use -mtune=generic.
However, this may result in very bad performance on an in-order architecture like Intel's Atom. Some of the modern out-of-order CPUs seem to prefer their code grouped into blocks of serial dependencies; I guess this makes the scheduler's job easier. I observed this with MMX code on an Athlon XP: the original code, with lots of serial dependencies, was 5% faster than my hand-"optimized" version, which tried to decouple the dependencies. I don't know whether modern GCC versions actually use serial-dependent ordering; I haven't examined this yet. So the following only applies if GCC really does generate code ordered this way for the modern OOOE architectures. If it always/mostly decouples serial dependencies, this is a moot point.
Using serial-dependent code streams is all fine and dandy on nearly all CPUs, but it can break badly on Atom (and C7, C3). These are in-order designs: they can only execute instructions in the order they appear in the instruction stream. If the code is arranged in serial dependencies, performance may drop significantly. How much depends on the instructions used. Low-latency instructions (reg2reg ADD, SHIFT, AND, OR, etc.) won't suffer much, but code with long-latency instructions like MUL or memory accesses can lead to significant delays. On OOOE CPUs, latencies in the instruction stream are usually hidden by the reordering the hardware does at run time. An in-order design can't do that; it is totally dependent on the quality of the code it executes.
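To make the two orderings concrete, here is a minimal C sketch. The function names and the constants 31 and 17 are made up for illustration. Both functions compute the same two multiply-add chains: the first runs each chain to completion (serial-dependent blocks), the second interleaves them so that an independent instruction is available while a MUL is still in flight. An optimizing compiler is free to reschedule either version, so treat this as an illustration of the idea and check the generated assembly before drawing conclusions.

```c
#include <stddef.h>
#include <stdint.h>

/* Grouped ordering: chain A runs to completion, then chain B.
   Inside each loop every multiply depends on the previous result,
   so an in-order core has nothing else to issue while it waits. */
void chains_grouped(const uint32_t *x, size_t n, uint32_t *ra, uint32_t *rb)
{
    uint32_t a = 1, b = 1;
    for (size_t i = 0; i < n; i++)
        a = a * 31u + x[i];
    for (size_t i = 0; i < n; i++)
        b = b * 17u + x[i];
    *ra = a;
    *rb = b;
}

/* Interleaved ordering: the two chains are independent of each other,
   so the multiply for b can be issued while the one for a is pending. */
void chains_interleaved(const uint32_t *x, size_t n, uint32_t *ra, uint32_t *rb)
{
    uint32_t a = 1, b = 1;
    for (size_t i = 0; i < n; i++) {
        a = a * 31u + x[i];
        b = b * 17u + x[i];
    }
    *ra = a;
    *rb = b;
}
```

Both versions return identical results (unsigned arithmetic wraps the same way in either order); only the instruction ordering presented to the CPU differs. Timing them on an Atom and on an out-of-order CPU would be one way to test the claim above.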
Using -mtune=pentium may actually give acceptable code for Atom, as it is also a dual-issue in-order design. The pairing rules for parallel execution will be different, and floating point code will likely not be optimized as the Pentium's FPU wasn't pipelined. But overall it may be better than scheduling for "normal" CPUs. The real solution will only come when GCC adds -mtune=atom.
The tradeoff for compiling in-order code is slightly slower performance on OOOE CPUs, but the loss should be no more than 5-10%. An Atom, on the other hand, may lose at least 50% (WAG, needs benchmarking), depending on the number of serial dependencies. And it needs the performance help the most, as it is one of the slowest CPUs on the market.
This dependency problem is also the reason that Atom is multi-threaded (Hyper-Threading): when one thread runs into a stall due to scheduling problems, the other thread can execute instructions. But this will only help if the workload is evenly split across two threads, which often will not be the case. Using good code in the first place is the better solution.