[haiku-development] Re: On timeslices and cycles

André Braga - 2009-03-13 01:53 :
On Thu, Mar 12, 2009 at 20:58, Stephan Aßmus <superstippi@xxxxxx> wrote:
The human perception is a more or less fixed factor. I don't think anything
can be gained (ie. be made to appear more fluently) by switching more often,
unless you have a lot of threads actually running in parallel (ie not
waiting on something). So switching more often sounds like it would only
waste CPU.

CPU overhead itself should be the lesser problem on modern designs. Cache thrashing may have a bigger impact. Each running thread fills the cache with its own data; if you switch too soon, the cache is constantly reloaded with different data instead of being used efficiently.

While modern systems have high memory bandwidths which make this reloading fast, cache sizes are growing as fast as memory bandwidth - we're quickly heading for 10+MiB as standard 2nd level cache, and the amount will probably increase to 32-64MiB in the next six years or so.

On the other hand, as far as cycles and IPS are concerned, a
millisecond on a 200MHz CPU is a *lot* different than a millisecond on a
3GHz CPU. Not taking this into consideration if you boost thread
priorities based on consumed quantum is a *bad* idea.

Clock frequency doesn't matter in itself. Average IPC (instructions per cycle) times clock frequency does. This varies wildly between different microarchitectures. A dual-issue in-order design like Intel Atom should see an average IPC of maybe 1.25-1.5; when running two threads, obviously <=1 per thread, as only two instructions can be dispatched per cycle, one for each thread. A highly efficient out-of-order design like the Core2 should normally achieve an IPC of 2.5, often higher for integer code (the core can sustain 4 dispatch+retire per cycle). So comparing them at the same frequency, you get a performance difference of >2x from the microarchitecture alone.
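To put numbers on this, here is a small Python sketch of "IPC times clock frequency". The function name and the 1.6GHz clock are made up for the illustration; the IPC values are the rough averages quoted above, not measurements.

```python
def instructions_per_ms(clock_hz, avg_ipc):
    """Rough number of instructions retired in one millisecond."""
    return clock_hz * avg_ipc / 1000.0


# Same clock, different microarchitectures (IPC figures from the text above).
atom_like = instructions_per_ms(1.6e9, 1.25)   # in-order, dual-issue
core2_like = instructions_per_ms(1.6e9, 2.5)   # out-of-order
arch_ratio = core2_like / atom_like

# Same (assumed) IPC, different clocks: the 200MHz vs. 3GHz case.
slow = instructions_per_ms(200e6, 1.0)
fast = instructions_per_ms(3e9, 1.0)
clock_ratio = fast / slow

print(arch_ratio, clock_ratio)
```

With these inputs the microarchitecture alone accounts for a factor of about 2 and the clock alone for a factor of about 15, so a millisecond of consumed quantum means very different amounts of work.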

It is also possible that Intel will push a design like Larrabee into the CPU sector at some point in time. Larrabee is in-order dual-issue with 4-way SMT; with four threads, this would give an average IPC of <=0.5 per thread. Compared to Core2, this would give a difference factor of >5 per cycle.

If something like your idea is implemented, the clock frequencies need to be normalized against the CPU architecture. This applies not only to x86 but to other CPU architectures as well; ARM also has in-order and out-of-order designs, even though the IPC doesn't vary as wildly as it does on x86.
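One way such a normalization could look, as a sketch only: convert consumed time into "reference CPU time" before it is used for any priority decision. The reference CPU (3GHz, IPC 2.5), the function name, and the IPC of 1.0 for the old 200MHz part are assumptions for the example; a real scheduler would need per-architecture IPC estimates from somewhere.

```python
# Hypothetical reference machine that all usage is normalized against.
REFERENCE_CLOCK_HZ = 3.0e9
REFERENCE_IPC = 2.5


def normalized_usage_us(consumed_us, clock_hz, avg_ipc):
    """Express consumed CPU time as microseconds on the reference CPU."""
    reference = REFERENCE_CLOCK_HZ * REFERENCE_IPC
    actual = clock_hz * avg_ipc
    return consumed_us * actual / reference


# One millisecond on a 200MHz CPU with an assumed IPC of 1.0 ...
old_cpu = normalized_usage_us(1000, 200e6, 1.0)
# ... versus one millisecond on the reference CPU itself.
ref_cpu = normalized_usage_us(1000, 3.0e9, 2.5)

print(round(old_cpu, 2), ref_cpu)
```

Here a full millisecond on the slow CPU counts as only about 27 microseconds of reference time, which is the kind of correction a quantum-based priority boost would need.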

As for the CPUs which have different speed, I think it's also a concern for
Hyper Threading. You wouldn't want to schedule a thread on a second logical
core, if another physical core is readily available at the same time. So you
need some kind of speed-bonus associated with each CPU anyways.

I'm discussing this very matter in the article I'm writing. :)

First a nitpick: Hyper Threading is Intel's trademarked name for its SMT implementation on x86. It would be better if you'd call SMT SMT in a general discussion. :-) Eh, and SMT = Simultaneous Multi-Threading.

Efficient use of SMT CPUs is a problem in itself. I don't think this can be elegantly solved on the OS side in current architectures (except POWER). For efficient use of SMT, you'd need to know if a thread is memory-bound or CPU-bound; two memory-bound threads on one core will perform very badly, as they're competing for memory bandwidth. Two CPU-bound threads will compete for execution resources and also perform badly. The ideal solution is to run one memory-bound and one CPU-bound thread on one core.
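Assuming the scheduler somehow knew each thread's kind, the pairing rule could be sketched like this in Python. The function, the "cpu"/"mem" labels, and the thread names are hypothetical; the sketch only covers two-way SMT and ignores priorities entirely.

```python
from collections import deque


def pair_for_smt(threads):
    """Pair threads onto 2-way SMT cores.

    threads: list of (name, kind) with kind in {"cpu", "mem"}.
    Returns a list of (thread_a, thread_b) tuples, one per core;
    thread_b is None when a core runs a single thread.
    """
    cpu = deque(name for name, kind in threads if kind == "cpu")
    mem = deque(name for name, kind in threads if kind == "mem")
    cores = []
    # Preferred case: one CPU-bound and one memory-bound thread per core.
    while cpu and mem:
        cores.append((cpu.popleft(), mem.popleft()))
    # Whatever is left is all of one kind; it has to share cores anyway.
    leftovers = cpu or mem
    while leftovers:
        first = leftovers.popleft()
        second = leftovers.popleft() if leftovers else None
        cores.append((first, second))
    return cores


layout = pair_for_smt([("encode", "cpu"), ("copy", "mem"), ("hash", "cpu")])
print(layout)
```

The hard part is not the pairing but obtaining the "cpu"/"mem" labels in the first place, which is what the two approaches below are about.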

I can think of two approaches for this:
1. the ability to set CPU affinity for a thread, so that an application developer can choose the thread-to-CPU layout himself.
2. adding a flag to thread spawning routines that indicates whether a thread is memory-bound, CPU-bound or general (i.e., a mixture of both).
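As a rough picture of what both approaches might look like from the application side, here is a Python mock-up. Everything in it (ThreadKind, spawn_thread, the kind_hint and cpu_hint attributes) is invented for illustration; it is not Haiku's API, and Python itself ignores the hints. They are merely recorded so a scheduler could, in principle, read them.

```python
import threading
from enum import Enum


class ThreadKind(Enum):
    """Hypothetical hint describing what a thread mostly waits on."""
    GENERAL = 0        # a mixture of both
    CPU_BOUND = 1
    MEMORY_BOUND = 2


def spawn_thread(func, kind=ThreadKind.GENERAL, cpus=None):
    """Start a thread and attach scheduling hints to it.

    kind: approach 2, the memory-bound/CPU-bound/general flag.
    cpus: approach 1, the set of CPUs the thread would like to stay on.
    """
    thread = threading.Thread(target=func)
    thread.kind_hint = kind
    thread.cpu_hint = frozenset(cpus) if cpus is not None else None
    thread.start()
    return thread


worker = spawn_thread(lambda: None, ThreadKind.MEMORY_BOUND, cpus=[0, 1])
worker.join()
print(worker.kind_hint, sorted(worker.cpu_hint))
```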

Actually a mixture of both would be good. For some applications which can use all cores, setting the CPU affinity is extremely useful to prevent "core hopping". This would go for e.g. Handbrake or other video transcoders, which can load most CPUs fully - making thread rescheduling superfluous. This would also allow an application to maximize cache usage on systems with asymmetrical cache (Intel's Core2 quads have two 2nd level caches, one for core 0+1, one for core 2+3). If the threads of a specific application could benefit from a particular thread/CPU affinity because some threads share lots of data while others don't, fixed CPU affinity could optimize the performance, something the OS can never do, since it has to treat threads as "black boxes".
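For the Core2 quad case, an application that knows which of its threads share data could compute a layout along these lines. This is a sketch: the cache domains are hard-coded from the description above, and the thread names and function are made up.

```python
# Cores that share a 2nd level cache, as described for Core2 quads above.
L2_DOMAINS = [(0, 1), (2, 3)]


def place_groups(groups, domains=L2_DOMAINS):
    """Map each group of data-sharing threads onto one cache domain.

    groups: list of lists of thread names; threads in one list share data.
    Returns {thread_name: core_number}.
    """
    if len(groups) > len(domains):
        raise ValueError("more thread groups than cache domains")
    placement = {}
    for group, cores in zip(groups, domains):
        if len(group) > len(cores):
            raise ValueError("group does not fit into one cache domain")
        for thread, core in zip(group, cores):
            placement[thread] = core
    return placement


layout = place_groups([["decode", "filter"], ["encode", "mux"]])
print(layout)
```

The resulting core numbers would then be handed to whatever affinity call the OS provides; the point is only that the grouping information lives in the application, not in the scheduler.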

Oh, and fixing CPU affinity would also allow me to write proper benchmarking tools without having to worry about core hopping, so I'm not quite neutral in this matter. ;-)
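For comparison, this is roughly how a pinned benchmark run looks where an affinity call already exists. The sketch uses Python's os.sched_setaffinity, which is only available on some platforms (Linux, notably), so it falls back to an unpinned run elsewhere; run_pinned is a made-up helper name.

```python
import os
import time


def run_pinned(func, cpu):
    """Time func() while the caller is restricted to a single CPU.

    Returns (result, elapsed_seconds). Pinning is skipped on platforms
    without os.sched_setaffinity.
    """
    can_pin = hasattr(os, "sched_setaffinity")
    if can_pin:
        previous = os.sched_getaffinity(0)   # pid 0 means the caller
        os.sched_setaffinity(0, {cpu})
    try:
        start = time.perf_counter()
        result = func()
        elapsed = time.perf_counter() - start
    finally:
        if can_pin:
            os.sched_setaffinity(0, previous)  # restore the old mask
    return result, elapsed


# Pick a CPU we are actually allowed to run on.
cpu = min(os.sched_getaffinity(0)) if hasattr(os, "sched_getaffinity") else 0
result, elapsed = run_pinned(lambda: sum(range(1000)), cpu)
print(result)
```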

Christian
