[haiku-development] Re: On timeslices and cycles

  • From: Christian Packmann <Christian.Packmann@xxxxxx>
  • To: haiku-development@xxxxxxxxxxxxx
  • Date: Mon, 16 Mar 2009 13:19:55 +0100

André Braga - 2009-03-13 13:38 :
Em 13/03/2009, às 07:28, Christian Packmann <Christian.Packmann@xxxxxx> escreveu:
André Braga - 2009-03-13 01:53 :

If something like your idea is implemented, the clock frequencies need to be normalized against the CPU architecture. Not only on x86, but on other CPU architectures as well; ARM also has in-order and out-of-order designs, even though the IPC doesn't vary as wildly as it does on x86.

A calibration loop with a *good* mix of instructions during CPU initialisation should suffice, no?

What is a "good" mix? That is a very hard question. If it were
possible to reliably measure a CPU's performance with a single set of
instructions, I'd guess that SPEC would be delighted to hear about it. You
could use Dhrystone... whose results Linux aptly reports as BogoMIPS,
because the results are so bogus.
However, if you find a simple set of instructions which does this, I
promise you immediate fame. :-)
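Just to make the difficulty concrete, a naive calibration loop might look like the sketch below: it times an arbitrary mix of integer and FP work and turns the result into a score. Everything here (the mix, the iteration count, the use of clock()) is invented for illustration, and the arbitrariness of the mix is exactly the problem: any fixed mix favors some microarchitectures over others.

```c
#include <time.h>

/* Hypothetical calibration loop: times a fixed mix of integer and
 * floating-point work and returns iterations per millisecond.
 * The "mix" below is arbitrary -- which is the core problem: any
 * fixed mix will favor some microarchitectures over others. */
double calibrate(long iterations)
{
    volatile long acc = 0;      /* volatile so the loop isn't optimized away */
    volatile double facc = 1.0;
    clock_t start = clock();
    for (long i = 0; i < iterations; i++) {
        acc += i ^ (i >> 3);    /* integer ALU work */
        facc *= 1.0000001;      /* FP multiply work */
    }
    clock_t end = clock();
    double ms = (double)(end - start) * 1000.0 / CLOCKS_PER_SEC;
    return ms > 0.0 ? iterations / ms : 0.0;
}
```

Whatever number this returns says more about the chosen mix than about the CPU.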

You also have to take into account the performance of the memory subsystem
relative to the core speed (IPC, that is), cache behavior, performance for
integer/FP/vector code in pure and mixed forms, and of course the behavior
of all possible combinations of the above on SMT-cores. Fun when you do
that on a 4-way SMT core...

I'd just use some fudge-factor for this depending on CPU type and
frequency, which gives a rough approximation of the performance. Actually
doing the measurements would be so complicated, it wouldn't be worth the
bother IMO. Please take note that I'm usually all for fancy benchmarking,
but this is something I wouldn't touch with a bargepole.
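A fudge-factor approach could be as simple as a lookup table keyed on microarchitecture type. The families and the factors below are made up purely for illustration, not real calibration data:

```c
/* Sketch of a per-CPU-type fudge factor: a rough relative-performance
 * multiplier keyed on microarchitecture rather than anything measured.
 * The categories and factors are invented for illustration. */
enum cpu_uarch { UARCH_ATOM_INORDER, UARCH_P6_OOO, UARCH_K8_OOO, UARCH_UNKNOWN };

double uarch_factor(enum cpu_uarch u)
{
    switch (u) {
    case UARCH_ATOM_INORDER: return 0.45; /* in-order: low IPC        */
    case UARCH_P6_OOO:       return 1.00; /* baseline out-of-order    */
    case UARCH_K8_OOO:       return 0.95;
    default:                 return 0.70; /* conservative guess       */
    }
}

/* Performance score = clock (MHz) scaled by the fudge factor. */
double perf_score(double mhz, enum cpu_uarch u)
{
    return mhz * uarch_factor(u);
}
```

Crude, but cheap to compute at boot and good enough for a rough ordering of cores.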

However, if your main worry is effective scheduling on same-ISA AMP
systems, then the CPU identification process should tell you about the
relative abilities of the different cores. As the OS needs to identify CPU
abilities anyway, this would be a good place to gather the required
information for the scheduler.

I can think of two approaches for this:
1. The ability to set CPU affinity for a thread, so that an application developer can select the CPU/thread layout himself.
2. A flag on the thread-spawning routines which indicates whether a thread is memory-bound, CPU-bound or general (i.e., a mixture of both).
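To make the two approaches concrete, here is a sketch of what such an interface might look like. None of these names, signatures or error codes exist in Haiku; they are invented for illustration, with simplified error values:

```c
#include <stdint.h>

/* Hypothetical sketches of the two approaches above.  Names,
 * signatures and error codes are invented; real Haiku error
 * constants differ. */

typedef int32_t status_t;
enum { B_OK = 0, B_BAD_VALUE = -1 };

#define MAX_THREADS 64
static uint32_t thread_affinity[MAX_THREADS]; /* 0 == "any CPU" */

/* Approach 1: explicit affinity -- the developer pins a thread to a
 * set of CPUs given as a bitmask. */
status_t set_thread_affinity(int32_t thread, uint32_t cpu_mask)
{
    if (thread < 0 || thread >= MAX_THREADS || cpu_mask == 0)
        return B_BAD_VALUE;  /* must allow at least one CPU */
    thread_affinity[thread] = cpu_mask;
    return B_OK;
}

/* Approach 2: a workload hint passed at spawn time, which the
 * scheduler may weigh but is free to ignore. */
enum thread_workload {
    WORKLOAD_GENERAL,     /* mixed behavior (the default)  */
    WORKLOAD_CPU_BOUND,   /* dominated by computation      */
    WORKLOAD_MEMORY_BOUND /* dominated by memory traffic   */
};
```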

I don't know the first thing about performance counters, except that they exist. With this disclaimer in place, can't they be used to derive the behaviour of a thread in this regard? Like if it spent a lot of time waiting for memory or if it actually taxed the CPU with calculations.

Good idea. Had it myself a couple days ago, but had to dismiss it. Some reasons for that in no special order:

While modern CPUs offer dozens of variables you can measure, you can only measure around four at a time (depending on the CPU). This is of little relevance when optimizing programs - you just run your test code repeatedly with different measurements - but it's a significant problem when you want to examine real-time behavior from a scheduler. For really meaningful analysis and efficient scheduling (especially on SMT systems) you'd have to track x86 instructions, µops, FP/vector instructions, L1D performance, L2 performance, retired branches and mispredicted branches. That doesn't work due to the lack of PMCs: using only four of them gives you too little information to really judge a thread's behavior.

Intel's SMT CPUs (so far) have only one set of PMCs per core, so
they measure the behavior of the two threads combined. This would require
you to keep the other hardware thread free to get usable measurements. So
you'd have to isolate the software threads from time to time to measure
their current behavior. This is possible, but awkward.
You'd also have to test any combination of threads you run on an SMT core,
as any combination of threads may exhibit specific problems: L1I stalls,
L1D/L2 stalls, instruction decode overload, overload of either the integer or
FP/vector execution units, etc. This behavior depends on how well the threads share a CPU's execution resources, and you can't extrapolate it from their behavior when run in isolation; specific problems may only occur when threads run together on one SMT CPU.
This is where POWER shines; it has fine-grained measurements down to the
thread-level, and the CPU even supports the concept of priority at the
hardware thread level; so the OS can tell the CPU "run thread 1 at
priority X, thread 2 at Y". Such fine-grained control is really required
to implement this concept in a clean way, and IBM can only do it because
they sell integrated systems and have complete control over OS and CPU,
where each component can influence the other's development.
Maybe we'll get such fine control on general CPUs one day, but I haven't
heard of it yet.

Then, there is no standard for PMC events. While in general you get more
variables to examine with each processor generation, some measurements are
specific to a certain CPU or CPU generation. The measurements rarely
translate 1:1 between Intel and AMD, and of course both have unique variables.
Interpretation of cache misses depends to a high degree on the cache
architecture/topology.
The same goes for retired µops: on Intel they may be fused, on AMD they aren't
(so far); you'd have to track the number of fused µops to judge actual
performance, which again eats into the limited number of PMC registers.
And yes, you need to track µops instead of x86 instructions, as the µops tell you about the usage of execution units, which always execute single µops. Only tracking x86 instructions won't give usable information here.

If you wanted to implement PM in a scheduler, you'd have to distill all
available events down to a few you can use across all CPUs with PMCs, and
then write appropriate drivers for all CPUs.
But any new CPU may invalidate your previous design work if its PM changes
too much. And of course you can't use it on all CPUs: not all CPUs support
PM at all, some don't expose enough (or the right) PM variables, and for some you can't easily get the PMC documentation.

Oh, and the slightest error in your PM implementation (e.g. one bit set wrongly) may crash the system at once. So you have to thoroughly test any change to the PMC-reading code on *all* affected CPUs (and preferably, CPU steppings) to make sure it runs okay. This is again no problem when using PMCs during program development, where one can work around such problems as they arise, but it is a problem in OS code.

Finally, you'd need the ability to turn off PM in the scheduler to make the
PMCs available to developers wanting to test their code; the scheduler
would then just drop back to "dumb" scheduling.
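That hand-over could be as simple as an ownership flag. A minimal sketch, assuming a single claimant and ignoring locking entirely:

```c
#include <stdbool.h>

/* Hypothetical PMC ownership hand-over: the scheduler only uses PMC
 * feedback while no developer tool has claimed the counters, and
 * falls back to "dumb" load-based scheduling otherwise. */

static bool pmc_claimed_by_user = false;

/* A profiler claims the counters; fails if they're already taken. */
bool pmc_acquire(void)
{
    if (pmc_claimed_by_user)
        return false;
    pmc_claimed_by_user = true;
    return true;
}

void pmc_release(void)
{
    pmc_claimed_by_user = false;
}

/* Called by the scheduler: may it use PMC feedback right now? */
bool scheduler_may_use_pmc(void)
{
    return !pmc_claimed_by_user;
}
```

A real implementation would need this to be per-CPU and properly synchronized, but the principle is just a mutual-exclusion flag between scheduler and profiler.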

Summary: I don't see this working with the current generation of CPUs. PMCs are a fantastic tool for program development, but the way they are currently implemented (on x86) makes them unusable for general use in a scheduler, IMO. Unless we port to POWER, of course. :-)

If the threads of a specific application could benefit from a peculiar thread/CPU affinity because some threads share lots of data while others don't, fixed CPU affinity could optimize the performance, which the OS never can do having to look at threads as "black boxes".

Sure. Still, you can't completely avoid cache thrashing if your application isn't the sole app running in the system. :)

Hehe, of course. But I was thinking about the asymmetrical cache structure
of my C2Quad, with 2x6 MiB L2 cache. Using hard affinity, an application
could schedule its data-sharing threads to make optimal use of this cache
structure; this is something an OS cannot achieve, unless you can
give a really fine-tuned set of "hardware thread prefs" at thread creation.

I've considered a "pin this CPU to this thread" (as opposed to the usual other-way-around) option for the affinity settings, but it seems that Axel didn't like this and wanted the scheduler to be more automagical in this regard. The default behaviour WILL be to run a thread on the same processor it last ran on, to spawn new threads on the least-loaded processors, and to consider the load delta between the various processors as well as topology (buses, shared caches, performance score (for the AMP case)) when balancing load.
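As a toy model of that default policy, the sketch below prefers the CPU a thread last ran on unless another CPU is sufficiently less loaded. The 25% migration threshold and the flat load array are invented; a real scheduler would also weigh topology (shared caches, buses, AMP performance scores):

```c
/* Toy "sticky" CPU selection: stay on the last CPU unless migrating
 * buys a large enough load reduction to outweigh cold caches.
 * The threshold value is an invented illustration. */

#define NCPU 4
#define MIGRATION_THRESHOLD 25  /* load-percentage delta required to move */

int pick_cpu(const int load[NCPU], int last_cpu)
{
    int best = last_cpu;
    for (int i = 0; i < NCPU; i++) {
        if (load[i] < load[best])
            best = i;
    }
    /* Stay put unless the win is big enough to pay for migration. */
    if (load[last_cpu] - load[best] < MIGRATION_THRESHOLD)
        return last_cpu;
    return best;
}
```

With loads {80, 70, 75, 78} a thread last on CPU 0 stays there (the 10-point win isn't worth the move); with {90, 10, 75, 78} it migrates to CPU 1.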

Which is good general behavior for most cases. And getting this right is hard enough, considering the problems of SMT and NUMA.

Oh, and fixing CPU affinity would also allow me to write proper benchmarking tools without having to worry about core hopping, so I'm not quite neutral in this matter. ;-)

Which is why feedback is important. What kind of (behaviour-influencing, as in B_SAME_PACKAGE_BUT_AVOID_LOGICAL or what have you; not just processor masking) flags would you like to be there when you set affinity for threads on a team?

Haven't thought about it on a general level so far; my main concern was
the ability to get proper timings. :-)

After thinking a bit, I'd guess the following variables should be sufficient to cover all major use cases:

Type of instructions:
- thread is integer-heavy
- thread is FPU-heavy
- thread is vector-heavy[1]
-> and all possible combinations, probably weighted; so don't make them
flags, but numerical values

[1] Not relevant on x86 where FPU and vector operations happen in the same
pipelines, but of relevance for e.g. PowerPC which has separate FPU and
vector units.


General behavior
- thread is memory-bound
- thread is CPU-bound
- thread is I/O bound[2]
-> and all possible combinations, weighted as above

[2] i.e. external I/O; on multi-socket systems, where (AFAIK) only one CPU is connected to the southbridge, such threads should be scheduled on that CPU if possible, or on a CPU close in the topology. This obviously depends on the topology: if you have 1-hop connections across all CPUs, this is not a big problem; if multiple hops are involved, it gets more important.


Thread communication behavior:
- communicates with threads X, Y, Z
- extent of communication (per thread): light, medium, heavy
- direction (per thread): unidirectional or bidirectional


Special behavior:
- always run isolated and uninterrupted on physical core [3]

[3] Yes, my TSC and PMC measurements. :-)


I've probably overlooked something, but this should cover most basic scenarios. If the scheduler interprets these settings intelligently and flawlessly at all times, you can safely omit hard affinity.
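One possible way to pack the variables above into a per-thread hint structure; every field name and the 0-255 weighting scale are assumptions made for illustration, with 0 meaning "no information":

```c
#include <stdint.h>

/* Hypothetical per-thread scheduling hints covering the variables
 * listed above.  All names and the 0..255 weight scale are invented. */

#define MAX_PEERS 4

enum comm_direction { COMM_NONE, COMM_UNI, COMM_BI };

struct thread_sched_hints {
    /* Instruction mix, weighted 0 (none) .. 255 (dominant). */
    uint8_t integer_weight;
    uint8_t fpu_weight;
    uint8_t vector_weight;   /* separate unit on e.g. PowerPC */

    /* General behavior, same 0..255 weighting. */
    uint8_t memory_bound;
    uint8_t cpu_bound;
    uint8_t io_bound;        /* prefer a CPU near the I/O hub */

    /* Communication with other threads. */
    struct {
        int32_t peer;                 /* thread id, or -1 if unused */
        uint8_t intensity;            /* light .. heavy             */
        enum comm_direction direction;
    } comm[MAX_PEERS];

    /* Special: run alone on a physical core (TSC/PMC measurements). */
    uint8_t run_isolated;
};
```

Using weighted numerical fields rather than flags, as suggested above, lets the scheduler rank threads against each other instead of sorting them into rigid categories.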

Of course, if you manage to write such a scheduler, you'd be silly to donate it to Haiku. Take it and apply to all the major companies that do OS development; you should get a good, well-paid job out of it.

I'd rather concentrate on getting the "basic" behavior right (which already is very complicated when trying to load-balance properly on SMT NUMA systems), and just offer some very basic flags, e.g. for I/O-heavy threads. Trying to write the "perfect" scheduler is not a feasible undertaking IMHO. The number of variables you'd have to correlate to achieve anything like good scheduling for complicated cases is just too high, and doing the necessary calculations gets really expensive the higher your thread count (hardware and software) gets. And seeing that 16-thread CPUs will be available on the high-end desktop next year (Sandy Bridge), the hardware thread count will likely rise to 32-64 by 2014 or so; for multi-socket systems, 64-256. Keeping the scheduling simple will be important on such systems, I think. However, I've never written a scheduler, so maybe I'm wildly overestimating the problems. Any details on this would be welcome.

Christian
