[haiku-development] Re: On timeslices and cycles

  • From: Christian Packmann <Christian.Packmann@xxxxxx>
  • To: haiku-development@xxxxxxxxxxxxx
  • Date: Mon, 16 Mar 2009 13:19:55 +0100

André Braga - 2009-03-13 13:38 :
Em 13/03/2009, às 07:28, Christian Packmann <Christian.Packmann@xxxxxx> escreveu:
André Braga - 2009-03-13 01:53 :

If something like your idea is implemented, the clock frequencies need to be normalized against the CPU architecture. Not only on x86, but on other CPU architectures as well; ARM also has in-order and out-of-order designs, even though the IPC doesn't vary as wildly as it does on x86.

A calibration loop with a *good* mix of instructions during CPU initialisation should suffice, no?

What is a "good" mix? That is a very hard question. If it were
possible to reliably measure a CPU's performance with a single set of
instructions, I'd guess that SPEC would be delighted to hear about it. You
could use Dhrystone... whose results Linux aptly reports as BogoMIPS,
because the results are so bogus.
However, if you find a simple set of instructions which does this, I
promise you immediate fame. :-)
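Just to make the difficulty concrete, a naive calibration loop might look like the sketch below: it times an arbitrary mix of integer and FP work and turns the result into a score. Everything here (the mix, the iteration count, the use of clock()) is invented for illustration, and the arbitrariness of the mix is exactly the problem: any fixed mix favors some microarchitectures over others.

```c
#include <time.h>

/* Hypothetical calibration loop: times a fixed mix of integer and
 * floating-point work and returns iterations per millisecond.
 * The "mix" below is arbitrary -- which is the core problem: any
 * fixed mix will favor some microarchitectures over others. */
double calibrate(long iterations)
{
    volatile long acc = 0;      /* volatile so the loop isn't optimized away */
    volatile double facc = 1.0;
    clock_t start = clock();
    for (long i = 0; i < iterations; i++) {
        acc += i ^ (i >> 3);    /* integer ALU work */
        facc *= 1.0000001;      /* FP multiply work */
    }
    clock_t end = clock();
    double ms = (double)(end - start) * 1000.0 / CLOCKS_PER_SEC;
    return ms > 0.0 ? iterations / ms : 0.0;
}
```

Whatever number this returns says more about the chosen mix than about the CPU.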

You also have to take into account the performance of the memory subsystem
relative to the core speed (IPC, that is), cache behavior, performance for
integer/FP/vector code in pure and mixed forms, and of course the behavior
of all possible combinations of the above on SMT-cores. Fun when you do
that on a 4-way SMT core...

I'd just use some fudge-factor for this depending on CPU type and
frequency, which gives a rough approximation of the performance. Actually
doing the measurements would be so complicated, it wouldn't be worth the
bother IMO. Please take note that I'm usually all for fancy benchmarking,
but this is something I wouldn't touch with a bargepole.
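A fudge-factor approach could be as simple as a lookup table keyed on microarchitecture type. The families and the factors below are made up purely for illustration, not real calibration data:

```c
/* Sketch of a per-CPU-type fudge factor: a rough relative-performance
 * multiplier keyed on microarchitecture rather than anything measured.
 * The categories and factors are invented for illustration. */
enum cpu_uarch { UARCH_ATOM_INORDER, UARCH_P6_OOO, UARCH_K8_OOO, UARCH_UNKNOWN };

double uarch_factor(enum cpu_uarch u)
{
    switch (u) {
    case UARCH_ATOM_INORDER: return 0.45; /* in-order: low IPC        */
    case UARCH_P6_OOO:       return 1.00; /* baseline out-of-order    */
    case UARCH_K8_OOO:       return 0.95;
    default:                 return 0.70; /* conservative guess       */
    }
}

/* Performance score = clock (MHz) scaled by the fudge factor. */
double perf_score(double mhz, enum cpu_uarch u)
{
    return mhz * uarch_factor(u);
}
```

Crude, but cheap to compute at boot and good enough for a rough ordering of cores.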

However, if your main worry is effective scheduling on same-ISA AMP
systems, then the CPU identification process should tell you about the
relative abilities of the different cores. As the OS needs to identify CPU
abilities anyway, this would be a good place to gather the required
information for the scheduler.

I can think of two approaches for this:
1. The ability to set CPU affinity for a thread, so that an application developer can select the CPU/thread layout himself.
2. A flag on the thread-spawning routines which indicates whether a thread is memory-bound, CPU-bound or general (i.e., a mixture of both).
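To make the two approaches concrete, here is a sketch of what such an interface might look like. None of these names, signatures or error codes exist in Haiku; they are invented for illustration, with simplified error values:

```c
#include <stdint.h>

/* Hypothetical sketches of the two approaches above.  Names,
 * signatures and error codes are invented; real Haiku error
 * constants differ. */

typedef int32_t status_t;
enum { B_OK = 0, B_BAD_VALUE = -1 };

#define MAX_THREADS 64
static uint32_t thread_affinity[MAX_THREADS]; /* 0 == "any CPU" */

/* Approach 1: explicit affinity -- the developer pins a thread to a
 * set of CPUs given as a bitmask. */
status_t set_thread_affinity(int32_t thread, uint32_t cpu_mask)
{
    if (thread < 0 || thread >= MAX_THREADS || cpu_mask == 0)
        return B_BAD_VALUE;  /* must allow at least one CPU */
    thread_affinity[thread] = cpu_mask;
    return B_OK;
}

/* Approach 2: a workload hint passed at spawn time, which the
 * scheduler may weigh but is free to ignore. */
enum thread_workload {
    WORKLOAD_GENERAL,     /* mixed behavior (the default)  */
    WORKLOAD_CPU_BOUND,   /* dominated by computation      */
    WORKLOAD_MEMORY_BOUND /* dominated by memory traffic   */
};
```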

I don't know the first thing about performance counters, except that they exist. With this disclaimer in place, can't they be used to derive the behaviour of a thread in this regard? Like if it spent a lot of time waiting for memory or if it actually taxed the CPU with calculations.

Good idea. Had it myself a couple days ago, but had to dismiss it. Some reasons for that in no special order:

While modern CPUs offer dozens of variables you can measure, you can only measure around four at a time (depending on the CPU). This is of little relevance when optimizing programs - you just run your test code repeatedly with different measurements - but it's a significant problem when you want to examine real-time behavior from a scheduler. For really meaningful analysis and efficient scheduling (especially on SMT systems) you'd have to track x86 instructions, µops, FP/vector instructions, L1D performance, L2 performance, retired branches and mispredicted branches. That doesn't work due to the lack of PMCs: using only four of them gives you too little information to really judge a thread's behavior.

Intel's SMT CPUs (so far) have only one set of PMCs per core, so
they measure the behavior of the two threads combined. This would require
you to keep the other hardware thread free to get usable measurements. So
you'd have to isolate the software threads from time to time to measure
their current behavior. This is possible, but awkward.
You'd also have to test any combination of threads you run on an SMT core,
as any combination of threads may exhibit specific problems: L1I stalls,
L1D/L2 stalls, instruction decode overload, overload of either the integer or
FP/vector execution units, etc. This behavior depends on how well the threads share a CPU's execution resources, and you can't extrapolate it from their behavior when run in isolation; specific problems may only occur when threads run together on one SMT CPU.
This is where POWER shines; it has fine-grained measurements down to the
thread-level, and the CPU even supports the concept of priority at the
hardware thread level; so the OS can tell the CPU "run thread 1 at
priority X, thread 2 at Y". Such fine-grained control is really required
to implement this concept in a clean way, and IBM can only do it because
they sell integrated systems and have complete control over OS and CPU,
where each component can influence the other's development.
Maybe we'll get such fine control on general CPUs one day, but I haven't
heard of it yet.

Then, there is no standard for PMC events. While in general you get more
variables to examine with each processor generation, some measurements are
specific to a certain CPU or CPU generation. The measurements rarely
translate 1:1 between Intel and AMD, and of course both have unique variables.
Interpretation of cache misses depends to a high degree on the cache
architecture/topology.
The same goes for retired µops: on Intel they may be fused, on AMD they aren't
(so far); you'd have to track the number of fused µops to judge actual
performance, which again eats into the limited number of PMC registers.
And yes, you need to track µops instead of x86 instructions, as the µops tell you about the usage of execution units, which always execute single µops. Only tracking x86 instructions won't give usable information here.

If you wanted to implement PM in a scheduler, you'd have to distill all
available events down to a few you can use across all CPUs with PMCs, and
then write appropriate drivers for all CPUs.
But any new CPU may invalidate your previous design work if its PM changes
too much. And of course you can't use it on all CPUs: not all CPUs support
PM at all, some don't expose enough (or the right) PM variables, and for some you can't easily get the PMC documentation.

Oh, and the slightest error in your PM implementation (e.g. one bit set wrongly) may crash the system at once. So you have to thoroughly test any change to the PMC-reading code on *all* affected CPUs (and preferably, CPU steppings) to make sure it runs okay. This is again no problem when using PMCs during program development, where one can work around such problems as they arise, but it is a problem in OS code.

Finally, you'd need the ability to turn off PM in the scheduler to make the
PMCs available to developers wanting to test their code; the scheduler
would then just drop back to "dumb" scheduling.
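That hand-over could be as simple as an ownership flag. A minimal sketch, assuming a single claimant and ignoring locking entirely:

```c
#include <stdbool.h>

/* Hypothetical PMC ownership hand-over: the scheduler only uses PMC
 * feedback while no developer tool has claimed the counters, and
 * falls back to "dumb" load-based scheduling otherwise. */

static bool pmc_claimed_by_user = false;

/* A profiler claims the counters; fails if they're already taken. */
bool pmc_acquire(void)
{
    if (pmc_claimed_by_user)
        return false;
    pmc_claimed_by_user = true;
    return true;
}

void pmc_release(void)
{
    pmc_claimed_by_user = false;
}

/* Called by the scheduler: may it use PMC feedback right now? */
bool scheduler_may_use_pmc(void)
{
    return !pmc_claimed_by_user;
}
```

A real implementation would need this to be per-CPU and properly synchronized, but the principle is just a mutual-exclusion flag between scheduler and profiler.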

Summary: I don't see this working with the current generation of CPUs. PMCs are a fantastic tool for program development, but the way they are currently implemented (on x86) makes them unusable for general use in a scheduler, IMO. Unless we port to POWER, of course. :-)

If the threads of a specific application could benefit from a peculiar thread/CPU affinity because some threads share lots of data while others don't, fixed CPU affinity could optimize the performance, which the OS never can do having to look at threads as "black boxes".

Sure. Still, you can't completely avoid cache thrashing if your application isn't the sole app running in the system. :)

Hehe, of course. But I was thinking about the asymmetrical cache structure
of my C2Quad, with 2x6 MiB L2 cache. Using hard affinity, an application
could schedule its data-sharing threads to make optimal use of this cache
structure; this is something an OS cannot achieve, unless you can
give a really fine-tuned set of "hardware thread prefs" at thread creation.

I've considered a "pin this CPU to this thread" (as opposed to the usual other-way-around) option for the affinity settings, but it seems that Axel didn't like this and wanted the scheduler to be more automagical in this regard. The default behaviour WILL be to run a thread on the same processor it last ran on, to spawn new threads on the least-loaded processors, and to consider the load delta between the various processors as well as topology (buses, shared caches, performance score (for the AMP case)) when balancing load.
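As a toy model of that default policy, the sketch below prefers the CPU a thread last ran on unless another CPU is sufficiently less loaded. The 25% migration threshold and the flat load array are invented; a real scheduler would also weigh topology (shared caches, buses, AMP performance scores):

```c
/* Toy "sticky" CPU selection: stay on the last CPU unless migrating
 * buys a large enough load reduction to outweigh cold caches.
 * The threshold value is an invented illustration. */

#define NCPU 4
#define MIGRATION_THRESHOLD 25  /* load-percentage delta required to move */

int pick_cpu(const int load[NCPU], int last_cpu)
{
    int best = last_cpu;
    for (int i = 0; i < NCPU; i++) {
        if (load[i] < load[best])
            best = i;
    }
    /* Stay put unless the win is big enough to pay for migration. */
    if (load[last_cpu] - load[best] < MIGRATION_THRESHOLD)
        return last_cpu;
    return best;
}
```

With loads {80, 70, 75, 78} a thread last on CPU 0 stays there (the 10-point win isn't worth the move); with {90, 10, 75, 78} it migrates to CPU 1.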

Which is good general behavior for most cases. And getting this right is hard enough, considering the problems of SMT and NUMA.

Oh, and fixing CPU affinity would also allow me to write proper benchmarking tools without having to worry about core hopping, so I'm not quite neutral in this matter. ;-)

Which is why feedback is important. What kind of (behaviour-influencing, as in B_SAME_PACKAGE_BUT_AVOID_LOGICAL or what have you; not just processor masking) flags would you like to be there when you set affinity for threads on a team?

Haven't thought about it on a general level so far; my main concern was
the ability to get proper timings. :-)

After thinking a bit, I'd guess the following variables should be sufficient to cover all major use cases:

Type of instructions:
- thread is integer-heavy
- thread is FPU-heavy
- thread is vector-heavy[1]
-> and all possible combinations, probably weighted; so don't make them
flags, but numerical values

[1] Not relevant on x86 where FPU and vector operations happen in the same
pipelines, but of relevance for e.g. PowerPC which has separate FPU and
vector units.


General behavior
- thread is memory-bound
- thread is CPU-bound
- thread is I/O bound[2]
-> and all possible combinations, weighted as above

[2] i.e. external I/O; on multi-socket systems, where (AFAIK) only one CPU is connected to the southbridge, such threads should be scheduled on that CPU if possible, or on a CPU close in the topology. This obviously depends on the topology: if you have 1-hop connections across all CPUs, this is not a big problem; if multiple hops are involved, it gets more important.


Thread communication behavior:
- communicates with threads X, Y, Z
- extent of communication (per thread): light, medium, heavy
- direction (per thread): unidirectional or bidirectional


Special behavior:
- always run isolated and uninterrupted on physical core [3]

[3] Yes, my TSC and PMC measurements. :-)


I've probably overlooked something, but this should cover most basic scenarios. If the scheduler interprets these settings intelligently and flawlessly at all times, you can safely omit hard affinity.
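One possible way to pack the variables above into a per-thread hint structure; every field name and the 0-255 weighting scale are assumptions made for illustration, with 0 meaning "no information":

```c
#include <stdint.h>

/* Hypothetical per-thread scheduling hints covering the variables
 * listed above.  All names and the 0..255 weight scale are invented. */

#define MAX_PEERS 4

enum comm_direction { COMM_NONE, COMM_UNI, COMM_BI };

struct thread_sched_hints {
    /* Instruction mix, weighted 0 (none) .. 255 (dominant). */
    uint8_t integer_weight;
    uint8_t fpu_weight;
    uint8_t vector_weight;   /* separate unit on e.g. PowerPC */

    /* General behavior, same 0..255 weighting. */
    uint8_t memory_bound;
    uint8_t cpu_bound;
    uint8_t io_bound;        /* prefer a CPU near the I/O hub */

    /* Communication with other threads. */
    struct {
        int32_t peer;                 /* thread id, or -1 if unused */
        uint8_t intensity;            /* light .. heavy             */
        enum comm_direction direction;
    } comm[MAX_PEERS];

    /* Special: run alone on a physical core (TSC/PMC measurements). */
    uint8_t run_isolated;
};
```

Using weighted numerical fields rather than flags, as suggested above, lets the scheduler rank threads against each other instead of sorting them into rigid categories.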

Of course, if you manage to write such a scheduler, you'd be silly to donate it to Haiku. Take it and apply to all the major companies that do OS development; you should get a good, well-paid job out of it.

I'd rather concentrate on getting the "basic" behavior right (which already is very complicated when trying to load-balance properly on SMT NUMA systems), and just offer some very basic flags, e.g. for I/O-heavy threads. Trying to write the "perfect" scheduler is not a feasible undertaking IMHO. The number of variables you'd have to correlate to achieve anything like good scheduling for complicated cases is just too high, and doing the necessary calculations gets really expensive the higher your thread count (hardware and software) gets. And seeing that 16-thread CPUs will be available on the high-end desktop next year (Sandy Bridge), the hardware thread count will likely rise to 32-64 by 2014 or so; for multi-socket systems, 64-256. Keeping the scheduling simple will be important on such systems, I think. However, I've never written a scheduler, so maybe I'm wildly overestimating the problems. Any details on this would be welcome.

Christian
