[haiku-development] [RFC] Proposal for the new Haiku scheduler

  • From: "André Braga" <meianoite@xxxxxxxxx>
  • To: haiku-development@xxxxxxxxxxxxx, openbeos@xxxxxxxxxxxxx
  • Date: Sun, 10 Jun 2007 09:44:16 +0300

INDEX:
PREFACE
INTRODUCTION
 (Or: how I got here)
PART I
 (Or: rose-coloured glasses are both the blessing and the curse of being in love)
PART II
 (Or: what you learn after a number of false starts)
   Section A: Inspiration
   Section B: Design decisions
PART III
 (Or: what we have so far, and what the future holds)


PREFACE

Hi, guys! Sorry about the silence. I got a little busy at school, and
had to (mostly) sort out some issues I found with several approaches I
tried. This is the result of roughly a month's worth of deep
introspection, and letting go of some of my personal beliefs and
wishful thinking to come up with something truly powerful.

The general guidelines for the algorithms and data structures used
were, in order of importance: effectiveness, succinctness, generality,
choosing appropriate data structures instead of micro-tweaking the
algorithms, extensibility.

This is going to be a little long, written in a somewhat
"educational" style, and I'm also telling the whole story of how I got
here; that's because I intend to break it down into smaller parts
to be posted as blog entries :) If you'd rather cut to the point,
scroll down to PART II.

If you don't, I hope you find this text entertaining :)

At the risk of sounding too much like the Poignant Guide (minus the
cartoons), here I go ;)


INTRODUCTION
(Or: how I got here)

The story goes like this: my first contact with BeOS was when I was
still attending high school. Unfortunately I didn't experience the
Golden Years, namely the R4 era. My first real contact with BeOS was
with R5 PE; then I bought the Gobe Productive + R5 Pro bundle, and
BeOS was my main OS for quite some time on my Pentium 133. Ah,
nostalgia :)

I'm a Computer Science student now. Circa 3 years ago I designed a
thread scheduler as an assignment for the Operating Systems class. It
fared pretty well in artificial benchmarks with purely CPU-bound
simulated threads. I was excited. Seemed like I did something really
cool :D

I contacted Axel right after that semester finished and told him about
the scheduler and asked him if there was interest in using it on
Haiku. He was pretty open-minded, and I promised him I'd translate the
documentation written for that assignment into English. I never did.

Why? Well... I burned out. College was eating most of my energy. I was
just getting out of a troubled relationship. And it all coincided with
a hard drive failure. I lost almost no data, but the excitement was
just gone. I used to be quite active in BeShare and IRC, even compiled
stuff on my old trusty Pentium 133... (if Fyysik is reading this,
he'll remember that Pentium immediately ;))

Anyway, the scheduler plans were on the back burner for roughly the
mentioned 3 years. And you'll remember that before we got accepted into
GSoC I mailed the list about ticket #1069
(http://dev.haiku-os.org/ticket/1069), asking if I could still
participate. Then I spent some days digging up the sources for the
scheduler and re-doing some benchmarks on my current computer. Things
still looked nice :)



PART I
(Or: rose-coloured glasses are both the blessing and the curse of being in love)

So, now I set out to improve my original algorithm's performance even
further. But that's just because I *knew* it sucked. It was extremely
inefficient as far as implementation goes; it looked great in the
benchmarks because it was being compared to O(n) algorithms, while it
was O(1) itself, so I already had a head-start, so to speak. But I
knew the algorithm very well and understood that there were plenty of
bottlenecks to fix.

Well, things started to look great as I progressed. I eliminated a
nested loop, then I eliminated an outer loop, then I benchmarked a
little more, and I reached extremely small overhead (only 8x slower)
compared to the theoretical lower bound, i.e., a standard "for" loop
that reads sequentially from an array. But then something hit me.

My algorithm made no distinction between absolute priority levels,
only relative.

If you read the description of set_thread_priority() and
suggest_thread_priority() in the BeBook, you'll realise that there
should be an attempt to correlate priority levels and the frequency of
scheduling a given thread. This means, at the very least, that if
I have two threads in the system, one with a higher priority than the
other, the higher-priority one shouldn't, for example, always be
picked 66% of the time while the other one is picked 33% of the time.
If said priorities are, say, 20 and 10, then those numbers are OK (2:1
ratio). But if one has a priority of 40 and the other a priority of
10, the ratio at which they're chosen should be 4:1. My algorithm only cared if
one had a higher priority than the other, and tended to pick ratios
such that the highest priority one got the largest share, then it
diminished progressively as levels became lower. If there were 3
threads in the system with different priority levels, the ratios would
be 50:33:16; 4 threads, 40:30:20:10, and so on.

I felt that this is just wrong.

And then I resorted to nasty workarounds.

First, I tried adjusting the ratios with an exponential function. And
although the ratios actually improved, things started to feel way too
hackish.

You know what? I was so infatuated with my own algorithm that I didn't
see how plainly flawed it was.

So, I admit: I spent nearly 3 weeks polishing a turd.



PART II
(Or: what you learn after a number of false starts)

Section A
 Inspiration

After I admitted the basic flaws of my previous attempts, I decided to
go back to the basics and use a bottom-up approach to come up with
solutions to the needs of the Haiku scheduler: fairness,
responsiveness, scalability, succinctness and efficiency.

The BeBook already hints at what the priority numbers should roughly
mean: frequency of scheduling. We have 120 priority levels; the highest
20 are reserved for real-time threads.

So while I was studying classic scheduler approaches in the Minix
book, I noticed that the description of lottery-based scheduling was
very compatible with that description of priorities meaning
approximate frequency of being chosen by the scheduler. Lottery
scheduling was proposed by Carl Waldspurger in 1994, and it works
roughly like this:

Suppose you have some resource that is limited (like CPU power, I/O
bandwidth, network bandwidth, and so on). Suppose you have several
processes competing for those resources, and they have different
priorities. Now, suppose you have a number of tickets that correspond
to a share of time some process can make use of those resources. If
you give many tickets to a higher priority process, it will be allowed
to use those limited resources proportionally to the number of tickets
it holds. So a process with priority 20 gets 20 tickets, a process with
priority 10 gets 10 tickets, and so on.

Like in a lottery, a ticket is picked at random, and the process with
the winning ticket is allowed to use the resource. As time goes by,
the processes will have used the resource proportionally to the number
of tickets they hold.
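
Here's a minimal C++ sketch of the idea (the Task struct and the
HoldLottery name are mine, just for illustration; this isn't
Waldspurger's actual formulation):

#include <cstddef>
#include <cstdlib>
#include <vector>

struct Task {
    int priority;    // == the number of tickets this task holds
};

Task* HoldLottery(std::vector<Task>& tasks)
{
    int totalTickets = 0;
    for (size_t i = 0; i < tasks.size(); i++)
        totalTickets += tasks[i].priority;
    if (totalTickets == 0)
        return NULL;

    // Draw one ticket at random, then find the task whose ticket range
    // contains it.
    int winner = std::rand() % totalTickets;
    for (size_t i = 0; i < tasks.size(); i++) {
        winner -= tasks[i].priority;
        if (winner < 0)
            return &tasks[i];
    }
    return NULL;    // not reached as long as totalTickets > 0
}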

This is very nice in theory, but there are some problems. The main
problem is the randomness: the first several rounds of scheduling will
resemble anything but the proportion we'd like to attain. We only
observe the desired ratios in the long run.

A possible solution to this was proposed by Waldspurger himself in
1995, when he introduced stride-based scheduling. In a nutshell,
he replaced random lotteries with strides and passes. Every process
has a property called "pass" and a property called "stride". The stride
is inversely proportional to the priority. Each time a process is
chosen, its pass is increased by its stride. The scheduler always picks
the process with the lowest pass.
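
In code, the pass/stride bookkeeping could look roughly like this
(again a sketch with names of my own; kBigStride is just an arbitrary
scaling constant so that strides stay integers):

#include <cstddef>
#include <vector>

struct Task {
    int  priority;    // tickets
    long pass;        // "virtual time" this task has consumed so far
    long stride;      // inversely proportional to the priority
};

const long kBigStride = 1 << 20;    // arbitrary scaling constant

void InitTask(Task& task, int priority)
{
    task.priority = priority;
    task.pass = 0;
    task.stride = kBigStride / priority;
}

// Pick the task with the lowest pass, then advance it by its stride.
Task* PickNext(std::vector<Task>& tasks)
{
    Task* next = NULL;
    for (size_t i = 0; i < tasks.size(); i++) {
        if (next == NULL || tasks[i].pass < next->pass)
            next = &tasks[i];
    }
    if (next != NULL)
        next->pass += next->stride;
    return next;
}

Note the linear scan in PickNext: that's essentially the
linear-complexity concern I discuss below.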

The second problem lies in the mapping between tickets and processes.
Searching for a ticket by visiting every process is very inefficient.
We could use hash tables, or trees, to do this indexing. I'll tackle
this shortly.

The third problem is that the algorithm as expressed in Waldspurger's
work seems to have linear complexity regarding the number of processes
in the system. One of the requirements of Haiku's scheduler is that it
should have constant (O(1)) complexity regarding the number of
processes (threads, in our case). One possible workaround is to have
what Waldspurger called "hierarchical lottery/stride scheduling", but
I devised a much simpler expression of the same idea. Read on, this
will be tackled shortly, too.

The fourth and worst problem is that both lottery- and stride-based
scheduling are great for processes that want to use as much of the
limited resource as possible during their time share. In the case of
CPU power, those would be CPU-bound threads. BeOS and Haiku are
desktop operating systems, where interactive threads are prevalent.

But wait, all is not wasted ;) There are solutions for all those
problems. I have come up with some twists that I'd like to share with
you.

Even if stride scheduling as expressed in
Waldspurger's thesis may be less than optimal for interactive tasks
(i.e., it doesn't approach shortest-job-first scheduling, which is
proven to provide the best responsiveness), I have no doubt that the
idea of using strides results in perfect fairness when all tasks are
equal. One can actually prove this mathematically. Not necessarily his
stride + pass approach, though. Simple strides will do.

For example, for modeling purposes, suppose you have an array
where you store a task T as many times as T's priority (let's call
this the temporal frequency property, or TFP). So let task A have
priority 1, task B have priority 2, and task C have priority 3. One
possible array where the TFP holds is
A B B C C C
but so does any permutation of this array. If you multiply the
occurrences of every element by a constant factor, for example 2, the
array becomes
A A B B B B C C C C C C
the TFP still holds for this array, as well as for any possible
permutation of it.

It turns out that for any array for which the TFP holds, you can
reduce it to the "trivial" array by ordering the array and "scaling"
it so that every element T appears T.prio /
greatest_common_divisor(A.prio, B.prio, C.prio, ...) times.
Alternatively, it's enough to calculate the stride using the trivial
array, and then scale the stride using the GCD.

Nice, so now we have an invariant :) So let's keep the trivial array,
A B B C C C.

Notice that any strided traversal of this array, modulo the size of the
array (which is identical to the sum of all priorities),
where the stride is coprime to the size of the array (let's call this
the Good Stride Property), will return a given task with a frequency
proportional to the number of times it appears in the array, divided by
the size of the array (i.e., the sum of the priorities of all tasks).

So in our example A will appear 1/6 of the time, B 2/6 (1/3) of the
time, and C 3/6 (1/2) of the time. For ANY stride where the GSP
holds.

Stride 1 (trivial case, coprime, GSP holds): traversing the array
outputs A, B, B, C, C, C, A, B, B, ...
Stride 2 (not coprime, GSP doesn't hold): would output A, B, C, A, B, C, ...
Stride 3 (not coprime, GSP doesn't hold): would output A, C, A, C, A, C, ...
Stride 4 (not coprime, GSP doesn't hold): would output A, C, B, A, C, B, ...
Stride 5 (coprime, GSP holds): A, C, C, C, B, B, ... OK.
Stride 6 (not coprime).

(well, unfortunately I picked a terrible example... 6 is an awful
number for the array size, since only 1 and 5 are coprime to it, and the
modular distance of both 1 and 5 to 6 is 1...)

Anyway, all it takes is respecting the GSP. After size_of_array
scheduling rounds, the correct proportions will emerge by
construction.
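
For the curious, here's a tiny self-contained C++ sketch of the strided
traversal, tallying how often each task is returned (the Gcd helper is
only there to check the GSP; everything else is just the example above):

#include <cstdio>

static int Gcd(int a, int b)
{
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

int main()
{
    const char tasks[] = { 'A', 'B', 'B', 'C', 'C', 'C' };
    const int size = 6;      // == sum of all priorities (SoP)
    const int stride = 5;    // must be coprime to size for the GSP to hold

    if (Gcd(stride, size) != 1)
        std::printf("GSP violated: pick another stride\n");

    int counts[3] = { 0, 0, 0 };
    int index = 0;
    for (int round = 0; round < size; round++) {
        counts[tasks[index] - 'A']++;
        index = (index + stride) % size;
    }
    // After 'size' rounds the TFP proportions emerge: A=1, B=2, C=3.
    std::printf("A=%d B=%d C=%d\n", counts[0], counts[1], counts[2]);
    return 0;
}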

Actually, picking a good stride ensures that the distribution of the
tasks in time is more even, but there's no algorithm that will
find a stride that displays this evenness property in every
possible case, because such a stride might not even exist. The example
above is precisely one counter-example that proves this assertion. But
there are heuristics available to pick a good stride whenever
possible.

Now, what if we add a new task? Should we abandon the current strided
traversal and start anew? Well, I still haven't decided that yet. So far
I have considered starting anew but taking the opposite direction
(subtracting strides if they were being added in the previous
scheduling round, and vice-versa), with a different stride, starting
from a random position. This should eventually converge to a fair
distribution. Simply finishing the current strided traversal until it
completes size_of_array rounds and ignoring new tasks would be *terrible*
for responsiveness when the number of tasks grows. I believe using the
rule of thumb I described as a compromise should be reasonable enough,
but I'm open to suggestions.

Now, let's forget the array abstraction for a little. Time for
something more concrete :)



PART II
Section B
 Design decisions

Suppose you're starting with a clean slate, and all you have so far
are the tasks and their respective priorities. Now let's scribble a
little on that slate :)

Remember our application domain: scheduling threads. The
implementation I have considered so far looks like this:

We would have several queues, one per priority. There would be 120 of
those on Haiku.

Let's make a little differentiation here: 20 of those are reserved for
real-time threads. The other 100 are for non-RT threads. So let's
separate the scheduling job into two "realms": RT and non-RT.

For the sake of simplicity, whenever RT threads are ready, they're
chosen, in priority order. That simple. Now let's take care of the non-RT
realm:

We only care about "active" queues, i.e., those which have queued
threads. Most times, however, we don't need allocate space for every
queue; using more than 10 queues is actually pretty rare. If we decide
that SMP should be implemented with per-CPU queues* (all 120 of
them... times # of CPUs!), wasting this space makes me feel somewhat
sloppy. Same thing with hash tables: they'll be either
space-inefficient or too small to bring any real benefits.
(*) Read on! ;)

Axel told me not to worry about this, though, and that it's better to
be on the safe side and preallocate all the memory we should ever
need. If we ever feel the need to manage that memory dynamically, for
instance if we end up needing to support hundreds of cores, we can
adapt the algorithm easily.

Anyway, under any circumstances we'd better come up with a dynamic,
compact and versatile data structure. Well, arrays are compact, we
could simply place the queues where they belong and use a bitmap to
control active/inactive queues.

But that's not very cache-friendly. We could place the queues
all together at the front of the array, then. But now we must take
into account the fact that queues can come and go anytime, and I feel
that insertions in the middle of the array will be way more common
than insertions at the end. Insertions will then require moving up to
O(q) queues. We could potentially use some memcpy tricks, but I don't
know if it's worth the effort. Not to mention that if we ever go the
dynamic memory management route, memory fragmentation is almost
certain to happen, as we would allocate and deallocate memory every
time a queue is destroyed and almost every time a queue is created and
must be inserted into the array, since we'd like to keep the structure
compact. Hugo's slab allocator would come in handy, but... Well,
abusing it to implement resizing arrays, in kernel land, in a thread
scheduler where insertions and removals happen frequently, well, that's
still a hack. And an ugly one which I'd rather avoid RIGHT NOW.

Linked lists are less of a hack, but finding the insertion point still
takes O(q). Eeeh...

So... Trees? Well, trees should be OK. Red-black trees are nice, as
they're self-balancing and will keep the cost of any
operation O(lg q); in the worst case we'd have 100 non-RT active
queues, so any operation (insertion, deletion, search) is always
going to be bounded by about 7 steps (lg 100), which is effectively
O(1). Nice!!

Now how can we relate that to strides, so we have fair scheduling?

Suppose you keep track of the sum of the priorities of all the queues in
a variable, let's call it SoP. Let variable Stride be... the stride.
All the array ever did was to map the numeric index of a position to
the corresponding task. So index 1 mapped to A, indexes 2 and 3 mapped
to B, and indexes 4 to 6 mapped to C. If we can emulate that with just
SoP and Stride, we can do away with the array altogether. Well,
producing numeric indexes (i.e., tickets) is completely trivial
(keeping the indexes 1-based):

index = ((index - 1 + Stride) % SoP) + 1

Mapping them, however, is a little tricky.

Let's go back to the array model. Instead of placing tasks multiple
times on the array, we could store triples comprised of:
(task, offset, priority)
our array would become
(A, 0, 1), (B, 1, 2), (C, 3, 3)

Notice that offset(K) = offset(predecessor(K)) + priority(predecessor(K))

In our example, index will vary from 1 to 6, and mappings would work like this:
1. Let T be the first node.
2. If T.offset < index <= (T.offset + T.priority), then return T.
3. Else, let T be the next node, and go to step 2.

That's very naive, but would work.
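
In C++, the naive mapping over a linked list of those triples could
look like this (the node layout and names are just illustrative):

#include <cstddef>

struct Node {
    char  task;       // 'A', 'B', 'C', ...
    int   offset;     // sum of the priorities of all predecessors
    int   priority;
    Node* next;
};

// Steps 1-3 from above: walk the list until the index falls into the
// range (offset, offset + priority].
Node* MapIndexToTask(Node* head, int index)
{
    for (Node* t = head; t != NULL; t = t->next) {
        if (t->offset < index && index <= t->offset + t->priority)
            return t;
    }
    return NULL;    // only happens if index lies outside 1..SoP
}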

We have two options here. Either we store computed offsets within the
nodes, or we compute them on demand.

Drawback of the first option: upon insertion, we must adjust the offset
of every node ahead of the insertion point.
Drawback of the second option: we must recursively visit every predecessor
node to compute the offset.

A little observation here couldn't hurt: we don't have to keep the
nodes ordered! We can always insert at the tail! So computed offsets
are the way to go!

Eeeh... Not so fast. Consider deletions. They will FREQUENTLY occur
in the middle of the list. Worse: they will frequently occur at the
head as well. So we still need to propagate offset deltas to successor
nodes. That's still O(q). So not keeping the order buys us
very little; we now have two O(q) operations instead of three.

But we have already decided to use balanced binary trees so that every
operation is O(lg q) and therefore bounded to O(7) in our case,
haven't we? I'm glad we did!

Why?

Well, consider this 4-tuple:

(PQ, offset, priority, # of threads)

This 4-tuple differs from the (task, offset, priority) triple in that
step 2 of the mapping function now becomes (variable names
adjusted accordingly):

2. If PQ.offset < index <= (PQ.offset + PQ.priority * PQ.#threads),
then return PQ.

And the offset rule now becomes

offset(K) = offset(predecessor(K)) + priority(predecessor(K)) *
no_of_threads(predecessor(K))

Why would I do that? Well, we store the queues, not the threads, so
the tree is kept compact. Threads are picked from the queues in FIFO
order; that will naturally give us round-robin scheduling when we
consider the queue in isolation. "Boosting" the queue's priority by
multiplying it by the number of threads it holds has the exact same
effect of using the threads as nodes, but it's much more space
efficient, not to mention how it avoids messing with the tree on every
insertion and removal of threads: if there is a corresponding queue in
place already, the tree is largely untouched.

"Largely", you say? Yes. The only reason it is not completely
untouched is that we must take the offsets into account.

Ah... Those damned offsets. But I managed to avoid both
explicitly propagating them to the successors and recursively
recomputing them from the predecessors. How? Using one of the oldest
tricks in computer science: lazy evaluation!

(Well, here I can attest that knowing some functional programming and
artificial intelligence techniques really DOES pay off when you least
expect it ;))

First, we must assert that there won't be bogus deletions, i.e., we
won't try to remove nodes that are not on the tree. Well, easy enough:
removals only happen when there's only one thread in the priority
queue, and we're removing that thread from the queue. So now the queue
is empty, and this is the only situation that triggers a removal
operation.

We must also ensure that when you insert a node, it pushes every
successor to the right by node.priority units. This is the same as adding
node.priority to every successor's offset component. The key is to
postpone pushing nodes around for as long as we can.

In order to use lazy evaluation, we must augment our node from a
4-tuple to a 5-tuple: (PQ, offset, offset delta, priority, # of
threads). Now every time we insert a node, we add three extra steps to
the insertion routine:

a. if the current node's offset_delta field is non-zero, add it to the
*offset* field of the current node. Also add it to the *offset_delta*
field of the left child node (when present) and the right child node
(when present), and then reset the current node's offset_delta to zero.
b. if the new (to be added) node will follow the path to the left
child (i.e., has a lower priority than the current node), add
new_node.priority to current_node.offset and
current_node.right.offset_delta.
c. when you find the point of insertion of new_node, remember to
compute new_node.offset using the offset rule.

Step (a) must be repeated in the search operation and in the deletion
operation as well. Step (b) is not required when searching, but when
deleting, it changes a little: you must now subtract
node_to_be_deleted.priority from current_node.offset and
current_node.right.offset_delta. Step (c) is only required when
inserting. That's for inserting, searching for and removing whole
queues. The adaptations necessary to insert threads into queues are
left as an exercise for the reader ;)
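
To make steps (a) and (b) more concrete, here's a rough C++ sketch of
the node layout and the lazy delta push-down. ThreadQueue is just a
placeholder, and the actual red-black rebalancing code is omitted; this
is only how I picture the bookkeeping, not final code:

#include <cstddef>

struct ThreadQueue;    // the actual FIFO of threads, defined elsewhere

struct QueueNode {
    // the (PQ, offset, offset_delta, priority, # of threads) 5-tuple
    ThreadQueue* queue;
    int          offset;        // materialized offset of this queue
    int          offsetDelta;   // pending correction, applied lazily
    int          priority;      // base priority of this queue
    int          threadCount;
    QueueNode*   left;
    QueueNode*   right;
};

// Step (a): before using a node during insertion, search or deletion,
// fold its pending delta into its offset, push the delta down to its
// children, then clear it.
void ApplyDelta(QueueNode* node)
{
    if (node == NULL || node->offsetDelta == 0)
        return;
    node->offset += node->offsetDelta;
    if (node->left != NULL)
        node->left->offsetDelta += node->offsetDelta;
    if (node->right != NULL)
        node->right->offsetDelta += node->offsetDelta;
    node->offsetDelta = 0;
}

// Step (b): while descending to the left during an insertion, the new
// queue will end up before 'node', so 'node' and its whole right
// subtree must shift to the right -- lazily, of course.
void ShiftRightLazily(QueueNode* node, int newPriority)
{
    node->offset += newPriority;
    if (node->right != NULL)
        node->right->offsetDelta += newPriority;
}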



PART III
 (Or: what we have so far, and what the future holds)

Cool, now we have a space- and time-efficient data structure to
handle the stride scheduling thing. Not only that, but the same
reasoning applies to *any* trees, should we decide to use splay trees,
AA trees, AVL trees, unbalanced binary trees, whatever. So far we took
care of the (not anymore) skewed proportions due to randomness (by
using strides), the mapping of tickets to tasks (by using trees,
"indexes" as the tickets, offsets to enable using indexes, and lazy
evaluation to efficiently implement offset propagation to the successor
nodes), and the order of complexity (by using the queues as the node
elements of the tree, where the key is base priority times number of
threads in that queue).

(BIG Sidenote: notice that hashtables that map index numbers to queues
won't really work: we're dealing with ranges, and you really don't
want to add a hundred tickets in the hashtable every time a thread
with priority 100 is spawned...
For the sake of clarity, let me re-state the problem by using an
analogy: it's like having an arbitrary number of sticks of different
sizes that are multiples of the same unit, and you arrange them in a
circle. Now you begin at an arbitrary point and walk R unit-large
steps. Which stick will you be standing next to when you finish
walking?
The problem lies in expressing the following facts: the sticks can be
placed arbitrarily, but once you lay them down, they must be kept in
that order at least until a (new) stick is removed from(/introduced
into) the circle; the number of steps you take depends on how big the
circumference of said circle is; the circumference will evidently
change when you change the number of sticks; after you finish taking
size_of_circumference walks, you MUST have stood beside a given stick
size_of_stick times)

Now, responsiveness is another matter entirely, and I intend to take
the "easy" way out: forget about the cap on priority levels.

First, there are 100 "regular" priority levels and 20 real-time
priority levels. As mentioned, RT levels would be scheduled
separately, and only when RT queues are active (evidently). They
should be scheduled unconditionally when active, and in priority
order. So strides will never touch RT queues. Keep this in mind: RT
realm first, non-RT realm later.

Axel mentioned he'd like to have some mechanism to prevent starvation
of non-RT queues. I believe something as simple as randomly skipping
RT scheduling 5% of the time or something like that should do it.
Travis mentioned that he'd rather kill an RT thread that consumes a
whole quantum in a scheduling round. He has a point there: RT threads
should never run away, ever. I'd like to know what you people think.

Now that RT is (mostly ;)) out of our way and scheduled completely
apart from "regular" threads, there's absolutely NOTHING stopping us
from using any "regular" priority level that's way beyond 100. The 120
priority numbers are simply part of a protocol between userland and
the kernel. They are only numbers. There's nothing (other than sanity)
that prevents us from internally numbering levels counting from 39254,
or having them spaced by 50 units. This means we really don't need to
care if we artificially inflate those numbers to well beyond what
user-land would consider part of RT-reserved levels if we never expose
the internal numbering scheme to user-land: we have already taken care
of the RT-realm.

Thus, we could practically force any thread to run by simply placing
it in a "fictitious" priority level (queue) that results from adding,
say, 1000 to that given thread's base priority level. This way, if we
detect that an interactive thread just woke up after receiving input
from the user interface (using some heuristics, like those found in
FreeBSD's ULE scheduler), we could boost it through the roof so it
will (almost) certainly run next (unless a thread with even higher
priority also received this "mega-boost"). Notice that:

* mega-boosts *do not* interfere with RT scheduling; those are
separate scheduling realms
* they are only good for a single scheduling round; after the boosted
thread gets scheduled, it should be placed right back into its
"native" queue, unless it blocks once again. Thus, those boosts should
be completely transparent to the thread.
* they should *only* be used to boost interactive threads, preferably
those associated with the GUI; boosting CPU-bound "batch-like" threads
is a sure way to have crappy responsiveness.
* the whole mega-boosting thing is effectively an expression of
"scheduling classes". This way, we can transparently prioritize
certain patterns like interactive behaviour, and let CPU-bound threads
be (quasi-optimally) handled by the variant of stride scheduling
previously described. We can thus quite successfully emulate
shortest-job-first scheduling, which is ideal but impossible to achieve
in a general setup! And we don't need to stop here:
* I mentioned in a previous message that I'd like to "brand"
semaphores with the subsystems they belong to, like GUI, disk I/O,
network I/O, HID I/O and so on. So in the future, if we get to brand
semaphores with the subsystems and purposes they serve, this boosting
mechanism (i.e., implicit scheduling classes), possibly even coupled
with variable quanta, could be extended to shape loads in any way we
desire, be they I/O loads, CPU loads, network loads etc. And
transparently so: the application developer doesn't need to explicitly
guess how to parametrize his threads, or even create separate threads
to perform trivial operations because they could otherwise
mischaracterize another thread's behaviour. The thread will *always* be
adequately responsive depending on the code path it follows. Cool, huh?
One obvious application that comes to mind is giving time-sensitive
multiplayer games like FPS, RTS and so on quality-of-service-like
reserved shares of resources such as network bandwidth, CPU time and so
on, while you have dozens of torrents happily downloading and any
number of distributed computing projects crunching in the background.
And your framerate remains undisturbed.

The sky is the limit.

In a nutshell: full-fledged scheduling classes, transparent to the
application developer. Always doing the right thing depending not on
explicit hints, but actual behaviour. Load shaping based on whatever
we wish, tweakable by user-land with the simple and familiar "priority
level" interface, but extended to well beyond just CPU priorities.

Now, for SMP... Well, that definitely deserves some further
discussion. I've already hinted that I'm almost set on local, per-CPU
queues, and, if possible, every CPU should schedule its threads
independently. Having a single "scheduling time" that is synchronized
with other CPUs is pretty sub-optimal. So I'd like to disturb CPUs as
little as possible, by reducing IPIs and cache line syncs.

Let's consider the direction technology is heading. Intel is
suggesting their future multicore CPUs will allow disabling select
cores and overclocking the active ones when load is low. We could have
threads spawned on any (enabled) CPU/core (from now on I'll use the
term CPU to mean "processor unit", be it a core on a multicore chip,
or traditional multi-socket setup. I know this generalization is not
optimal when IPI and caches enter the equation, but read on!), and
each CPU is responsible for balancing the load. If one CPU "sees"
another that is less loaded, it migrates a thread to that CPU. Thread
migration could be implemented by having per-CPU "incoming" queues.

CPU F would move a thread into CPU G's incoming queue, and every
time CPU G runs the scheduler, it takes (one? all? some?) thread(s)
from this queue and puts them in the appropriate,
priority-level-related queue. Such incoming queues should be protected,
of course, either non-blocking (lock-free) or mutex-protected.
Alternatively, there could be
"outgoing" queues instead, and "idling" processors could actively
fetch those left-behind threads... I'd like to know what you think.
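
A rough sketch of the incoming-queue idea (names are mine, and I'm
using std::deque and std::mutex purely for illustration; the kernel
would of course use its own primitives, possibly a lock-free queue):

#include <deque>
#include <mutex>

struct Thread;

// Hypothetical hook: insert the thread into this CPU's proper
// priority-level queue.
void EnqueueLocally(Thread* thread);

struct CpuSchedulerState {
    std::deque<Thread*> incoming;       // threads pushed here by other CPUs
    std::mutex          incomingLock;
};

// Called by CPU F when it sees that CPU G (the target) is less loaded.
void MigrateThread(CpuSchedulerState& target, Thread* thread)
{
    std::lock_guard<std::mutex> lock(target.incomingLock);
    target.incoming.push_back(thread);
}

// Called by CPU G at the top of its own scheduler run: drain the
// incoming queue into the local priority-level queues.
void DrainIncoming(CpuSchedulerState& self)
{
    std::lock_guard<std::mutex> lock(self.incomingLock);
    while (!self.incoming.empty()) {
        Thread* thread = self.incoming.front();
        self.incoming.pop_front();
        EnqueueLocally(thread);
    }
}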

Regarding affinity, well, using the apparatus just described,
implementing it becomes surprisingly easy. With per-CPU queues, threads
should not be migrated unless load is unbalanced and said threads
don't have an "affinity flag" set. Affinity flags could come in several
flavours: migrate as you wish (i.e., not set), do not migrate, migrate
only to CPUs in a given CPU pool, and so on.

"Hey!! CPU pools??"

Remember when I mentioned that generalising processor units was a bad
idea? It's because THIS is what I actually had in mind. Using the CPU
pool abstraction is very powerful, because it allows us to encapsulate
things like multi-cores and multi-sockets, SMT/Hyperthreading,
shared/discrete caches and so on, behind certain CPU pool profiles.
The CPU affinity API should be designed in such a way that exploiting
those incarnations of SMP is transparent and close to optimal, without
having to resort to the number of exceptions and special cases that
abound in Linux's and FreeBSD's schedulers.


OK, those last paragraphs are surprisingly small for such hairy
themes. Well, let me know what you think, and feel free to correct me
and ask for further clarification. As I mentioned, I intend to post
blog entries based on this message, so I tried to be as educational as
possible ;) And your criticism is greatly appreciated.


Thank you for your time!
A.
