[openbeos] [FAQ] FAQ #1: CFS and Haiku's new scheduler comparison

  • From: "André Braga" <meianoite@xxxxxxxxx>
  • To: haiku-development@xxxxxxxxxxxxx, openbeos@xxxxxxxxxxxxx
  • Date: Mon, 11 Jun 2007 15:29:26 -0300

Hi, folks!

I got a few personal emails since I described my plans and approach
for the new Haiku scheduler. I'll let the authors remain anonymous,
because I didn't ask them for permission to reproduce any part of
their emails, but I thought it would be a good idea to address this
question right away.

The most frequently asked question is, *SHOCK SURPRISE*, how does my
proposed scheduler compares with the Completely Fair Scheduler
recently introduced on Linux. Or, how one of the anonymous (hey, you
could come out and say your name if you feel comfortable with that!)
put it:

"note, I didn't read your mail, so I don't know how well your
scheduler would work for us. Also, I don't know anything about
schedulers. I just wanted to ask, what do you think about the
Completely Fair Scheduler (CFS)? Do you think your scheduler
suggestion will run comparably well on a desktop system? Personally, I
wouldn't mind having the CFS reimplemented for Haiku if it's the best
solution. Since you want to drop your first scheduler design I thought
it might be worth asking about existing solutions."


First and foremost, the very necessary disclaimer: I don't want to
start any kind of religious war here or brag about how one OS got it
right, how one researcher got it right, how one hacker got it right.
Nor is my intention to hurt anyone's egos. Specially not mine. =P

And I guess I should address FAQ #0 first:
"Why did you send that email in the first place?"
My message was truly a "Request For Comments" RFC, not the
traditional, numbered, "For Your Information" RFC. Else, I'd tag it
with [FYI] ;)

I want your feedback. And in order to receive any, people must at, the
very least, read it! Or skip to the interesting parts. Or skim over
the whole enchilada until something catches one's attention.
*Anything* is preferable over not reading a single line and pointing
me to whatever is happening on the other side of the fence. I'm
OBVIOUSLY aware of what is happening there. I obviously "rejected"
copying it FOR A REASON. I'll return to this shortly.

I understand that my email was VERY LARGE (almost 32k worth of text),
but *please*, I urge people to *spend some time reading it*. I
definitely spent much more time and energy writing it (remember
English is not my native language) than most people will spend reading
it.

I'm taking no offense, but please, *read the RFC*. I want your feedback.


Now, back to FAQ #1 ;)

FAQ #1 asks how does it compare with the CFS.

The first thing that must be understood is that *no* Linux scheduler
or BSD scheduler or [whatever traditional multi-user environment]
scheduler can be used as a drop-in replacement for BeOS-style
schedulers. Simply because BeOS doesn't constrain higher priority
levels to root-belonging threads. The BeBook clearly correlates
priority levels with frequency of scheduling. This alone is enough
reason not to adopt *any* Linux scheduler verbatim: Linux threads are
all created equal and their priorities have no place to go but down,
unless you're root. That's what hampers their desktop responsiveness.
That's why they recently resorted to in-kernel renicing of threads
that behave like the X11 server. That's why most traditional
schedulers must mess with boosting, aging, and so on. And that's why
they don't really fit the BeOS model.

Not that the BeOS model is The Right Thing(TM) and it certainly
becomes an issue when multi-user comes into question, but that's
beyond the scope of R1 anyway.

*My target is R1*, but rest assured, there's nothing in the basic
design I'm using that makes it impossible to be adapted to multi-user
requirements.

For the sake of completeness, let's now suppose all BeOS threads were
equal and userland apps had no way to set their thread's priorities up
based on educated guesses about the thread's behaviour.



The hard, cold fact is: Ingo Molnár would have saved himself a *lot*
of time if he at least skimmed over computer science publications on
the task scheduling subject from, say, 12 years ago... Heck, he didn't
even need to go as far as reading ACM and IEEE periodicals, or their
online indexes and abstracts and then searching for freely available
versions of those publications; checking the relevant bibliography
from Tanenbaum's "Operating Systems Design and Implementation", 2nd
edition, would have set him on the same direction more than a decade
ago. But given Linux devs chasm with Tanenbaum, that's not really very
surprising...



Look, this is not me being biased because I'm a computer scientist.
It's not that I'm favouring the *BSD "publication-based" development
model stemming from their academic roots, or alleged higher standards
of commercial UNIX offerings, or anything like that.

Plenty of people have criticised the recurring NIH syndrome of Linux
kernel developers. And, really, the CFS is yet another victim of core
Linux devs dissing whatever happens outside the LKML.

Molnár's "Completely Fair Scheduler" is a re-expression of Carl
Waldspurger's "Stride Scheduler", which was published in his
ACM-award-winning, 1995 Ph.D. thesis. Yes, the same "Stride Scheduler"
that is an extension of his 1994 work, "Lottery Scheduling", that is
prominently mentioned on Tanenbaum's book.

Yes, the same "Stride Scheduler" that served as the inspiration for my
scheduler.


Here's part of Ingo Molnár's own description of CFS (2007)(1):

"The rq->fair_clock value tracks the 'CPU time a runnable task would
have fairly gotten, had it been runnable during that time'. So by
using rq->fair_clock values we can accurately timestamp and measure
the 'expected CPU time' a task should have gotten. All runnable tasks
are sorted in the rbtree by the "rq->fair_clock - p->wait_runtime"
key, and CFS picks the 'leftmost' task and sticks to it. As the system
progresses forwards, newly woken tasks are put into the tree more and
more to the right - slowly but surely giving a chance for every task
to become the 'leftmost task' and thus get on the CPU within a
deterministic amount of time."



Now, here's part of the description of Stride Scheduling (1995)(2):

"[...] A stride scheduler is based upon a virtual-time measurement
that enables it to determine the process to be selected for the next
time slice.

[...]

Each process is associated with a dynamic property which denotes an
indicator for scheduling relative to other processes. This property is
called the pass of a process. Allocations are performed by selecting
the process with the currently minimal pass value. After consuming its
quantum the process' pass value is advanced by an increment which is
inversely proportional to the amount of tickets allocated. As a
consequence, passes of processes advance with inversely proportional
speed of their allocations. This increment is called a stride. To keep
integer values, we multiply the inverse proportional with a large
integer constant $stride_1$."

(1) http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt
(2) http://wwwagss.informatik.uni-kl.de/Projekte/Squirrel/stride/node3.html



Still on (1), Molnár comments:
"Most of the rest of CFS's design just falls out of this really simple
concept, with a few add-on embellishments like nice levels,
multiprocessing and various algorithm variants to recognize sleepers."


This explains the sheer size of CFS. The latest patch version
(sched-cfs-v2.6.22-rc4-v16.patch as of this writing) weights in 170k.
Even if 2/3 of those were deletions of obsoleted code, it would still
be 56k big. Heck, even if only 1/5 of it consisted of active code it
would still be 34k big.

Now, notice a few excerpts from (1):
" - the introduction of Scheduling Classes: an extensible hierarchy of
  scheduler modules. These modules encapsulate scheduling policy
  details and are handled by the scheduler core without the core
  code assuming about them too much"

Guess what? I proposed that as well! But my version of "scheduling
classes" is implicit and emerge naturally as a consequence of the
general design.

"   The CFS patch uses a completely different approach and implementation
  from RSDL/SD. My goal was to make CFS's interactivity quality exceed
  that of RSDL/SD, which is a high standard to meet :-) Testing
  feedback is welcome to decide this one way or another. [ and, in any
  case, all of SD's logic could be added via a kernel/sched_sd.c module
  as well, if Con is interested in such an approach. ]"

It stunned me how Molnár needed to write a lot of logic to address the
issue of interactivity. My proposal needs either a *single*
"interactivity score" computation that's less than 10 lines of code
and would be shamelessly stolen from FreeBSD, OR __my preferred
approach__, touching the semaphore wakeup code to make everything
implicit, natural, resilient to the "history window" issue (not
knowing for how long history of a given task must be kept, because it
might behave like an interactive task for some time, then it might
behave like a CPU-bound task, ...), and correct by construction.

"   CFS's design is quite radical: it does not use runqueues, it uses a
  time-ordered rbtree to build a 'timeline' of future task execution,
  and thus has no 'array switch' artifacts (by which both the vanilla
  scheduler and RSDL/SD are affected)."

Still, the rbtree changes its shape all the time. It's better than the
"array switch" artifacts only in the sense that it doesn't suffer from
the weirdness of reinserting interactive tasks on the current queue;
it does nothing to address the cache unfriendliness.

Hierarchical Stride Scheduling is a step in the direction of avoiding
the tree changing shape all the time. My scheduler proposal is
inspired by HSS, but once again it's not a straight copy.





SUMMARY/CONSEQUENCES OF DESIGN DECISIONS FROM BOTH FRONTS
- CFS is cool, but it's not really new. Most of the "innovative" ideas
it carries have been published more than 12 years ago.
- Most of the schedulers in existence don't fit the BeOS thread
priority model. Adopting one verbatim will suck, always. It's not that
BeOS's model is The Right Way, but that's what we have to deal with
for R1.
- An *awful* lot of explicit logic present on CFS became implicit and
natural on the scheduler I proposed. I don't need to write a single
line of code to already state without a single shadow of doubt that my
scheduler is WAY less complex and achieves most of the benefits of CFS
with much smaller overhead. Remember the old saying, "the last 10%
demands 90% of the effort"? Their "10%" is mostly Linux-specific
issues we never happened to have.
- The scheduler I proposed avoids changing the shape of the RBTree by
combining the effects using the queues as the node elements, and
extending/shrinking the *offsets* as threads are inserted in and
removed from the queues. We could *easily* extend this to have
separate trees for each scheduling class, so the tree comprised of
CPU-bound threads would keep a stable shape most of the time.
- The situation is *way worse* with CFS anyway, because the node
elements are the threads themselves. Because of that, the CFS tree is
poised to be *very large* and have high maintainance costs. Removing
stuff from a RBTree is O(lg n) + constant penalty of up to 3 rotations
of the lower nodes to rebalance the tree. And that happens after
*every scheduling event* with the CFS. Hence my bold statement
regarding not having to write a single line of code to affirm the
scheduler I proposed has a much smaller overhead.
- Can the new Haiku scheduler (I guess I should come up with an
acronym ;)) be adopted mostly verbatim on other OSs? Well, strangely
enough, yes! Specially if the other OS coders would be willing to
modify the thread wakeup code as well by branding their semaphores,
mutexes, whatever sync primitives they use.
- I forgot to mention this before, but apart from thread blocking,
unblocking, creation and destruction, I don't need to touch the queues
at all, because they're per-CPU. Removing the to-become-current thread
from the queue is an artifact of having a single "ready threads" queue
shared by all CPUs. With per-CPU queues, it's enough to either move
the current thread to the end of the queue when the quantum expires,
or leave the current thread at the head if it got preempted by some
other event like HW interrupt.    But I did mention they're FIFO
queues. Well, not quite. They behave more like deques. They're
actually implemented as circular queues. A single rotation is enough
to "move" the head to the tail of the queue. Inserting at the tail is
accomplished by simply inserting at the head and then rotating.
Inserting at the head requires no rotation.
- The sticking-out-like-a-sore-thumb weakness of the scheduler I
proposed is that the very simplified incarnation of strides I'm using
could end up not producing perfect fairness. Actually, it won't
produce perfect fairness unless
size_of_array/sum_of_priorities/that_funny_number scheduling events
happen undisturbed (i.e., no threads block, wake up, are created,
destroyed; actually, unless the
size_of_array/sum_of_priorities/that_funny_number remains constant
until size_of_array/sum_of_priorities/that_funny_number scheduling
events happen). With separate trees for different scheduling classes
(with correspondingly separate sum_of_priorities attributes) this
becomes *MUCH* less of an issue, so it might be the way to go after
all. And if the rule of thumb I proposed ends up sucking a lot, and if
we can't come up with alternative heuristics to address this issue,
there's always the plan B of adopting the stride+pass scheme. But
interactive threads will have to be handled separately anyway to avoid
both the deficiencies of stride scheduling and the sheer amount of
logic dedicated to interactive tasks of CFS.



I'm here to answer whatever further questions you have.

See ya!
A.

Other related posts: