Hi, folks! I got a few personal emails since I described my plans and approach for the new Haiku scheduler. I'll let the authors remain anonymous, because I didn't ask them for permission to reproduce any part of their emails, but I thought it would be a good idea to address this question right away. The most frequently asked question is, *SHOCK SURPRISE*, how does my proposed scheduler compares with the Completely Fair Scheduler recently introduced on Linux. Or, how one of the anonymous (hey, you could come out and say your name if you feel comfortable with that!) put it: "note, I didn't read your mail, so I don't know how well your scheduler would work for us. Also, I don't know anything about schedulers. I just wanted to ask, what do you think about the Completely Fair Scheduler (CFS)? Do you think your scheduler suggestion will run comparably well on a desktop system? Personally, I wouldn't mind having the CFS reimplemented for Haiku if it's the best solution. Since you want to drop your first scheduler design I thought it might be worth asking about existing solutions." First and foremost, the very necessary disclaimer: I don't want to start any kind of religious war here or brag about how one OS got it right, how one researcher got it right, how one hacker got it right. Nor is my intention to hurt anyone's egos. Specially not mine. =P And I guess I should address FAQ #0 first: "Why did you send that email in the first place?" My message was truly a "Request For Comments" RFC, not the traditional, numbered, "For Your Information" RFC. Else, I'd tag it with [FYI] ;) I want your feedback. And in order to receive any, people must at, the very least, read it! Or skip to the interesting parts. Or skim over the whole enchilada until something catches one's attention. *Anything* is preferable over not reading a single line and pointing me to whatever is happening on the other side of the fence. I'm OBVIOUSLY aware of what is happening there. I obviously "rejected" copying it FOR A REASON. I'll return to this shortly. I understand that my email was VERY LARGE (almost 32k worth of text), but *please*, I urge people to *spend some time reading it*. I definitely spent much more time and energy writing it (remember English is not my native language) than most people will spend reading it. I'm taking no offense, but please, *read the RFC*. I want your feedback. Now, back to FAQ #1 ;) FAQ #1 asks how does it compare with the CFS. The first thing that must be understood is that *no* Linux scheduler or BSD scheduler or [whatever traditional multi-user environment] scheduler can be used as a drop-in replacement for BeOS-style schedulers. Simply because BeOS doesn't constrain higher priority levels to root-belonging threads. The BeBook clearly correlates priority levels with frequency of scheduling. This alone is enough reason not to adopt *any* Linux scheduler verbatim: Linux threads are all created equal and their priorities have no place to go but down, unless you're root. That's what hampers their desktop responsiveness. That's why they recently resorted to in-kernel renicing of threads that behave like the X11 server. That's why most traditional schedulers must mess with boosting, aging, and so on. And that's why they don't really fit the BeOS model. Not that the BeOS model is The Right Thing(TM) and it certainly becomes an issue when multi-user comes into question, but that's beyond the scope of R1 anyway. *My target is R1*, but rest assured, there's nothing in the basic design I'm using that makes it impossible to be adapted to multi-user requirements. For the sake of completeness, let's now suppose all BeOS threads were equal and userland apps had no way to set their thread's priorities up based on educated guesses about the thread's behaviour. The hard, cold fact is: Ingo Molnár would have saved himself a *lot* of time if he at least skimmed over computer science publications on the task scheduling subject from, say, 12 years ago... Heck, he didn't even need to go as far as reading ACM and IEEE periodicals, or their online indexes and abstracts and then searching for freely available versions of those publications; checking the relevant bibliography from Tanenbaum's "Operating Systems Design and Implementation", 2nd edition, would have set him on the same direction more than a decade ago. But given Linux devs chasm with Tanenbaum, that's not really very surprising... Look, this is not me being biased because I'm a computer scientist. It's not that I'm favouring the *BSD "publication-based" development model stemming from their academic roots, or alleged higher standards of commercial UNIX offerings, or anything like that. Plenty of people have criticised the recurring NIH syndrome of Linux kernel developers. And, really, the CFS is yet another victim of core Linux devs dissing whatever happens outside the LKML. Molnár's "Completely Fair Scheduler" is a re-expression of Carl Waldspurger's "Stride Scheduler", which was published in his ACM-award-winning, 1995 Ph.D. thesis. Yes, the same "Stride Scheduler" that is an extension of his 1994 work, "Lottery Scheduling", that is prominently mentioned on Tanenbaum's book. Yes, the same "Stride Scheduler" that served as the inspiration for my scheduler. Here's part of Ingo Molnár's own description of CFS (2007)(1): "The rq->fair_clock value tracks the 'CPU time a runnable task would have fairly gotten, had it been runnable during that time'. So by using rq->fair_clock values we can accurately timestamp and measure the 'expected CPU time' a task should have gotten. All runnable tasks are sorted in the rbtree by the "rq->fair_clock - p->wait_runtime" key, and CFS picks the 'leftmost' task and sticks to it. As the system progresses forwards, newly woken tasks are put into the tree more and more to the right - slowly but surely giving a chance for every task to become the 'leftmost task' and thus get on the CPU within a deterministic amount of time." Now, here's part of the description of Stride Scheduling (1995)(2): "[...] A stride scheduler is based upon a virtual-time measurement that enables it to determine the process to be selected for the next time slice. [...] Each process is associated with a dynamic property which denotes an indicator for scheduling relative to other processes. This property is called the pass of a process. Allocations are performed by selecting the process with the currently minimal pass value. After consuming its quantum the process' pass value is advanced by an increment which is inversely proportional to the amount of tickets allocated. As a consequence, passes of processes advance with inversely proportional speed of their allocations. This increment is called a stride. To keep integer values, we multiply the inverse proportional with a large integer constant $stride_1$." (1) http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt (2) http://wwwagss.informatik.uni-kl.de/Projekte/Squirrel/stride/node3.html Still on (1), Molnár comments: "Most of the rest of CFS's design just falls out of this really simple concept, with a few add-on embellishments like nice levels, multiprocessing and various algorithm variants to recognize sleepers." This explains the sheer size of CFS. The latest patch version (sched-cfs-v2.6.22-rc4-v16.patch as of this writing) weights in 170k. Even if 2/3 of those were deletions of obsoleted code, it would still be 56k big. Heck, even if only 1/5 of it consisted of active code it would still be 34k big. Now, notice a few excerpts from (1): " - the introduction of Scheduling Classes: an extensible hierarchy of scheduler modules. These modules encapsulate scheduling policy details and are handled by the scheduler core without the core code assuming about them too much" Guess what? I proposed that as well! But my version of "scheduling classes" is implicit and emerge naturally as a consequence of the general design. " The CFS patch uses a completely different approach and implementation from RSDL/SD. My goal was to make CFS's interactivity quality exceed that of RSDL/SD, which is a high standard to meet :-) Testing feedback is welcome to decide this one way or another. [ and, in any case, all of SD's logic could be added via a kernel/sched_sd.c module as well, if Con is interested in such an approach. ]" It stunned me how Molnár needed to write a lot of logic to address the issue of interactivity. My proposal needs either a *single* "interactivity score" computation that's less than 10 lines of code and would be shamelessly stolen from FreeBSD, OR __my preferred approach__, touching the semaphore wakeup code to make everything implicit, natural, resilient to the "history window" issue (not knowing for how long history of a given task must be kept, because it might behave like an interactive task for some time, then it might behave like a CPU-bound task, ...), and correct by construction. " CFS's design is quite radical: it does not use runqueues, it uses a time-ordered rbtree to build a 'timeline' of future task execution, and thus has no 'array switch' artifacts (by which both the vanilla scheduler and RSDL/SD are affected)." Still, the rbtree changes its shape all the time. It's better than the "array switch" artifacts only in the sense that it doesn't suffer from the weirdness of reinserting interactive tasks on the current queue; it does nothing to address the cache unfriendliness. Hierarchical Stride Scheduling is a step in the direction of avoiding the tree changing shape all the time. My scheduler proposal is inspired by HSS, but once again it's not a straight copy. SUMMARY/CONSEQUENCES OF DESIGN DECISIONS FROM BOTH FRONTS - CFS is cool, but it's not really new. Most of the "innovative" ideas it carries have been published more than 12 years ago. - Most of the schedulers in existence don't fit the BeOS thread priority model. Adopting one verbatim will suck, always. It's not that BeOS's model is The Right Way, but that's what we have to deal with for R1. - An *awful* lot of explicit logic present on CFS became implicit and natural on the scheduler I proposed. I don't need to write a single line of code to already state without a single shadow of doubt that my scheduler is WAY less complex and achieves most of the benefits of CFS with much smaller overhead. Remember the old saying, "the last 10% demands 90% of the effort"? Their "10%" is mostly Linux-specific issues we never happened to have. - The scheduler I proposed avoids changing the shape of the RBTree by combining the effects using the queues as the node elements, and extending/shrinking the *offsets* as threads are inserted in and removed from the queues. We could *easily* extend this to have separate trees for each scheduling class, so the tree comprised of CPU-bound threads would keep a stable shape most of the time. - The situation is *way worse* with CFS anyway, because the node elements are the threads themselves. Because of that, the CFS tree is poised to be *very large* and have high maintainance costs. Removing stuff from a RBTree is O(lg n) + constant penalty of up to 3 rotations of the lower nodes to rebalance the tree. And that happens after *every scheduling event* with the CFS. Hence my bold statement regarding not having to write a single line of code to affirm the scheduler I proposed has a much smaller overhead. - Can the new Haiku scheduler (I guess I should come up with an acronym ;)) be adopted mostly verbatim on other OSs? Well, strangely enough, yes! Specially if the other OS coders would be willing to modify the thread wakeup code as well by branding their semaphores, mutexes, whatever sync primitives they use. - I forgot to mention this before, but apart from thread blocking, unblocking, creation and destruction, I don't need to touch the queues at all, because they're per-CPU. Removing the to-become-current thread from the queue is an artifact of having a single "ready threads" queue shared by all CPUs. With per-CPU queues, it's enough to either move the current thread to the end of the queue when the quantum expires, or leave the current thread at the head if it got preempted by some other event like HW interrupt. But I did mention they're FIFO queues. Well, not quite. They behave more like deques. They're actually implemented as circular queues. A single rotation is enough to "move" the head to the tail of the queue. Inserting at the tail is accomplished by simply inserting at the head and then rotating. Inserting at the head requires no rotation. - The sticking-out-like-a-sore-thumb weakness of the scheduler I proposed is that the very simplified incarnation of strides I'm using could end up not producing perfect fairness. Actually, it won't produce perfect fairness unless size_of_array/sum_of_priorities/that_funny_number scheduling events happen undisturbed (i.e., no threads block, wake up, are created, destroyed; actually, unless the size_of_array/sum_of_priorities/that_funny_number remains constant until size_of_array/sum_of_priorities/that_funny_number scheduling events happen). With separate trees for different scheduling classes (with correspondingly separate sum_of_priorities attributes) this becomes *MUCH* less of an issue, so it might be the way to go after all. And if the rule of thumb I proposed ends up sucking a lot, and if we can't come up with alternative heuristics to address this issue, there's always the plan B of adopting the stride+pass scheme. But interactive threads will have to be handled separately anyway to avoid both the deficiencies of stride scheduling and the sheer amount of logic dedicated to interactive tasks of CFS. I'm here to answer whatever further questions you have. See ya! A.