INDEX:

PREFACE
INTRODUCTION (Or: how I got here)
PART I (Or: rose-coloured glasses are both the blessing and the curse of being in love)
PART II (Or: what you learn after a number of false starts)
  Section A: Inspiration
  Section B: Design decisions
PART III (Or: what we have so far, and what the future holds)

PREFACE

Hi, guys! Sorry about the silence. I got a little busy at school, and had to (mostly) sort out some issues I found with several approaches I tried. This is the result of roughly a month's worth of deep introspection, and of letting go of some of my personal beliefs and wishful thinking to come up with something truly powerful.

The general guidelines for the algorithms and data structures used were, in order of importance: effectiveness, succinctness, generality, choosing appropriate data structures instead of micro-tweaking the algorithms, and extensibility.

This is going to be a little big, written in a somewhat "educational" style, and it also tells the whole story of how I got here; that's because I intend to break it down into smaller parts to be posted as blog entries :) If you'd rather cut to the point, scroll down to PART II. If you don't, I hope you find this text entertaining :) At the risk of sounding too much like the Poignant Guide (minus the cartoons), here I go ;)

INTRODUCTION (Or: how I got here)

The history goes like this: my first contact with BeOS was when I was still attending high school. Unfortunately I didn't experience the Golden Years, namely the R4 era. My first real contact with BeOS was with R5 PE; then I bought the Gobe Productive + R5 Pro bundle, and BeOS was my main OS for quite some time on my Pentium 133. Ah, nostalgia :)

I'm a Computer Science student now. Circa 3 years ago I designed a thread scheduler as an assignment for the Operating Systems class. It fared pretty well in artificial benchmarks with purely CPU-bound simulated threads. I was excited.
Seemed like I did something really cool :D I contacted Axel right after that semester finished, told him about the scheduler, and asked him if there was interest in using it on Haiku. He was pretty open-minded, and I promised him I'd translate the documentation written for that assignment into English.

I never did. Why? Well... I burned out. College was eating most of my energy. I was just getting out of a troubled relationship. And it all coincided with a hard drive failure. I lost almost no data, but the excitement was just gone. I used to be quite active on BeShare and IRC, and even compiled stuff on my old trusty Pentium 133... (if Fyysik is reading this, he'll remember that Pentium immediately ;))

Anyway, the scheduler plans were on the back burner for roughly the mentioned 3 years. As you'll remember, before we got accepted into GSoC I mailed the list about ticket #1069 (http://dev.haiku-os.org/ticket/1069), asking if I could still participate. Then I spent some days digging up the sources for the scheduler and re-doing some benchmarks on my current computer. Things still looked nice :)

PART I (Or: rose-coloured glasses are both the blessing and the curse of being in love)

So, now I set out to improve my original algorithm's performance even further. But that's just because I *knew* it sucked. It was extremely inefficient as far as the implementation goes; it looked great in the benchmarks because it was being compared to O(n) algorithms while it was O(1) itself, so I already had a head start, so to speak. But I knew the algorithm very well and understood that there were plenty of bottlenecks to fix.

Well, things started to look great as I progressed. I eliminated a nested loop, then I eliminated an outer loop, then I benchmarked a little more, and I reached extremely small overhead (only 8x slower) compared to the theoretical lower bound, i.e., a standard "for" loop that reads sequentially from an array. But then something hit me.
My algorithm made no distinction between absolute priority levels, only relative ones. If you read the descriptions of set_thread_priority() and suggest_thread_priority() in the BeBook, you'll realise that there should be an attempt to correlate priority levels with the frequency at which a given thread is scheduled. This means, at the very least, that if I have two threads in the system, one with a higher priority than the other, the higher-priority one shouldn't, for example, always be picked 66% of the time while the other one is picked 33% of the time. If said priorities are, say, 20 and 10, then those numbers are OK (a 2:1 ratio). But if one has a priority of 40 and the other a priority of 10, the ratio at which they're chosen should be 4:1.

My algorithm only cared whether one had a higher priority than the other, and tended to pick ratios such that the highest-priority thread got the largest share, which then diminished progressively as levels became lower. If there were 3 threads in the system with different priority levels, the ratios would be 50:33:16; with 4 threads, 40:30:20:10, and so on. I felt that this was just wrong.

And then I resorted to nasty workarounds. First, I tried adjusting the ratios with an exponential function. And although the ratios actually improved, things started to feel way too hackish. You know what? I was so infatuated with my own algorithm that I didn't see how plainly flawed it was. So, I admit it: I spent nearly 3 weeks polishing a turd.

PART II (Or: what you learn after a number of false starts)

Section A: Inspiration

After I admitted the basic flaws of my previous attempts, I decided to go back to basics and use a bottom-up approach to come up with solutions to the needs of the Haiku scheduler: fairness, responsiveness, scalability, succinctness and efficiency.

The BeBook already hints at what the priority numbers should roughly mean: frequency of scheduling. We have 120 priority levels; the highest 20 are reserved for real-time threads.
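To make the proportionality rule concrete, here's a tiny Python model (just an illustration, obviously not kernel code): a thread's fair share of the CPU should be its priority divided by the sum of all priorities, however many threads there are.

```python
# Model of the BeBook-hinted rule: share(i) = priority(i) / sum(priorities).

def expected_shares(priorities):
    total = sum(priorities)
    return [p / total for p in priorities]

# Priorities 20 and 10 give a 2:1 ratio...
assert expected_shares([20, 10]) == [2/3, 1/3]
# ...but 40 and 10 must give 4:1, not the same 2:1 my old algorithm produced.
assert expected_shares([40, 10]) == [0.8, 0.2]
```

My flawed algorithm always produced the same fixed ratios for n threads (e.g. 50:33:16 for three) regardless of the actual priority values; the model above is what it should have produced instead.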
So while I was studying classic scheduler approaches in the Minix book, I noticed that the description of lottery-based scheduling was very compatible with that notion of priorities meaning the approximate frequency of being chosen by the scheduler.

Lottery scheduling was proposed by Carl Waldspurger in 1994, and it works roughly like this: suppose you have some resource that is limited (like CPU power, I/O bandwidth, network bandwidth, and so on). Suppose you have several processes competing for those resources, and they have different priorities. Now, suppose you have a number of tickets, where each ticket corresponds to a share of time some process can use those resources. If you give many tickets to a higher-priority process, it will be allowed to use those limited resources proportionally to the number of tickets it holds. So a process with priority 20 gets 20 tickets, a process with priority 10 gets 10 tickets, and so on. Like in a lottery, a ticket is picked at random, and the process holding the winning ticket is allowed to use the resource. As time goes by, the processes will have used the resource proportionally to the number of tickets they hold.

This is very nice in theory, but there are some problems. The main problem is the randomness: the first several rounds of scheduling will resemble anything but the proportions we'd like to attain. We only observe the desired ratios in the long run. A possible solution to this was proposed by Waldspurger himself in 1995, when he introduced stride-based scheduling. In a nutshell, he replaced random lotteries with strides and passes. All processes have a property called "pass" and a property called "stride". Strides are inversely proportional to the priority. Each time a process is chosen, its pass is increased by its stride. The scheduler always picks the process with the lowest pass.

The second problem lies in the mapping between tickets and processes. Searching for a ticket by visiting every process is very inefficient.
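The pass/stride mechanism is simple enough to sketch in a few lines of Python (a model of Waldspurger's idea, not an actual implementation; the constant and names are mine):

```python
# Minimal sketch of stride scheduling: stride is inversely proportional
# to priority; always run the task with the smallest pass, then advance
# its pass by its stride.

STRIDE1 = 10_000  # large constant so integer strides stay precise

class Task:
    def __init__(self, name, priority):
        self.name = name
        self.stride = STRIDE1 // priority
        self.pass_ = 0  # "pass" shadows nothing, but the underscore avoids confusion

def schedule(tasks):
    t = min(tasks, key=lambda t: t.pass_)  # lowest pass wins
    t.pass_ += t.stride
    return t.name

tasks = [Task("A", 10), Task("B", 20)]
picks = [schedule(tasks) for _ in range(30)]
# Deterministically, B (priority 20) runs exactly twice as often as A (priority 10).
assert picks.count("B") == 2 * picks.count("A")
```

Note how the desired ratio holds from the very first rounds, with no randomness involved; that's exactly what fixes the lottery's short-term skew.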
We could use hash tables, or trees, to do this indexing. I'll tackle this shortly.

The third problem is that the algorithm as expressed in Waldspurger's work seems to have linear complexity in the number of processes in the system. One of the requirements of Haiku's scheduler is that it should have constant (O(1)) complexity in the number of processes (threads, in our case). One possible workaround is to have what Waldspurger called "hierarchical lottery/stride scheduling", but I devised a much simpler expression of the same idea. Read on, this will be tackled shortly, too.

The fourth and worst problem is that both lottery- and stride-based scheduling are great for processes that want to use as much of the limited resource as possible during their time share. In the case of CPU power, those would be CPU-bound threads. BeOS and Haiku are desktop operating systems, where interactive threads are prevalent. But wait, all is not lost ;) There are solutions for all those problems, and I have come up with some twists that I'd like to share with you.

I'm convinced that even if stride scheduling as expressed in Waldspurger's thesis may be less than optimal for interactive tasks (i.e., it doesn't approach shortest-job-first scheduling, which is proven to provide the best responsiveness), the idea of using strides results in perfect fairness when all tasks are equal. One can actually prove this mathematically. Not necessarily his stride + pass approach, though; simple strides will do.

For example, for modelling purposes, suppose you have an array where you store a task T as many times as T's priority (let's call this the temporal frequency property, or TFP). So let task A have priority 1, task B have priority 2, and task C have priority 3. One possible array for which the TFP holds is

A B B C C C

but so does any permutation of this array.
If you multiply the occurrences of every element by a constant factor, for example 2, the array becomes

A A B B B B C C C C C C

and the TFP still holds for this array, as well as for any possible permutation of it. It turns out that any array for which the TFP holds can be reduced to the "trivial" array by ordering it and "scaling" it down, so that each task appears priority / greatest_common_divisor(A.prio, B.prio, C.prio, ...) times. Alternatively, it's enough to calculate the stride using the trivial array, and then scale the stride using the GCD. Nice, so now we have an invariant :)

So let's keep the trivial array, A B B C C C. Notice that any strided traversal of this array, modulo the size of the array (which is identical to the sum of all priorities), where the stride is coprime to the size of the array (let's call this the Good Stride Property, or GSP), will return a given task proportionally to the number of times it appears in the array, divided by the size of the array (i.e., the sum of the priorities of all tasks). So in our example A will appear 1/6 of the time, B 2/6 (1/3) of the time, and C 3/6 (1/2) of the time. For ANY stride for which the GSP holds.

Trivial case, stride = 1: traversing the array outputs A, B, B, C, C, C, A, B, B, ...
Stride = 2: not coprime, the GSP doesn't hold. Would output A, B, C, A, B, C, ...
Stride = 3: not coprime, the GSP doesn't hold. Would output A, C, A, C, A, C, ...
Stride = 4: not coprime, the GSP doesn't hold. Would output A, C, B, A, C, B, ...
Stride = 5: coprime, OK. Outputs A, C, C, C, B, B, ...
Stride = 6: not coprime.

(Well, unfortunately I picked a terrible example... 6 is an awful number for the array size, since only 1 and 5 are coprime to it, and the modular distance of both 1 and 5 to 6 is 1...)

Anyway, all it takes is respecting the GSP. After size_of_array scheduling rounds, the correct proportions will emerge by construction.
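The claim above is easy to check mechanically. Here's a small Python model of the strided traversal over the trivial array (again, just a sketch of the math, not scheduler code):

```python
from math import gcd

# Walk the "trivial array" A B B C C C with a given stride, modulo its size.
array = ["A", "B", "B", "C", "C", "C"]
n = len(array)  # == sum of all priorities (1 + 2 + 3)

def traverse(stride, rounds):
    idx, out = 0, []
    for _ in range(rounds):
        out.append(array[idx])
        idx = (idx + stride) % n
    return out

# Any stride coprime to 6 (only 1 and 5 here) yields the exact
# proportions after one full pass: 1x A, 2x B, 3x C.
for stride in (1, 5):
    assert gcd(stride, n) == 1
    assert sorted(traverse(stride, n)) == ["A", "B", "B", "C", "C", "C"]

# A non-coprime stride cycles early and starves tasks (here B starves):
assert traverse(3, 6) == ["A", "C", "A", "C", "A", "C"]
```

So the GSP really is the whole requirement: coprimality guarantees the walk visits every slot exactly once per size_of_array rounds.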
Actually, picking a good stride ensures that the distribution of the tasks in time is more even, but there's no algorithm that will pick a stride displaying this evenness property in every possible case, because such a stride might not even exist. The example above is precisely a counter-example that proves this assertion. But there are heuristics available to pick a good stride whenever possible.

Now, what if we add a new task? Should we abandon the current strided traversal and start anew? Well, I still haven't decided. So far I've considered starting anew but taking the opposite direction (subtracting strides if they were being added in the previous scheduling round, and vice versa), with a different stride, starting from a random position. This should eventually converge to a fair distribution. Simply finishing the current strided traversal until it completes size_of_array rounds while ignoring new tasks would be *terrible* for responsiveness when the number of tasks grows. I believe using the rule of thumb I described as a compromise should be reasonable enough, but I'm open to suggestions.

Now, let's forget the array abstraction for a little while. Time for something more concrete :)

PART II

Section B: Design decisions

Suppose you're starting with a clean slate, and all you have so far are the tasks and their respective priorities. Now let's scribble a little on that slate :) Remember our application domain: scheduling threads.

The implementation I have considered so far looks like this: we would have several queues, one per priority level. There would be 120 of those on Haiku. Let's make a little distinction here: 20 of those are reserved for real-time threads. The other 100 are for non-RT threads. So let's separate the scheduling jobs into two "realms": RT and non-RT. For the sake of simplicity, whenever RT threads are ready, they're chosen, in priority order. It's that simple.
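The two-realm split can be sketched like this (a Python model; the queue layout and names are hypothetical, and the non-RT path is just a callback standing in for the stride machinery):

```python
# RT realm first, non-RT realm later: RT queues are always served first,
# in strict priority order; strides never touch them.

RT_LEVELS = range(100, 120)  # the 20 reserved real-time priority levels

def pick_next(queues, pick_non_rt):
    """queues: dict mapping priority level -> FIFO list of thread names."""
    for level in sorted(RT_LEVELS, reverse=True):
        if queues.get(level):
            return queues[level].pop(0)  # highest-priority ready RT thread wins
    return pick_non_rt()                 # stride-based choice goes here

queues = {115: ["rt_audio"], 110: ["rt_midi"], 10: ["app"]}
assert pick_next(queues, lambda: "app") == "rt_audio"
assert pick_next(queues, lambda: "app") == "rt_midi"
assert pick_next(queues, lambda: "app") == "app"
```

(The anti-starvation tweaks discussed later — e.g. occasionally skipping the RT realm — would slot into the top of pick_next.)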
Now let's take care of the non-RT realm. We only care about "active" queues, i.e., those which have queued threads. Most of the time, however, we don't need to allocate space for every queue; using more than 10 queues is actually pretty rare. If we decide that SMP should be implemented with per-CPU queues* (all 120 of them... times the number of CPUs!), wasting this space makes me feel somewhat sloppy. Same thing with hash tables: they'll either be space-inefficient or too small to bring any real benefit.

(*) Read on! ;)

Axel told me not to worry about this, though, and that it's better to be on the safe side and preallocate all the memory we should ever need. If we ever feel the need to manage that memory dynamically, for instance if we end up needing to support hundreds of cores, we can adapt the algorithm easily. Anyway, under any circumstances we'd better come up with a dynamic, compact and versatile data structure.

Well, arrays are compact; we could simply place the queues where they belong and use a bitmap to track active/inactive queues. That's not very cache-friendly, though. We could place the queues all together at the front of the array, then. But now we must take into account the fact that queues can come and go at any time, and I feel that insertions in the middle of the array will be way more common than insertions at the end. Insertions will then require moving up to O(q) queues. We could potentially use some memcpy tricks, but I don't know if it's worth the effort. Not to mention that if we ever go the dynamic memory management route, memory fragmentation is almost certain to happen, as we would allocate and deallocate memory every time a queue is destroyed and almost every time a queue is created and must be inserted into the array, since we'd like to keep the structure compact. Hugo's slab allocator would come in handy, but...
Well, abusing it to implement resizing arrays, in kernel land, for use in the thread scheduler, where insertions and removals will happen frequently... that's still a hack. And an ugly one, which I'd rather avoid RIGHT NOW. Linked lists are less of a hack, but finding the insertion point still takes O(q). Eeeh... So... Trees?

Well, trees should be OK. Red-black trees are nice, as they're self-balancing and keep the cost of any operation at O(lg q); in the worst case we'd have 100 non-RT active queues, so any operation (insertion, deletion, search) is always going to be upper-bounded by O(7), which reduces to O(1). Nice!!

Now how can we relate that to strides, so we get fair scheduling? Suppose you keep the sum of the priorities of all queues in a variable; let's call it SoP. Let the variable Stride be... the stride. All the array ever did was map the numeric index of a position to the corresponding task. So index 1 mapped to A, indexes 2 and 3 mapped to B, and indexes 4 to 6 mapped to C. If we could emulate that with just SoP and Stride, we could do away with the array altogether.

Well, producing numeric indexes (i.e., tickets) is completely trivial:

index = ((index + Stride) % SoP) + 1

Mapping them, however, is a little tricky. Let's go back to the array model. Instead of placing tasks multiple times in the array, we could store triples comprised of (task, offset, priority). Our array would become

(A, 0, 1), (B, 1, 2), (C, 3, 3)

Notice that offset(K) = offset(antecessor(K)) + priority(antecessor(K)).

In our example, index will vary from 1 to 6, and mappings would work like this:

1. Let T be the first node.
2. If T.offset < index <= (T.offset + T.priority), then return T.
3. Else, let T be the next node, and go to step 2.

That's very naive, but it would work. We have two options here: either we store computed offsets within the nodes, or we compute them on demand.
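Here's the naive triple-based mapping as a runnable Python model (the 4-tuple variant described next simply replaces priority with priority * number_of_threads):

```python
# (task, offset, priority) triples for A(1), B(2), C(3);
# offset(K) = offset(antecessor(K)) + priority(antecessor(K)).
triples = [("A", 0, 1), ("B", 1, 2), ("C", 3, 3)]
SoP = sum(prio for _, _, prio in triples)  # 6

def map_index(index):
    """Map a ticket in 1..SoP back to its task (the naive linear scan)."""
    for task, offset, priority in triples:
        if offset < index <= offset + priority:
            return task
    raise ValueError("index out of range")

# Tickets 1..6 map back to A B B C C C, exactly like the array model.
assert [map_index(i) for i in range(1, SoP + 1)] == ["A", "B", "B", "C", "C", "C"]

# Ticket generation per the formula: index = ((index + Stride) % SoP) + 1.
# Note the trailing +1 makes the effective 0-based step Stride + 1, so a
# Stride of 4 here walks the tickets with step 5, which is coprime to 6.
index, stride, picks = 1, 4, []
for _ in range(SoP):
    picks.append(map_index(index))
    index = ((index + stride) % SoP) + 1
assert sorted(picks) == ["A", "B", "B", "C", "C", "C"]
```

The linear scan in map_index is exactly what the tree is meant to replace; the offsets are what make a logarithmic search possible at all.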
The drawback of the first option: upon insertion, we must adjust every offset in the nodes ahead of the insertion point. The drawback of the second option: we must recursively visit every antecessor node to compute the offset. A little observation couldn't hurt here: we don't have to keep the nodes ordered! We can always insert at the tail! So computed offsets are the way to go!

Eeeh... Not so fast. Consider deletions. They will FREQUENTLY occur in the middle of the list. Worse: they will frequently occur at the head as well. So we still need to propagate offset deltas to successor nodes. That's still O(q). So not keeping the order is buying us very little; we now have two O(q) operations instead of three.

But we have already decided to use balanced binary trees so that every operation is O(lg q), and therefore bounded by O(7) in our case, haven't we? I'm glad we did! Why? Well, consider this 4-tuple:

(PQ, offset, priority, # of threads)

This 4-tuple differs from the (task, offset, priority) triple in that step 2 of the mapping function now becomes (variable names adjusted accordingly):

2. If PQ.offset < index <= (PQ.offset + PQ.priority * PQ.#threads), then return PQ.

And the offset rule now becomes

offset(K) = offset(antecessor(K)) + priority(antecessor(K)) * no_of_threads(antecessor(K))

Why would I do that? Well, we store the queues, not the threads, so the tree is kept compact. Threads are picked from the queues in FIFO order; that naturally gives us round-robin scheduling when we consider a queue in isolation. "Boosting" the queue's priority by multiplying it by the number of threads it holds has the exact same effect as using the threads as nodes, but it's much more space-efficient, not to mention that it avoids messing with the tree on every insertion and removal of threads: if there is a corresponding queue in place already, the tree is largely untouched. "Largely", you say? Yes.
The only reason it is not completely untouched is that we must take the offsets into account. Ah... those damned offsets. But I managed to avoid both explicitly propagating them to the successors and recursively recomputing them from the antecessors. How? Using one of the oldest tricks in computer science: lazy evaluation! (Well, here I can attest that knowing some functional programming and artificial intelligence techniques really DOES pay off when you least expect it ;))

First, we must assert that there won't be bogus deletions, i.e., we won't try to remove nodes that are not in the tree. Easy enough: removals only happen when there's a single thread left in the priority queue, and we're removing that thread from the queue. So now the queue is empty, and this is the only situation that triggers a removal operation. We must also ensure that when we insert a node, it pushes every successor node.priority units to the right. This is the same as adding node.priority to every successor's offset component. The key is to postpone pushing nodes around for as long as we can.

In order to use lazy evaluation, we must augment our node from a 4-tuple to a 5-tuple:

(PQ, offset, offset delta, priority, # of threads)

Now every time we insert a node, we add three extra steps to the insertion routine:

a. If the current node's offset_delta field is non-zero, add it to the *offset* field of the current node. Also add it to the *offset_delta* field of the left child node (when present) and of the right child node (when present).
b. If the new (to-be-added) node will follow the path to the left child (i.e., it has a lower priority than the current node), add new_node.priority to current_node.offset and to current_node.right.offset_delta.
c. When you find the point of insertion of new_node, remember to compute new_node.offset using the offset rule.

Step (a) must be repeated in the search operation and in the deletion operation as well.
Step (b) is not required when searching, but when deleting it changes a little: you must now subtract node_to_be_deleted.priority from current_node.offset and from current_node.right.offset_delta. Step (c) is only required when inserting.

That's it for inserting, searching for and removing whole queues. The adaptations necessary for inserting threads into queues are left as an exercise for the reader ;)

PART III (Or: what we have so far, and what the future holds)

Cool, now we have a space- and time-efficient data structure to handle the stride scheduling thing. Not only that, but the same reasoning applies to *any* tree, should we decide to use splay trees, AA trees, AVL trees, unbalanced binary trees, whatever. So far we have taken care of the (not anymore) skewed proportions due to randomness (by using strides), the mapping of tickets to tasks (by using trees, "indexes" as the tickets, offsets to enable using indexes, and lazy evaluation to efficiently implement offset propagation), and the order of complexity (by using the queues as the node elements of the tree, where the key is the base priority times the number of threads in that queue).

(BIG sidenote: notice that hash tables mapping index numbers to queues won't really work: we're dealing with ranges, and you really don't want to add a hundred tickets to the hash table every time a thread with priority 100 is spawned... For the sake of clarity, let me re-state the problem with an analogy: it's like having an arbitrary number of sticks of different sizes that are multiples of the same unit, and arranging them in a circle. Now you begin at an arbitrary point and walk R unit-large steps. Which stick will you be standing next to when you finish walking?
The problem lies in expressing the following facts: the sticks can be placed arbitrarily, but once you lay them down, they must be kept in that order at least until a (new) stick is removed from(/introduced into) the circle; the number of steps you take depends on how big the circumference of said circle is; the circumference will evidently change when you change the number of sticks; and after you finish taking size_of_circumference steps, you MUST have stood beside a given stick size_of_stick times.)

Now, responsiveness is another matter entirely, and I intend to take the "easy" way out: forget about the cap on priority levels.

First, there are 100 "regular" priority levels and 20 real-time priority levels. As mentioned, RT levels would be scheduled separately, and only when RT queues are active (evidently). They should be scheduled unconditionally when active, and in priority order. So strides will never touch RT queues. Keep this in mind: RT realm first, non-RT realm later. Axel mentioned he'd like to have some mechanism to prevent starvation of non-RT queues. I believe something as simple as randomly skipping RT scheduling 5% of the time should do it. Travis mentioned that he'd rather kill an RT thread that consumes a whole quantum in a scheduling round. He has a point there: RT threads should never go runaway, ever. I'd like to know what you people think.

Now that RT is (mostly ;)) out of our way and scheduled completely apart from "regular" threads, there's absolutely NOTHING stopping us from using any "regular" priority level that's way beyond 100. The 120 priority numbers are simply part of a protocol between userland and the kernel. They are only numbers. There's nothing (other than sanity) preventing us from internally numbering levels starting from 39254, or having them spaced 50 units apart.
This means we really don't need to care if we artificially inflate those numbers well beyond what userland would consider part of the RT-reserved levels, as long as we never expose the internal numbering scheme to userland: we have already taken care of the RT realm. Thus, we could practically force any thread to run by simply placing it in a "fictitious" priority level (queue) that results from adding, say, 1000 to that thread's base priority level. This way, if we detect that an interactive thread just woke up after receiving input from the user interface (using some heuristics, like those found in FreeBSD's ULE scheduler), we could boost it through the roof so it will (almost) certainly run next (unless a thread with an even higher priority also received this "mega-boost").

Notice that:

* Mega-boosts *do not* interfere with RT scheduling; those are separate scheduling realms.
* They are only good for a single scheduling round; after the boosted thread gets scheduled, it should be placed right back into its "native" queue, unless it blocks once again. Thus, those boosts should be completely transparent to the thread.
* They should *only* be used to boost interactive threads, preferably those associated with the GUI; boosting CPU-bound, "batch-like" threads is a sure way to get crappy responsiveness.
* The whole mega-boosting thing is effectively an expression of "scheduling classes". This way, we can transparently prioritize certain patterns like interactive behaviour, and let CPU-bound threads be (quasi-optimally) handled by the variant of stride scheduling previously described. Thus we can emulate ideal shortest-job-first scheduling (impossible to achieve in a general setup) quite successfully!

And we don't need to stop here:

* I mentioned in a previous message that I'd like to "brand" semaphores with the subsystems they belong to, like GUI, disk I/O, network I/O, HID I/O and so on.
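The mega-boost bookkeeping is almost trivially small; here's a sketch (the constant 1000 comes from the text above, but the function name and the wakeup flag are mine, standing in for whatever interactivity heuristic we end up with):

```python
# Boosted threads are filed under a fictitious internal level
# (base + BOOST) for a single round, then returned to their native queue.
BOOST = 1000

def effective_level(base_priority, interactive_wakeup):
    return base_priority + (BOOST if interactive_wakeup else 0)

# A GUI thread at priority 15 that just woke on user input outranks a
# CPU-bound thread at priority 100 for one round...
assert effective_level(15, True) > effective_level(100, False)
# ...and competes at its native level again afterwards.
assert effective_level(15, False) == 15
```

Since these levels are purely internal and the RT realm is scheduled separately, the boosted level can never collide with real-time priorities, no matter how large BOOST is.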
So in the future, if we get to brand semaphores with the subsystems and purposes they serve, this boosting mechanism (i.e., implicit scheduling classes), possibly even coupled with variable quanta, could be extended to transparently shape loads in any way we desire, be they I/O loads, CPU loads, network loads, etc. Transparently, in the sense that application developers wouldn't need to explicitly guess how to parametrize their threads, or even create separate threads to perform trivial operations because those could otherwise mischaracterize another thread's behaviour; a thread would *always* be adequately responsive depending on the code path it follows. Cool, huh?

One obvious application that comes to mind is providing time-sensitive multiplayer games (FPS, RTS and so on) with quality-of-service-like reserved shares of resources such as network bandwidth, CPU time and so on, while you have dozens of torrents happily downloading and any number of distributed computing projects crunching in the background. And your framerate remains undisturbed. The sky is the limit.

In a nutshell: full-fledged scheduling classes, transparent to the application developer. Always doing the right thing based not on explicit hints, but on actual behaviour. Load shaping based on whatever we wish, tweakable by userland with the simple and familiar "priority level" interface, but extended well beyond just CPU priorities.

Now, for SMP... Well, that definitely deserves some further discussion. I've already hinted that I'm almost set on local, per-CPU queues, and, if possible, every CPU should schedule its threads independently. Having a single "scheduling time" that is synchronized across CPUs is pretty sub-optimal. So I'd like to disturb CPUs as little as possible, by reducing IPIs and cache line syncs.

Let's consider the direction technology is heading. Intel is suggesting that their future multicore CPUs will allow disabling select cores and overclocking the active ones when load is low.
We could have threads spawned on any (enabled) CPU/core (from now on I'll use the term CPU to mean "processor unit", be it a core on a multicore chip or a traditional multi-socket setup; I know this generalization is not optimal when IPIs and caches enter the equation, but read on!), and each CPU would be responsible for balancing the load. If one CPU "sees" another that is less loaded, it migrates a thread to that CPU.

Thread migration could be implemented with per-CPU "incoming" queues. CPU F would move a thread into CPU G's incoming queue, and every time CPU G runs the scheduler, it takes (one? all? some?) thread(s) from this queue and puts them into the appropriate priority-level-related queue. Such incoming queues should of course be protected, either with non-blocking structures or with mutexes. Alternatively, there could be "outgoing" queues instead, and "idling" processors could actively fetch those left-behind threads... I'd like to know what you think.

Regarding affinity: well, using the apparatus just described, implementing it becomes surprisingly easy. With per-CPU queues, threads are not migrated unless the load is unbalanced and said threads don't have their "affinity flag" set. Affinity flags could come in several flavours: migrate as you wish (i.e., not set), do not migrate, migrate only to CPUs in a given CPU pool, and so on.

"Hey!! CPU pools??" Remember when I mentioned that generalising processor units was not optimal? It's because THIS is what I actually had in mind. The CPU pool abstraction is very powerful, because it allows us to encapsulate things like multi-core and multi-socket setups, SMT/Hyper-Threading, shared/discrete caches and so on behind certain CPU pool profiles. The CPU affinity API should be designed in such a way that exploiting those incarnations of SMP is transparent and close to optimal, without having to resort to the amount of exceptions and special cases that abound in Linux's and FreeBSD's schedulers.
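The incoming-queue migration scheme can be sketched like this (a Python model of the idea; a mutex stands in for whatever lock-free or spinlock-protected structure the kernel would actually use, and all names are mine):

```python
import threading
from collections import deque

# Per-CPU "incoming" queues: CPU F pushes a thread onto CPU G's incoming
# queue; G drains it into its local runqueue at its next scheduling tick.
class CPU:
    def __init__(self, name):
        self.name = name
        self.runqueue = deque()    # local; touched only by this CPU
        self.incoming = deque()    # shared with other CPUs, hence the lock
        self.incoming_lock = threading.Lock()

    def migrate_to(self, other, thread):
        # Called by a loaded CPU that "sees" a less loaded one.
        with other.incoming_lock:
            other.incoming.append(thread)

    def drain_incoming(self):
        # Called at the start of this CPU's scheduling round.
        with self.incoming_lock:
            while self.incoming:
                self.runqueue.append(self.incoming.popleft())

cpu_f, cpu_g = CPU("F"), CPU("G")
cpu_f.migrate_to(cpu_g, "worker_thread")
cpu_g.drain_incoming()
assert list(cpu_g.runqueue) == ["worker_thread"]
```

Note that only the incoming queue is ever shared: the local runqueue stays private to its CPU, which is exactly what keeps IPIs and cache line syncs to a minimum.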
OK, those last paragraphs are surprisingly small for such hairy themes. Well, let me know what you think, and feel free to correct me and to ask for further clarification. As I mentioned, I intend to post blog entries based on this message, so I tried to be as educational as possible ;) Your criticism is greatly appreciated.

Thank you for your time!

A.