Splitting this into a different thread in order not to hijack René's. Copying and pasting my own text liberally :)

I'm considering this approach to implement load balancing and migration.

First, supposing the kernel is equipped with some means of discovering the topology of the cores on the system, it would build a graph of migration penalties, such that cores connected by slower means would receive a higher penalty (SMT* < shared cache < SMT** < shared socket < shared... backplane? [and Haiku goes NUMA!]). Let's assume a tree where cores are the leaves and interconnections are the nodes. The kernel could use default values for each class of interconnect, or it could measure the actual latency of transferring a thread from one core to others sitting on different branches of the tree.

(*, **: SMT is tricky, as discussed in a previous thread. Affinity flags could be used to hint to the scheduler the proper situations for running a thread on a logical core, so as not to thrash caches and so on.)

The graph of penalties would be represented by a compact triangular matrix, since the penalty between a pair of cores is symmetric, i.e., (Ca, Cb) equals (Cb, Ca).

Let's assume that migration happens either explicitly, where an application uses the proper syscalls to distribute its threads to different cores, or implicitly, for load balancing. Only the latter concerns the scheduler and its load balancer.

I'm under the strong impression that this is already implemented in Haiku, but anyway: each core should have some bookkeeping structure noting things like its load average and so on. Each core would also store pointers to its N "hog" threads, that is, the threads that consume most of their allocated timeslices. Those threads are the prime candidates for migration to less-loaded cores. There would be an array of pointers to these per-core structures.
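To make the triangular-matrix idea concrete, here is a minimal C++ sketch (all names are mine, not actual Haiku kernel code). Since the penalty is symmetric and a core has zero penalty to itself, only the strict lower triangle needs storage: N*(N-1)/2 entries instead of N*N.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch: a compact triangular matrix of migration
// penalties. penalty(a, b) == penalty(b, a) and penalty(a, a) == 0,
// so only the strict lower triangle is stored.
class MigrationPenaltyMatrix {
public:
	explicit MigrationPenaltyMatrix(int coreCount)
		: fPenalties((size_t)coreCount * (coreCount - 1) / 2, 0) {}

	void Set(int a, int b, uint32_t penalty)
	{
		if (a != b)
			fPenalties[_Index(a, b)] = penalty;
	}

	uint32_t Get(int a, int b) const
	{
		return a == b ? 0 : fPenalties[_Index(a, b)];
	}

private:
	// Map the unordered pair (a, b), a != b, onto the strict lower
	// triangle: for row r > column c, index = r*(r-1)/2 + c.
	size_t _Index(int a, int b) const
	{
		int r = std::max(a, b);
		int c = std::min(a, b);
		return (size_t)r * (r - 1) / 2 + c;
	}

	std::vector<uint32_t> fPenalties;
};
```

The values stored here would come either from per-class defaults or from the measured transfer latencies mentioned above.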
When the load balancer kicks in, it sorts this array using the load average as the key, while computing the average load across cores. It then examines the most loaded cores for "hog" threads, stopping when it reaches a core whose load is smaller than the average. Let K be the number of cores above this threshold.

It then does the reverse, comparing each of the selected loaded cores to the less loaded ones, adding the migration penalty to the load average of each candidate destination core. The penalties should be chosen so that they represent the time wasted on migration and on re-warming the caches, but remain small enough that the scheduler isn't discouraged from migrating to an idling core just because it sits on a distant branch of the interconnect tree. The load balancer then migrates the thread to the best destination core among the candidates.

Let W be the number of cores considered for receiving migrated threads, with K <= W <= 2K, so that the algorithm scales reasonably well to a large number of cores.

To avoid the ping-pong effect, a just-migrated thread could have a "cool-down" period during which it would not be allowed back onto the "hogs" list.

What do you people think?

Cheers,
A.
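The balancing pass above could be sketched roughly like this in C++ (again, CoreInfo, PenaltyFunc and all other names are illustrative assumptions, not the real Haiku scheduler; the cool-down filtering is left out for brevity):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical per-core bookkeeping structure.
struct CoreInfo {
	int id;
	uint32_t loadAverage;         // e.g. percentage of timeslices in use
	std::vector<int> hogThreads;  // the N biggest timeslice consumers
};

// Symmetric migration penalty between two cores, e.g. looked up in a
// compact triangular matrix built from the topology tree.
typedef uint32_t (*PenaltyFunc)(int a, int b);

struct Migration {
	int thread, from, to;
};

std::vector<Migration>
BalancePass(std::vector<CoreInfo>& cores, PenaltyFunc penalty)
{
	// Sort most-loaded first while computing the average load.
	std::sort(cores.begin(), cores.end(),
		[](const CoreInfo& a, const CoreInfo& b) {
			return a.loadAverage > b.loadAverage;
		});

	uint64_t total = 0;
	for (const CoreInfo& c : cores)
		total += c.loadAverage;
	uint32_t average = cores.empty() ? 0 : (uint32_t)(total / cores.size());

	// K = cores above the average; these are scanned for hog threads.
	size_t k = 0;
	while (k < cores.size() && cores[k].loadAverage > average)
		k++;

	// W = candidate destinations, taken from the least-loaded tail,
	// bounded by 2K so the pass stays cheap on many-core systems.
	size_t w = std::min(2 * k, cores.size() - k);

	std::vector<Migration> migrations;
	for (size_t i = 0; i < k; i++) {
		CoreInfo& source = cores[i];
		for (int thread : source.hogThreads) {
			// Best destination = smallest load average plus penalty,
			// and migrating must actually beat staying put.
			int best = -1;
			uint64_t bestCost = source.loadAverage;
			for (size_t j = cores.size() - w; j < cores.size(); j++) {
				uint64_t cost = cores[j].loadAverage
					+ penalty(source.id, cores[j].id);
				if (cost < bestCost) {
					bestCost = cost;
					best = (int)j;
				}
			}
			if (best >= 0)
				migrations.push_back({thread, source.id, cores[best].id});
		}
	}
	return migrations;
}
```

Note how the penalty term naturally implements the trade-off described above: a distant idle core still wins over a nearby busy one as long as its load plus the penalty stays below the source's load.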