Nick Piggin wrote:
> On Tuesday 26 August 2008 21:36, Gregory Haskins wrote:
>> Nick Piggin wrote:
>>> On Tuesday 26 August 2008 06:15, Gregory Haskins wrote:
>>>> git-id c4acb2c0669c5c5c9b28e9d02a34b5c67edf7092 attempted to limit
>>>> newidle critical section length by stopping after at least one task
>>>> was moved. Further investigation has shown that there are other
>>>> paths nested further inside the algorithm that still allow long
>>>> latencies to occur with newidle balancing. This patch applies the
>>>> same technique inside balance_tasks() to limit the duration of this
>>>> optional balancing operation.
>>>>
>>>> Signed-off-by: Gregory Haskins
>>>> CC: Nick Piggin
>>>>
>>> Hmm, this (and c4acb2c0669c5c5c9b28e9d02a34b5c67edf7092) still could
>>> increase the amount of work to do significantly for workloads where
>>> the CPU is going idle and pulling tasks over frequently. I don't
>>> really like either of them too much.
>>>
>> I had a feeling you may object to this patch based on your comments
>> on the first one. That's why I CC'd you, so you wouldn't think I was
>> trying to sneak something past ;)
>
> Appreciated.
>
>>> Maybe increasing the limit would effectively amortize most of the
>>> problem (say, limit to move 16 tasks at most).
>>>
>> The problem I was seeing was that even moving 2 was too many in the
>> ftrace traces I looked at. I think the idea of making a variable
>> limit (set via a sysctl, etc.) here is a good one, but I would
>> recommend we have the default be "1" for CONFIG_PREEMPT (or at least
>> CONFIG_PREEMPT_RT) based on what I know right now. I know last time
>> you objected to any kind of special cases for the preemptible
>> kernels, but I think this is a good compromise. Would this be
>> acceptable?
>
> Well I _prefer_ not to have a special case for preemptible kernels,
> but we already have similar arbitrary kinds of changes, like in tlb
> flushing, so...
>
> I understand and accept there are some places where fundamentally you
> have to trade latency for throughput, so at some point we have to
> have a config and/or sysctl for that.
>
> I'm surprised 2 is too much but 1 is OK. Seems pretty fragile to me.

It's not that 1 is magically "ok". It's simply that newidle balancing
hurts latency, and 1 is the minimum to pull while still reasonably
reducing the critical section. I already check whether we NEED_RESCHED
before taking the rq->lock in newidle, so waiting for one task to pull
is the first opportunity I have to end the section as quickly as
possible.

It would be nice if I could just keep going when I can detect that
there is no real contention. Let me give this angle some more thought.

> Are you just running insane tests that load up the runqueues heaps
> and test latency? -rt users will have to understand that some
> algorithms scale linearly or so with the number of a particular
> resource allocated, so they aren't going to get a constant low
> latency under arbitrary conditions.
>
> FWIW, if you haven't already, then for -rt you might want to look at
> a more advanced data structure than a simple run-ordered list for
> moving tasks from one rq to the other. A simple one I was looking at
> is a time-ordered list to pull the most cache-cold tasks (and thus we
> can stop searching when we encounter the first cache-hot task, in
> situations where it is appropriate, etc).

I'm not sure I follow your point, but if I do: note that the RT
scheduler uses a completely different load balancer (which is priority
ordered).

> Anyway... yeah I'm OK with this if it is under a config option.

Cool.. See v2 ;)

Thanks Nick,
-Greg
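[Editor's sketch] The limit Greg describes (pull at most N tasks during
a newidle balance, with a default of 1 on a preempt-oriented build)
boils down to breaking out of the pull loop early so the rq->lock
critical section stays short. The user-space C sketch below shows that
shape only; the types and names (struct runqueue, pull_tasks(),
pull_limit) are hypothetical stand-ins and are not the kernel's actual
balance_tasks() implementation.

/*
 * Minimal, self-contained illustration of a bounded task pull.
 * All names here are hypothetical; this is not kernel code.
 */
#include <stdio.h>

struct task {
	int id;
	struct task *next;
};

struct runqueue {
	struct task *head;	/* singly linked list of runnable tasks */
};

/*
 * Move up to 'pull_limit' tasks from 'busiest' to 'idle' (a stand-in
 * for the config/sysctl default of 1 discussed in the thread).
 * Returns the number of tasks actually moved.
 */
static int pull_tasks(struct runqueue *idle, struct runqueue *busiest,
		      int pull_limit)
{
	int moved = 0;

	while (busiest->head && moved < pull_limit) {
		struct task *t = busiest->head;

		/* detach from the busy queue ... */
		busiest->head = t->next;

		/* ... and attach to the newly idle queue */
		t->next = idle->head;
		idle->head = t;
		moved++;
	}
	return moved;
}

int main(void)
{
	struct task t3 = { 3, NULL };
	struct task t2 = { 2, &t3 };
	struct task t1 = { 1, &t2 };
	struct runqueue busiest = { &t1 };
	struct runqueue idle = { NULL };

	/* default of 1 for a preempt-oriented build, per the thread */
	printf("moved %d task(s)\n", pull_tasks(&idle, &busiest, 1));
	return 0;
}

The point of the small limit is exactly what Greg argues above: the
work bounded by the loop happens with the runqueue lock held, so the
limit directly bounds the length of that critical section.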
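[Editor's sketch] Nick's alternative suggestion, a time-ordered list so
the puller takes the most cache-cold tasks first and stops scanning at
the first cache-hot one, can be sketched the same way. The cache-hot
threshold, field names, and layout below are illustrative assumptions,
not the kernel's actual migration data structures.

#include <stdio.h>

#define CACHE_HOT_NS 1000000ULL	/* illustrative "cache hot" cutoff */

struct mig_task {
	int id;
	unsigned long long last_ran;	/* timestamp of last execution */
};

/*
 * 'tasks' is assumed to be sorted oldest-first (most cache cold
 * first). Pull until we hit the limit or reach a task that ran
 * recently enough to still be cache hot; every later task is hotter,
 * so the scan can stop early.
 */
static int pull_cold_tasks(const struct mig_task *tasks, int nr,
			   unsigned long long now, int limit)
{
	int moved = 0;
	int i;

	for (i = 0; i < nr && moved < limit; i++) {
		if (now - tasks[i].last_ran < CACHE_HOT_NS)
			break;	/* first cache-hot task: stop searching */
		printf("pulling task %d\n", tasks[i].id);
		moved++;
	}
	return moved;
}

int main(void)
{
	const struct mig_task rq[] = {
		{ 1, 1000 }, { 2, 500000 }, { 3, 9900000 },
	};
	unsigned long long now = 10000000ULL;

	printf("moved %d task(s)\n", pull_cold_tasks(rq, 3, now, 16));
	return 0;
}

As Greg notes, the -rt priority-ordered push/pull balancer is a
separate mechanism, so a sketch like this would only apply to the
load-based (CFS-side) balancing paths being discussed here.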