* newidle balancing in NUMA domain? @ 2009-11-23 11:22 Nick Piggin 2009-11-23 11:36 ` Peter Zijlstra 2009-11-23 14:37 ` Mike Galbraith 0 siblings, 2 replies; 30+ messages in thread From: Nick Piggin @ 2009-11-23 11:22 UTC (permalink / raw) To: Linux Kernel Mailing List, Ingo Molnar, Peter Zijlstra Hi, I wonder why it was decided to do newidle balancing in the NUMA domain? And with newidle_idx == 0 at that. This means that every time the CPU goes idle, every CPU in the system gets a remote cacheline or two hit. Not very nice O(n^2) behaviour on the interconnect. Not to mention thrashing our NUMA locality. And then I see some proposal to do ratelimiting of newidle balancing :( Seems like hack upon hack making behaviour much more complex. One "symptom" of bad mutex contention can be that increasing the balancing rate can help a bit to reduce idle time (because it can get the woken thread which is holding a semaphore to run ASAP after we run out of runnable tasks in the system due to them hitting contention on that semaphore). I really hope this change wasn't done in order to help -rt or something sad like sysbench on MySQL. And btw, I'll stay out of mentioning anything about CFS development, but it really sucks to be continually making significant changes to domains balancing *and* per-runqueue scheduling at the same time :( It makes it even more difficult to bisect things. Thanks, Nick ^ permalink raw reply [flat|nested] 30+ messages in thread
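[Editorial sketch, not kernel code; the function name below is ours.] The O(n^2) argument Nick makes can be counted directly: with SD_BALANCE_NEWIDLE at the NUMA level, each CPU that goes idle walks every other CPU's runqueue to read its load, so one newidle event per CPU costs n * (n - 1) remote cacheline fetches system-wide.

```c
#include <assert.h>

/* Count remote runqueue reads if every CPU takes one newidle event
 * and each event scans all other CPUs' load fields. */
long newidle_remote_reads(int ncpus)
{
	long reads = 0;
	int idler, cpu;

	for (idler = 0; idler < ncpus; idler++)		/* one newidle event per CPU */
		for (cpu = 0; cpu < ncpus; cpu++)
			if (cpu != idler)
				reads++;		/* remote rq load fetch */
	return reads;
}
```

For 64 CPUs that is already 4032 remote reads per round of idle events, which is the interconnect cost being objected to.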
* Re: newidle balancing in NUMA domain? 2009-11-23 11:22 newidle balancing in NUMA domain? Nick Piggin @ 2009-11-23 11:36 ` Peter Zijlstra 2009-11-23 11:43 ` Nick Piggin 2009-11-23 11:45 ` Ingo Molnar 2009-11-23 14:37 ` Mike Galbraith 1 sibling, 2 replies; 30+ messages in thread From: Peter Zijlstra @ 2009-11-23 11:36 UTC (permalink / raw) To: Nick Piggin; +Cc: Linux Kernel Mailing List, Ingo Molnar On Mon, 2009-11-23 at 12:22 +0100, Nick Piggin wrote: > Hi, > > I wonder why it was decided to do newidle balancing in the NUMA > domain? And with newidle_idx == 0 at that. > > This means that every time the CPU goes idle, every CPU in the > system gets a remote cacheline or two hit. Not very nice O(n^2) > behaviour on the interconnect. Not to mention trashing our > NUMA locality. > > And then I see some proposal to do ratelimiting of newidle > balancing :( Seems like hack upon hack making behaviour much more > complex. > > One "symptom" of bad mutex contention can be that increasing the > balancing rate can help a bit to reduce idle time (because it > can get the woken thread which is holding a semaphore to run ASAP > after we run out of runnable tasks in the system due to them > hitting contention on that semaphore). > > I really hope this change wasn't done in order to help -rt or > something sad like sysbench on MySQL. IIRC this was kbuild and other spreading workloads that want this. the newidle_idx=0 thing is because I frequently saw it make funny balance decisions based on old load numbers, like f_b_g() selecting a group that didn't even have any tasks in it anymore. We went without newidle for a while, but then people started complaining about that kbuild time, and there is an x264 encoder thing that loses tons of throughput.
* Re: newidle balancing in NUMA domain? 2009-11-23 11:36 ` Peter Zijlstra @ 2009-11-23 11:43 ` Nick Piggin 2009-11-23 11:50 ` Peter Zijlstra 2009-11-23 11:45 ` Ingo Molnar 1 sibling, 1 reply; 30+ messages in thread From: Nick Piggin @ 2009-11-23 11:43 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Linux Kernel Mailing List, Ingo Molnar On Mon, Nov 23, 2009 at 12:36:15PM +0100, Peter Zijlstra wrote: > On Mon, 2009-11-23 at 12:22 +0100, Nick Piggin wrote: > > Hi, > > > > I wonder why it was decided to do newidle balancing in the NUMA > > domain? And with newidle_idx == 0 at that. > > > > This means that every time the CPU goes idle, every CPU in the > > system gets a remote cacheline or two hit. Not very nice O(n^2) > > behaviour on the interconnect. Not to mention trashing our > > NUMA locality. > > > > And then I see some proposal to do ratelimiting of newidle > > balancing :( Seems like hack upon hack making behaviour much more > > complex. > > > > One "symptom" of bad mutex contention can be that increasing the > > balancing rate can help a bit to reduce idle time (because it > > can get the woken thread which is holding a semaphore to run ASAP > > after we run out of runnable tasks in the system due to them > > hitting contention on that semaphore). > > > > I really hope this change wasn't done in order to help -rt or > > something sad like sysbench on MySQL. > > IIRC this was kbuild and other spreading workloads that want this. > > the newidle_idx=0 thing is because I frequently saw it make funny > balance decisions based on old load numbers, like f_b_g() selecting a > group that didn't even have tasks in anymore. Well it is just a damping factor on runqueue fluctuations. If the group recently had load then the point of the idx is to account for this. On the other hand, if we have other groups that are also above the idx damped average, it would make sense to use them instead. (ie. cull source groups with no pullable tasks). 
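[Editorial sketch.] The damping Nick refers to is the per-runqueue cpu_load[] array, which the 2.6.x kernels update once per scheduler tick; index i is an exponential average with decay factor 2^i, so higher indexes react more slowly. Simplified from the kernel's update_cpu_load() (the real code also rounds the new load up and the array size is fixed by the topology code), it looks roughly like:

```c
#include <assert.h>

#define CPU_LOAD_IDX_MAX 5

unsigned long cpu_load[CPU_LOAD_IDX_MAX];

/* Called once per tick with the instantaneous runqueue load.
 * cpu_load[0] is the raw value; each higher index blends the old
 * average with the new sample ever more conservatively, which is
 * what filters out transient spikes and dips. */
void update_cpu_load(unsigned long this_load)
{
	int i;

	for (i = 0; i < CPU_LOAD_IDX_MAX; i++) {
		unsigned long scale = 1UL << i;	/* decay factor 2^i */

		/* new = old * (scale-1)/scale + this_load/scale */
		cpu_load[i] = (cpu_load[i] * (scale - 1) + this_load) >> i;
	}
}
```

A balancer that reads cpu_load[newidle_idx] with newidle_idx > 0 therefore sees load that a group "recently had", which is exactly the stale-f_b_g() effect Peter complains about and the fluctuation damping Nick defends.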
> We went without newidle for a while, but then people started complaining > about that kbuild time, and there is a x264 encoder thing that looses > tons of throughput. So... these were due to what? Other changes in domains balancing? Changes in CFS? Something else? Or were they comparisons versus other operating systems?
* Re: newidle balancing in NUMA domain? 2009-11-23 11:43 ` Nick Piggin @ 2009-11-23 11:50 ` Peter Zijlstra 2009-11-23 12:16 ` Nick Piggin 0 siblings, 1 reply; 30+ messages in thread From: Peter Zijlstra @ 2009-11-23 11:50 UTC (permalink / raw) To: Nick Piggin; +Cc: Linux Kernel Mailing List, Ingo Molnar On Mon, 2009-11-23 at 12:43 +0100, Nick Piggin wrote: > On Mon, Nov 23, 2009 at 12:36:15PM +0100, Peter Zijlstra wrote: > > IIRC this was kbuild and other spreading workloads that want this. > > > > the newidle_idx=0 thing is because I frequently saw it make funny > > balance decisions based on old load numbers, like f_b_g() selecting a > > group that didn't even have tasks in anymore. > > Well it is just a damping factor on runqueue flucturations. If the > group recently had load then the point of the idx is to account > for this. On the other hand, if we have other groups that are also > above the idx damped average, it would make sense to use them > instead. (ie. cull source groups with no pullable tasks). Right, thing is, I'm still catching up from being gone, and haven't actually read and thought through the whole rate-limiting thing :-( If you see a better way to accomplish things, please holler. > > We went without newidle for a while, but then people started complaining > > about that kbuild time, and there is a x264 encoder thing that looses > > tons of throughput. > > So... these were due to what? Other changes in domains balancing? > Changes in CFS? Something else? Or were they comparisons versus > other operating systems? Comparison to Con's latest single-rq, spread-like-there's-no-cache-affinity BFS thing.
* Re: newidle balancing in NUMA domain? 2009-11-23 11:50 ` Peter Zijlstra @ 2009-11-23 12:16 ` Nick Piggin 0 siblings, 0 replies; 30+ messages in thread From: Nick Piggin @ 2009-11-23 12:16 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Linux Kernel Mailing List, Ingo Molnar On Mon, Nov 23, 2009 at 12:50:45PM +0100, Peter Zijlstra wrote: > On Mon, 2009-11-23 at 12:43 +0100, Nick Piggin wrote: > > On Mon, Nov 23, 2009 at 12:36:15PM +0100, Peter Zijlstra wrote: > > > > IIRC this was kbuild and other spreading workloads that want this. > > > > > > the newidle_idx=0 thing is because I frequently saw it make funny > > > balance decisions based on old load numbers, like f_b_g() selecting a > > > group that didn't even have tasks in anymore. > > > > Well it is just a damping factor on runqueue flucturations. If the > > group recently had load then the point of the idx is to account > > for this. On the other hand, if we have other groups that are also > > above the idx damped average, it would make sense to use them > > instead. (ie. cull source groups with no pullable tasks). > > Right, thing is, I'm still catching up from being gone, and haven't > actually read and tought through the whole rate-limiting thing :-( > > If you see a better way to accomplish things, please holler. Well, not with adding another hack on top of newidle because it is too aggressive because it was turned on where it shouldn't be :) Within an LLC we really want to be quite aggressive at moving tasks around in order to minimise idle time. *Perhaps* another type of balancing mode which is rate limited. However... > > > We went without newidle for a while, but then people started complaining > > > about that kbuild time, and there is a x264 encoder thing that looses > > > tons of throughput. > > > > So... these were due to what? Other changes in domains balancing? > > Changes in CFS? Something else? Or were they comparisons versus > > other operating systems? 
> > Comparison to Con's latest single-rq spread like there's no cache > affinity BFS thing. Seems like a bit of a knee-jerk reaction here. sched domains balancing seems to have been _relatively_ good for quite a long time without lots of changes. And lots of users (distros, apparently google) still haven't got a handle on CFS performance problems so I would really have preferred to introduce such extra changes far, far more slowly. (ie. wait until CFS is more sorted and at least gone through some maturity and deployment in distros). I can easily make test cases where performance goes up by actually some orders of magnitude by making the balancing totally aggressive over all domains. And actually we even have some report of a (closed source) java app having slowdowns due to this. Thing is, such workloads, if they require such frequent cross-numa-node balancing, are maybe not really scalable anyway. kbuild -- this is frequent communications, short-lived tasks, so it is a good case for more aggressive balancing at the expense of NUMA affinity. But this is probably further over to that side than a lot of other workloads, so if our defaults aren't the *best* for kbuild (ie. not quite aggressive enough) then I think that is probably a good sign. How large was the kbuild improvement? Was O(1) tried as well?
* Re: newidle balancing in NUMA domain? 2009-11-23 11:36 ` Peter Zijlstra 2009-11-23 11:43 ` Nick Piggin @ 2009-11-23 11:45 ` Ingo Molnar 2009-11-23 12:01 ` Nick Piggin 1 sibling, 1 reply; 30+ messages in thread From: Ingo Molnar @ 2009-11-23 11:45 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Nick Piggin, Linux Kernel Mailing List * Peter Zijlstra <peterz@infradead.org> wrote: > On Mon, 2009-11-23 at 12:22 +0100, Nick Piggin wrote: > > Hi, > > > > I wonder why it was decided to do newidle balancing in the NUMA > > domain? And with newidle_idx == 0 at that. > > > > This means that every time the CPU goes idle, every CPU in the > > system gets a remote cacheline or two hit. Not very nice O(n^2) > > behaviour on the interconnect. Not to mention trashing our > > NUMA locality. > > > > And then I see some proposal to do ratelimiting of newidle > > balancing :( Seems like hack upon hack making behaviour much more > > complex. > > > > One "symptom" of bad mutex contention can be that increasing the > > balancing rate can help a bit to reduce idle time (because it > > can get the woken thread which is holding a semaphore to run ASAP > > after we run out of runnable tasks in the system due to them > > hitting contention on that semaphore). > > > > I really hope this change wasn't done in order to help -rt or > > something sad like sysbench on MySQL. > > IIRC this was kbuild and other spreading workloads that want this. > > the newidle_idx=0 thing is because I frequently saw it make funny > balance decisions based on old load numbers, like f_b_g() selecting a > group that didn't even have tasks in anymore. > > We went without newidle for a while, but then people started > complaining about that kbuild time, and there is a x264 encoder thing > that looses tons of throughput. Yep, i too reacted in a similar way to Nick initially - but i think you are right, we really want good, precise metrics and want to be optional/fuzzy in our balancing _decisions_, not in our metrics. 
Ingo
* Re: newidle balancing in NUMA domain? 2009-11-23 11:45 ` Ingo Molnar @ 2009-11-23 12:01 ` Nick Piggin 2009-11-23 12:08 ` Ingo Molnar 0 siblings, 1 reply; 30+ messages in thread From: Nick Piggin @ 2009-11-23 12:01 UTC (permalink / raw) To: Ingo Molnar; +Cc: Peter Zijlstra, Linux Kernel Mailing List On Mon, Nov 23, 2009 at 12:45:50PM +0100, Ingo Molnar wrote: > > * Peter Zijlstra <peterz@infradead.org> wrote: > > > On Mon, 2009-11-23 at 12:22 +0100, Nick Piggin wrote: > > > Hi, > > > > > > I wonder why it was decided to do newidle balancing in the NUMA > > > domain? And with newidle_idx == 0 at that. > > > > > > This means that every time the CPU goes idle, every CPU in the > > > system gets a remote cacheline or two hit. Not very nice O(n^2) > > > behaviour on the interconnect. Not to mention trashing our > > > NUMA locality. > > > > > > And then I see some proposal to do ratelimiting of newidle > > > balancing :( Seems like hack upon hack making behaviour much more > > > complex. > > > > > > One "symptom" of bad mutex contention can be that increasing the > > > balancing rate can help a bit to reduce idle time (because it > > > can get the woken thread which is holding a semaphore to run ASAP > > > after we run out of runnable tasks in the system due to them > > > hitting contention on that semaphore). > > > > > > I really hope this change wasn't done in order to help -rt or > > > something sad like sysbench on MySQL. > > > > IIRC this was kbuild and other spreading workloads that want this. > > > > the newidle_idx=0 thing is because I frequently saw it make funny > > balance decisions based on old load numbers, like f_b_g() selecting a > > group that didn't even have tasks in anymore. > > > > We went without newidle for a while, but then people started > > complaining about that kbuild time, and there is a x264 encoder thing > > that looses tons of throughput. 
> > Yep, i too reacted in a similar way to Nick initially - but i think you > are right, we really want good, precise metrics and want to be > optional/fuzzy in our balancing _decisions_, not in our metrics. Well to be fair, the *decision* is to use a longer-term weight for the runqueue to reduce balancing (seeing as we naturally do far more balancing on such conditions, which means that we tend to look at our instant runqueue weight when it is 0).
* Re: newidle balancing in NUMA domain? 2009-11-23 12:01 ` Nick Piggin @ 2009-11-23 12:08 ` Ingo Molnar 2009-11-23 12:27 ` Nick Piggin 0 siblings, 1 reply; 30+ messages in thread From: Ingo Molnar @ 2009-11-23 12:08 UTC (permalink / raw) To: Nick Piggin; +Cc: Peter Zijlstra, Linux Kernel Mailing List * Nick Piggin <npiggin@suse.de> wrote: > On Mon, Nov 23, 2009 at 12:45:50PM +0100, Ingo Molnar wrote: > > > > * Peter Zijlstra <peterz@infradead.org> wrote: > > > > > On Mon, 2009-11-23 at 12:22 +0100, Nick Piggin wrote: > > > > Hi, > > > > > > > > I wonder why it was decided to do newidle balancing in the NUMA > > > > domain? And with newidle_idx == 0 at that. > > > > > > > > This means that every time the CPU goes idle, every CPU in the > > > > system gets a remote cacheline or two hit. Not very nice O(n^2) > > > > behaviour on the interconnect. Not to mention trashing our > > > > NUMA locality. > > > > > > > > And then I see some proposal to do ratelimiting of newidle > > > > balancing :( Seems like hack upon hack making behaviour much more > > > > complex. > > > > > > > > One "symptom" of bad mutex contention can be that increasing the > > > > balancing rate can help a bit to reduce idle time (because it > > > > can get the woken thread which is holding a semaphore to run ASAP > > > > after we run out of runnable tasks in the system due to them > > > > hitting contention on that semaphore). > > > > > > > > I really hope this change wasn't done in order to help -rt or > > > > something sad like sysbench on MySQL. > > > > > > IIRC this was kbuild and other spreading workloads that want this. > > > > > > the newidle_idx=0 thing is because I frequently saw it make funny > > > balance decisions based on old load numbers, like f_b_g() selecting a > > > group that didn't even have tasks in anymore. 
> > > We went without newidle for a while, but then people started > > > complaining about that kbuild time, and there is a x264 encoder thing > > > that looses tons of throughput. > > > > Yep, i too reacted in a similar way to Nick initially - but i think you > > are right, we really want good, precise metrics and want to be > > optional/fuzzy in our balancing _decisions_, not in our metrics. > > Well to be fair, the *decision* is to use a longer-term weight for the > runqueue to reduce balancing (seeing as we naturally do far more > balancing on conditions means that we tend to look at our instant > runqueue weight when it is 0). Well, the problem with that is that it uses a potentially outdated piece of metric - and that can become visible if balancing events are rare enough. I.e. we do need a time scale (rate of balancing) to be able to do this correctly on a statistical level - which pretty much brings in 'rate limit' kind of logic. We are better off observing reality precisely and then saying "don't do this action" instead of fuzzing our metrics [or using fuzzy metrics conditionally - which is really the same] and hoping that in the end it will be as if we didn't make certain decisions. (I hope i explained my point clearly enough.) No argument that it could be done cleaner - the duality right now of both having the fuzzy stats and the rate limiting should be decided one way or another. Also, no argument that if you can measure bad effects from this change on any workload we need to look at that and fix it. Thanks, Ingo
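[Editorial sketch; the struct and function names are ours, not from the actual rate-limit patch being discussed.] The "observe precisely, then decline to act" shape Ingo describes amounts to keeping the load metrics exact while simply refusing to run a newidle balance if one ran too recently:

```c
#include <assert.h>

struct rq_sketch {
	unsigned long next_newidle;	/* earliest time (jiffies) to balance again */
};

/* Return 1 if a newidle balance may run now, 0 if one ran too
 * recently.  The signed subtraction handles counter wraparound the
 * same way the kernel's time_before() does. */
int newidle_balance_allowed(struct rq_sketch *rq,
			    unsigned long now, unsigned long interval)
{
	if ((long)(now - rq->next_newidle) < 0)
		return 0;			/* too soon: skip, stay idle */
	rq->next_newidle = now + interval;
	return 1;
}
```

The metrics stay precise; only the *decision* is throttled, which is the distinction Ingo draws against fuzzing the metrics themselves.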
* Re: newidle balancing in NUMA domain? 2009-11-23 12:08 ` Ingo Molnar @ 2009-11-23 12:27 ` Nick Piggin 2009-11-23 12:46 ` Ingo Molnar 0 siblings, 1 reply; 30+ messages in thread From: Nick Piggin @ 2009-11-23 12:27 UTC (permalink / raw) To: Ingo Molnar; +Cc: Peter Zijlstra, Linux Kernel Mailing List On Mon, Nov 23, 2009 at 01:08:49PM +0100, Ingo Molnar wrote: > > * Nick Piggin <npiggin@suse.de> wrote: > > > On Mon, Nov 23, 2009 at 12:45:50PM +0100, Ingo Molnar wrote: > > Well to be fair, the *decision* is to use a longer-term weight for the > > runqueue to reduce balancing (seeing as we naturally do far more > > balancing on conditions means that we tend to look at our instant > > runqueue weight when it is 0). > > Well, the problem with that is that it uses a potentially outdated piece > of metric - and that can become visible if balancing events are rare > enough. It shouldn't, it happens on the local CPU, at each scheduler tick. > I.e. we do need a time scale (rate of balancing) to be able to do this > correctly on a statistical level - which pretty much brings in 'rate > limit' kind of logic. > > We are better off observing reality precisely and then saying "dont do > this action" instead of fuzzing our metrics [or using fuzzy metrics > conditionally - which is really the same] and hoping that in the end it > will be as if we didnt do certain decisions. We do though. We take the instant value and the weighted value and take the min or max (depending on source or destination). And we decide to do that because the instantaneous fluctuations in runqueue load can be far, far outside even a normal short-term operating condition. I don't know how you would otherwise propose to damp those fluctuations that don't actually require any balancing. Rate limiting doesn't get rid of those at all, it just does a bit less frequent balancing. But on the NUMA domain, memory affinity has been destroyed whether a task is moved once or 100 times. 
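[Editorial sketch.] The min/max Nick mentions corresponds to the kernel's source_load()/target_load() pair; schematically (plain scalar arguments here instead of a CPU index and load type):

```c
#include <assert.h>

/* Believe the *smaller* of the instantaneous and damped load for a CPU
 * we might pull from: biases against pulling on a transient spike. */
unsigned long source_load(unsigned long instant, unsigned long damped)
{
	return instant < damped ? instant : damped;
}

/* Believe the *larger* of the two for a CPU we might push to: biases
 * against pushing into a transient dip. */
unsigned long target_load(unsigned long instant, unsigned long damped)
{
	return instant > damped ? instant : damped;
}
```

A migration then needs both the instantaneous and the damped view to agree that the imbalance is real, so pure runqueue fluctuations cancel out without any rate limiting.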
> (I hope i explained my point clearly enough.) > > No argument that it could be done cleaner - the duality right now of > both having the fuzzy stats and the rate limiting should be decided > one way or another. Well, I would say please keep domain balancing behaviour at least somewhat close to how it was with O(1) scheduler at least until CFS is more sorted out. There is no need to knee-jerk because BFS is better at something. We can certainly look at making improvements and take cues from BFS to use in sched domains, but it is so easy to introduce regressions, and also regressions versus previous kernels are much more important than slowdowns versus an out-of-tree patch. So while CFS is still going through troubles I think it is much better to slow down on sched-domains changes. Thanks,
* Re: newidle balancing in NUMA domain? 2009-11-23 12:27 ` Nick Piggin @ 2009-11-23 12:46 ` Ingo Molnar 2009-11-24 6:36 ` Nick Piggin 0 siblings, 1 reply; 30+ messages in thread From: Ingo Molnar @ 2009-11-23 12:46 UTC (permalink / raw) To: Nick Piggin; +Cc: Peter Zijlstra, Linux Kernel Mailing List * Nick Piggin <npiggin@suse.de> wrote: > > (I hope i explained my point clearly enough.) > > > > No argument that it could be done cleaner - the duality right now of > > both having the fuzzy stats and the rate limiting should be decided > > one way or another. > > Well, I would say please keep domain balancing behaviour at least > somewhat close to how it was with O(1) scheduler at least until CFS is > more sorted out. There is no need to knee jerk because BFS is better > at something. Well, see the change (commit 0ec9fab3d) attached below - even in hindsight it was well argued by Mike with quite a few numbers: +68% in x264 performance. If you think it's a bad change (combined with newidle-rate-limit) then please show the numbers that counter it and preferably extend 'perf bench sched' with the testcases that you care about most. IIRC (and Peter mentioned this too) Mike got into the whole kbuild angle due to comparing mainline scheduler to BFS - but that's really a positive thing IMO, not some "knee-jerk reaction". 
Ingo -----------------------> >From 0ec9fab3d186d9cbb00c0f694d4a260d07c198d9 Mon Sep 17 00:00:00 2001 From: Mike Galbraith <efault@gmx.de> Date: Tue, 15 Sep 2009 15:07:03 +0200 Subject: [PATCH] sched: Improve latencies and throughput Make the idle balancer more agressive, to improve a x264 encoding workload provided by Jason Garrett-Glaser: NEXT_BUDDY NO_LB_BIAS encoded 600 frames, 252.82 fps, 22096.60 kb/s encoded 600 frames, 250.69 fps, 22096.60 kb/s encoded 600 frames, 245.76 fps, 22096.60 kb/s NO_NEXT_BUDDY LB_BIAS encoded 600 frames, 344.44 fps, 22096.60 kb/s encoded 600 frames, 346.66 fps, 22096.60 kb/s encoded 600 frames, 352.59 fps, 22096.60 kb/s NO_NEXT_BUDDY NO_LB_BIAS encoded 600 frames, 425.75 fps, 22096.60 kb/s encoded 600 frames, 425.45 fps, 22096.60 kb/s encoded 600 frames, 422.49 fps, 22096.60 kb/s Peter pointed out that this is better done via newidle_idx, not via LB_BIAS, newidle balancing should look for where there is load _now_, not where there was load 2 ticks ago. Worst-case latencies are improved as well as no buddies means less vruntime spread. (as per prior lkml discussions) This change improves kbuild-peak parallelism as well. 
Reported-by: Jason Garrett-Glaser <darkshikari@gmail.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1253011667.9128.16.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- arch/ia64/include/asm/topology.h | 5 +++-- arch/powerpc/include/asm/topology.h | 2 +- arch/sh/include/asm/topology.h | 3 ++- arch/x86/include/asm/topology.h | 4 +--- include/linux/topology.h | 2 +- kernel/sched_features.h | 2 +- 6 files changed, 9 insertions(+), 9 deletions(-) diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h index 47f3c51..42f1673 100644 --- a/arch/ia64/include/asm/topology.h +++ b/arch/ia64/include/asm/topology.h @@ -61,7 +61,7 @@ void build_cpu_to_node_map(void); .cache_nice_tries = 2, \ .busy_idx = 2, \ .idle_idx = 1, \ - .newidle_idx = 2, \ + .newidle_idx = 0, \ .wake_idx = 0, \ .forkexec_idx = 1, \ .flags = SD_LOAD_BALANCE \ @@ -87,10 +87,11 @@ void build_cpu_to_node_map(void); .cache_nice_tries = 2, \ .busy_idx = 3, \ .idle_idx = 2, \ - .newidle_idx = 2, \ + .newidle_idx = 0, \ .wake_idx = 0, \ .forkexec_idx = 1, \ .flags = SD_LOAD_BALANCE \ + | SD_BALANCE_NEWIDLE \ | SD_BALANCE_EXEC \ | SD_BALANCE_FORK \ | SD_BALANCE_WAKE \ diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index a6b220a..1a2c9eb 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -57,7 +57,7 @@ static inline int pcibus_to_node(struct pci_bus *bus) .cache_nice_tries = 1, \ .busy_idx = 3, \ .idle_idx = 1, \ - .newidle_idx = 2, \ + .newidle_idx = 0, \ .wake_idx = 0, \ .flags = SD_LOAD_BALANCE \ | SD_BALANCE_EXEC \ diff --git a/arch/sh/include/asm/topology.h b/arch/sh/include/asm/topology.h index 9054e5c..c843677 100644 --- a/arch/sh/include/asm/topology.h +++ b/arch/sh/include/asm/topology.h @@ -15,13 +15,14 @@ .cache_nice_tries = 2, \ .busy_idx = 3, \ .idle_idx = 2, \ - .newidle_idx = 2, \ + .newidle_idx = 
0, \ .wake_idx = 0, \ .forkexec_idx = 1, \ .flags = SD_LOAD_BALANCE \ | SD_BALANCE_FORK \ | SD_BALANCE_EXEC \ | SD_BALANCE_WAKE \ + | SD_BALANCE_NEWIDLE \ | SD_SERIALIZE, \ .last_balance = jiffies, \ .balance_interval = 1, \ diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index 4b1b335..7fafd1b 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -116,14 +116,12 @@ extern unsigned long node_remap_size[]; # define SD_CACHE_NICE_TRIES 1 # define SD_IDLE_IDX 1 -# define SD_NEWIDLE_IDX 2 # define SD_FORKEXEC_IDX 0 #else # define SD_CACHE_NICE_TRIES 2 # define SD_IDLE_IDX 2 -# define SD_NEWIDLE_IDX 2 # define SD_FORKEXEC_IDX 1 #endif @@ -137,7 +135,7 @@ extern unsigned long node_remap_size[]; .cache_nice_tries = SD_CACHE_NICE_TRIES, \ .busy_idx = 3, \ .idle_idx = SD_IDLE_IDX, \ - .newidle_idx = SD_NEWIDLE_IDX, \ + .newidle_idx = 0, \ .wake_idx = 0, \ .forkexec_idx = SD_FORKEXEC_IDX, \ \ diff --git a/include/linux/topology.h b/include/linux/topology.h index c87edcd..4298745 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -151,7 +151,7 @@ int arch_update_cpu_topology(void); .cache_nice_tries = 1, \ .busy_idx = 2, \ .idle_idx = 1, \ - .newidle_idx = 2, \ + .newidle_idx = 0, \ .wake_idx = 0, \ .forkexec_idx = 1, \ \ diff --git a/kernel/sched_features.h b/kernel/sched_features.h index 891ea0f..e98c2e8 100644 --- a/kernel/sched_features.h +++ b/kernel/sched_features.h @@ -67,7 +67,7 @@ SCHED_FEAT(AFFINE_WAKEUPS, 1) * wakeup-preemption), since its likely going to consume data we * touched, increases cache locality. */ -SCHED_FEAT(NEXT_BUDDY, 1) +SCHED_FEAT(NEXT_BUDDY, 0) /* * Prefer to schedule the task that ran last (when we did ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: newidle balancing in NUMA domain? 2009-11-23 12:46 ` Ingo Molnar @ 2009-11-24 6:36 ` Nick Piggin 2009-11-24 17:24 ` Jason Garrett-Glaser 0 siblings, 1 reply; 30+ messages in thread From: Nick Piggin @ 2009-11-24 6:36 UTC (permalink / raw) To: Ingo Molnar; +Cc: Peter Zijlstra, Linux Kernel Mailing List On Mon, Nov 23, 2009 at 01:46:15PM +0100, Ingo Molnar wrote: > > * Nick Piggin <npiggin@suse.de> wrote: > > > > (I hope i explained my point clearly enough.) > > > > > > No argument that it could be done cleaner - the duality right now of > > > both having the fuzzy stats and the rate limiting should be decided > > > one way or another. > > > > Well, I would say please keep domain balancing behaviour at least > > somewhat close to how it was with O(1) scheduler at least until CFS is > > more sorted out. There is no need to knee jerk because BFS is better > > at something. > > Well, see the change (commit 0ec9fab3d) attached below - even in > hindsight it was well argued by Mike with quite a few numbers: +68% in > x264 performance. Quite a few being one test case, and on a program with a horrible parallelism design (rapid heavyweight forks to distribute small units of work). > If you think it's a bad change (combined with newidle-rate-limit) then > please show the numbers that counter it and preferably extend 'perf > bench sched' with the testcases that you care about most. > > IIRC (and Peter mentioned this too) Mike got into the whole kbuild angle > due to comparing mainline scheduler to BFS - but that's really a > positive thing IMO, not some "knee-jerk reaction". It is a knee-jerk "we have to be better than BFS" reaction. It is quite obvious that vastly increasing inter-node balancing and remote cacheline touches is not a good idea (unless it helps _reasonably_ designed programs without unduly hurting others). And when you are changing all this desktop interactivity stuff, I would ask please not to make these kinds of "improvements" to domains balancing. 
I don't think that is too much to ask given the track record of scheduler problems. Finally, I didn't realise I would have to be responsible for shooting down every "ooh shiny" change that gets pushed into the scheduler. I would have hoped they are ensured to be rather well thought out and reviewed and tested first. I do think it is a bad change, for extensive reasons I explained. It will obviously cause regressions on workloads with lots of newidle events that are un-balanceable (which would include _well tuned_ network and database servers). And it does. I thought you'd have been aware of that. http://patchwork.kernel.org/patch/58931/ > > Ingo > > -----------------------> > >From 0ec9fab3d186d9cbb00c0f694d4a260d07c198d9 Mon Sep 17 00:00:00 2001 > From: Mike Galbraith <efault@gmx.de> > Date: Tue, 15 Sep 2009 15:07:03 +0200 > Subject: [PATCH] sched: Improve latencies and throughput > > Make the idle balancer more agressive, to improve a > x264 encoding workload provided by Jason Garrett-Glaser: > > NEXT_BUDDY NO_LB_BIAS > encoded 600 frames, 252.82 fps, 22096.60 kb/s > encoded 600 frames, 250.69 fps, 22096.60 kb/s > encoded 600 frames, 245.76 fps, 22096.60 kb/s > > NO_NEXT_BUDDY LB_BIAS > encoded 600 frames, 344.44 fps, 22096.60 kb/s > encoded 600 frames, 346.66 fps, 22096.60 kb/s > encoded 600 frames, 352.59 fps, 22096.60 kb/s > > NO_NEXT_BUDDY NO_LB_BIAS > encoded 600 frames, 425.75 fps, 22096.60 kb/s > encoded 600 frames, 425.45 fps, 22096.60 kb/s > encoded 600 frames, 422.49 fps, 22096.60 kb/s > > Peter pointed out that this is better done via newidle_idx, > not via LB_BIAS, newidle balancing should look for where > there is load _now_, not where there was load 2 ticks ago. > > Worst-case latencies are improved as well as no buddies > means less vruntime spread. (as per prior lkml discussions) > > This change improves kbuild-peak parallelism as well. 
> > Reported-by: Jason Garrett-Glaser <darkshikari@gmail.com> > Signed-off-by: Mike Galbraith <efault@gmx.de> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> > LKML-Reference: <1253011667.9128.16.camel@marge.simson.net> > Signed-off-by: Ingo Molnar <mingo@elte.hu> > --- > arch/ia64/include/asm/topology.h | 5 +++-- > arch/powerpc/include/asm/topology.h | 2 +- > arch/sh/include/asm/topology.h | 3 ++- > arch/x86/include/asm/topology.h | 4 +--- > include/linux/topology.h | 2 +- > kernel/sched_features.h | 2 +- > 6 files changed, 9 insertions(+), 9 deletions(-) > > diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h > index 47f3c51..42f1673 100644 > --- a/arch/ia64/include/asm/topology.h > +++ b/arch/ia64/include/asm/topology.h > @@ -61,7 +61,7 @@ void build_cpu_to_node_map(void); > .cache_nice_tries = 2, \ > .busy_idx = 2, \ > .idle_idx = 1, \ > - .newidle_idx = 2, \ > + .newidle_idx = 0, \ > .wake_idx = 0, \ > .forkexec_idx = 1, \ > .flags = SD_LOAD_BALANCE \ > @@ -87,10 +87,11 @@ void build_cpu_to_node_map(void); > .cache_nice_tries = 2, \ > .busy_idx = 3, \ > .idle_idx = 2, \ > - .newidle_idx = 2, \ > + .newidle_idx = 0, \ > .wake_idx = 0, \ > .forkexec_idx = 1, \ > .flags = SD_LOAD_BALANCE \ > + | SD_BALANCE_NEWIDLE \ > | SD_BALANCE_EXEC \ > | SD_BALANCE_FORK \ > | SD_BALANCE_WAKE \ > diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h > index a6b220a..1a2c9eb 100644 > --- a/arch/powerpc/include/asm/topology.h > +++ b/arch/powerpc/include/asm/topology.h > @@ -57,7 +57,7 @@ static inline int pcibus_to_node(struct pci_bus *bus) > .cache_nice_tries = 1, \ > .busy_idx = 3, \ > .idle_idx = 1, \ > - .newidle_idx = 2, \ > + .newidle_idx = 0, \ > .wake_idx = 0, \ > .flags = SD_LOAD_BALANCE \ > | SD_BALANCE_EXEC \ > diff --git a/arch/sh/include/asm/topology.h b/arch/sh/include/asm/topology.h > index 9054e5c..c843677 100644 > --- a/arch/sh/include/asm/topology.h > +++ b/arch/sh/include/asm/topology.h > 
@@ -15,13 +15,14 @@ > .cache_nice_tries = 2, \ > .busy_idx = 3, \ > .idle_idx = 2, \ > - .newidle_idx = 2, \ > + .newidle_idx = 0, \ > .wake_idx = 0, \ > .forkexec_idx = 1, \ > .flags = SD_LOAD_BALANCE \ > | SD_BALANCE_FORK \ > | SD_BALANCE_EXEC \ > | SD_BALANCE_WAKE \ > + | SD_BALANCE_NEWIDLE \ > | SD_SERIALIZE, \ > .last_balance = jiffies, \ > .balance_interval = 1, \ > diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h > index 4b1b335..7fafd1b 100644 > --- a/arch/x86/include/asm/topology.h > +++ b/arch/x86/include/asm/topology.h > @@ -116,14 +116,12 @@ extern unsigned long node_remap_size[]; > > # define SD_CACHE_NICE_TRIES 1 > # define SD_IDLE_IDX 1 > -# define SD_NEWIDLE_IDX 2 > # define SD_FORKEXEC_IDX 0 > > #else > > # define SD_CACHE_NICE_TRIES 2 > # define SD_IDLE_IDX 2 > -# define SD_NEWIDLE_IDX 2 > # define SD_FORKEXEC_IDX 1 > > #endif > @@ -137,7 +135,7 @@ extern unsigned long node_remap_size[]; > .cache_nice_tries = SD_CACHE_NICE_TRIES, \ > .busy_idx = 3, \ > .idle_idx = SD_IDLE_IDX, \ > - .newidle_idx = SD_NEWIDLE_IDX, \ > + .newidle_idx = 0, \ > .wake_idx = 0, \ > .forkexec_idx = SD_FORKEXEC_IDX, \ > \ > diff --git a/include/linux/topology.h b/include/linux/topology.h > index c87edcd..4298745 100644 > --- a/include/linux/topology.h > +++ b/include/linux/topology.h > @@ -151,7 +151,7 @@ int arch_update_cpu_topology(void); > .cache_nice_tries = 1, \ > .busy_idx = 2, \ > .idle_idx = 1, \ > - .newidle_idx = 2, \ > + .newidle_idx = 0, \ > .wake_idx = 0, \ > .forkexec_idx = 1, \ > \ > diff --git a/kernel/sched_features.h b/kernel/sched_features.h > index 891ea0f..e98c2e8 100644 > --- a/kernel/sched_features.h > +++ b/kernel/sched_features.h > @@ -67,7 +67,7 @@ SCHED_FEAT(AFFINE_WAKEUPS, 1) > * wakeup-preemption), since its likely going to consume data we > * touched, increases cache locality. 
> */ > -SCHED_FEAT(NEXT_BUDDY, 1) > +SCHED_FEAT(NEXT_BUDDY, 0) > > /* > * Prefer to schedule the task that ran last (when we did ^ permalink raw reply [flat|nested] 30+ messages in thread
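For context on why the newidle_idx change in the patch above matters: the load balancer does not necessarily read a runqueue's instantaneous load. Each index into cpu_load[] is a progressively more damped average, so higher indices report staler values. A rough Python model of the per-tick update (a simplified sketch of the behaviour of update_cpu_load() in kernel/sched.c of that era, not the kernel code itself) shows the effect:

```python
def update_cpu_load(cpu_load, current_load):
    """Per-tick damping, modelled on the kernel's update:
    cpu_load[i] = (cpu_load[i] * (2^i - 1) + current_load) / 2^i
    Higher indices decay more slowly and so lag further behind reality."""
    for i in range(len(cpu_load)):
        scale = 1 << i
        cpu_load[i] = (cpu_load[i] * (scale - 1) + current_load) // scale
    return cpu_load

# A runqueue that was busy (load 1024) suddenly goes idle:
loads = [1024, 1024, 1024]
update_cpu_load(loads, 0)
# Index 0 sees the drop immediately, while index 2 still reports most
# of the old load for several more ticks.
```

This is the trade-off being debated: newidle_idx = 2 gives stable but stale numbers (the "funny balance decisions based on old load numbers" Peter describes), while newidle_idx = 0 reacts instantly at the cost of chasing transients.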
* Re: newidle balancing in NUMA domain? 2009-11-24 6:36 ` Nick Piggin @ 2009-11-24 17:24 ` Jason Garrett-Glaser 2009-11-24 18:09 ` Mike Galbraith 2009-11-30 8:19 ` Nick Piggin 0 siblings, 2 replies; 30+ messages in thread From: Jason Garrett-Glaser @ 2009-11-24 17:24 UTC (permalink / raw) To: Nick Piggin; +Cc: Ingo Molnar, Peter Zijlstra, Linux Kernel Mailing List > Quite a few being one test case, and on a program with a horrible > parallelism design (rapid heavy weight forks to distribute small > units of work). > If x264 is declared dainbramaged, that's fine with me too. We did multiple benchmarks using a thread pool and it did not help. If you want to declare our app "braindamaged", feel free, but pooling threads to avoid re-creation gave no benefit whatsoever. If you think the parallelism methodology is wrong as a whole, you're basically saying that Linux shouldn't be used for video compression, because this is the exact same threading model used by almost every single video encoder ever made. There are actually a few that use slice-based threading, but those are actually even worse from your perspective, because slice-based threading spawns multiple threads PER FRAME instead of one per frame. Because of the inter-frame dependencies in video coding it is impossible to efficiently get a granularity of more than one thread per frame. Pooling threads doesn't change the fact that you are conceptually creating a thread for each frame--it just eliminates the pthread_create call. In theory you could do one thread per group of frames, but that is completely unrealistic for real-time encoding (e.g. streaming), requires a catastrophically large amount of memory, makes it impossible to track the bit buffer, and all other sorts of bad stuff. Jason Garrett-Glaser ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: newidle balancing in NUMA domain? 2009-11-24 17:24 ` Jason Garrett-Glaser @ 2009-11-24 18:09 ` Mike Galbraith 2009-11-30 8:19 ` Nick Piggin 1 sibling, 0 replies; 30+ messages in thread From: Mike Galbraith @ 2009-11-24 18:09 UTC (permalink / raw) To: Jason Garrett-Glaser Cc: Nick Piggin, Ingo Molnar, Peter Zijlstra, Linux Kernel Mailing List On Tue, 2009-11-24 at 09:24 -0800, Jason Garrett-Glaser wrote: > > Quite a few being one test case, and on a program with a horrible > > parallelism design (rapid heavy weight forks to distribute small > > units of work). > > > If x264 is declared dainbramaged, that's fine with me too. > > We did multiple benchmarks using a thread pool and it did not help. Yes, I see no way it possibly could make any difference. Well, there is one thing. We have this START_DEBIT thing, using a thread pool avoids that very significant penalty. WRT idle->busy again time though, it can't make any difference. > If you want to declare our app "braindamaged", feel free, but pooling > threads to avoid re-creation gave no benefit whatsoever. If you think > the parallelism methodology is wrong as a whole, you're basically > saying that Linux shouldn't be used for video compression, because > this is the exact same threading model used by almost every single > video encoder ever made. There are actually a few that use > slice-based threading, but those are actually even worse from your > perspective, because slice-based threading spawns mulitple threads PER > FRAME instead of one per frame. > > Because of the inter-frame dependencies in video coding it is > impossible to efficiently get a granularity of more than one thread > per frame. Pooling threads doesn't change the fact that you are > conceptually creating a thread for each frame--it just eliminates the > pthread_create call. In theory you could do one thread per group of > frames, but that is completely unrealistic for real-time encoding > (e.g. 
streaming), requires a catastrophically large amount of memory, > makes it impossible to track the bit buffer, and all other sorts of > bad stuff. I don't consider x264 to be braindamaged btw, I consider it to be a very nice testcase for the scheduler. As soon as I saw the problem it highlighted so well, it became a permanent member of my collection. -Mike ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: newidle balancing in NUMA domain? 2009-11-24 17:24 ` Jason Garrett-Glaser 2009-11-24 18:09 ` Mike Galbraith @ 2009-11-30 8:19 ` Nick Piggin 2009-12-01 8:18 ` Jason Garrett-Glaser 1 sibling, 1 reply; 30+ messages in thread From: Nick Piggin @ 2009-11-30 8:19 UTC (permalink / raw) To: Jason Garrett-Glaser Cc: Ingo Molnar, Peter Zijlstra, Linux Kernel Mailing List On Tue, Nov 24, 2009 at 09:24:26AM -0800, Jason Garrett-Glaser wrote: > > Quite a few being one test case, and on a program with a horrible > > parallelism design (rapid heavy weight forks to distribute small > > units of work). > > > If x264 is declared dainbramaged, that's fine with me too. > > We did multiple benchmarks using a thread pool and it did not help. > If you want to declare our app "braindamaged", feel free, but pooling > threads to avoid re-creation gave no benefit whatsoever. If you think > the parallelism methodology is wrong as a whole, you're basically > saying that Linux shouldn't be used for video compression, because > this is the exact same threading model used by almost every single > video encoder ever made. There are actually a few that use > slice-based threading, but those are actually even worse from your > perspective, because slice-based threading spawns mulitple threads PER > FRAME instead of one per frame. > > Because of the inter-frame dependencies in video coding it is > impossible to efficiently get a granularity of more than one thread > per frame. Pooling threads doesn't change the fact that you are > conceptually creating a thread for each frame--it just eliminates the > pthread_create call. In theory you could do one thread per group of > frames, but that is completely unrealistic for real-time encoding > (e.g. streaming), requires a catastrophically large amount of memory, > makes it impossible to track the bit buffer, and all other sorts of > bad stuff. 
If you can scale to N threads by having 1 frame per thread, then you can scale to N/2 threads and have 2 frames per thread. Can't you? Is your problem in scaling to a large N? ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: newidle balancing in NUMA domain? 2009-11-30 8:19 ` Nick Piggin @ 2009-12-01 8:18 ` Jason Garrett-Glaser 0 siblings, 0 replies; 30+ messages in thread From: Jason Garrett-Glaser @ 2009-12-01 8:18 UTC (permalink / raw) To: Nick Piggin; +Cc: Ingo Molnar, Peter Zijlstra, Linux Kernel Mailing List On Mon, Nov 30, 2009 at 12:19 AM, Nick Piggin <npiggin@suse.de> wrote: > On Tue, Nov 24, 2009 at 09:24:26AM -0800, Jason Garrett-Glaser wrote: >> > Quite a few being one test case, and on a program with a horrible >> > parallelism design (rapid heavy weight forks to distribute small >> > units of work). >> >> > If x264 is declared dainbramaged, that's fine with me too. >> >> We did multiple benchmarks using a thread pool and it did not help. >> If you want to declare our app "braindamaged", feel free, but pooling >> threads to avoid re-creation gave no benefit whatsoever. If you think >> the parallelism methodology is wrong as a whole, you're basically >> saying that Linux shouldn't be used for video compression, because >> this is the exact same threading model used by almost every single >> video encoder ever made. There are actually a few that use >> slice-based threading, but those are actually even worse from your >> perspective, because slice-based threading spawns mulitple threads PER >> FRAME instead of one per frame. >> >> Because of the inter-frame dependencies in video coding it is >> impossible to efficiently get a granularity of more than one thread >> per frame. Pooling threads doesn't change the fact that you are >> conceptually creating a thread for each frame--it just eliminates the >> pthread_create call. In theory you could do one thread per group of >> frames, but that is completely unrealistic for real-time encoding >> (e.g. streaming), requires a catastrophically large amount of memory, >> makes it impossible to track the bit buffer, and all other sorts of >> bad stuff. 
> > If you can scale to N threads by having 1 frame per thread, then > you can scale to N/2 threads and have 2 frames per thread. Can't > you? > > Is your problem in scaling to a large N? > > x264's threading is described here: http://akuvian.org/src/x264/sliceless_threads.txt By example (3 threads), simplified: Step 0: Frame 0: 0% done Step 1: Frame 0: 33% done Frame 1: 0% done Step 2: Frame 0: 66% done Frame 1: 33% done Frame 2: 0% done Step 3: Frame 0: 100% done Frame 1: 66% done Frame 2: 33% done Frame 3: 0% done Step 4: Frame 1: 100% done Frame 2: 66% done Frame 3: 33% done Frame 4: 0% done (etc) The motion search is restricted so that, for example, in Step 3, frame 2 doesn't look beyond the completed 66% of frame 1. There's room reserved in terms of height for sync so that each thread doesn't have to be exactly in lock-step with the others. This avoids most unnecessary waiting. The problem is that each frame is inherently one "work unit". Its dependencies all consist on the previous frame (Frame 1 depends on Frame 0). It doesn't make any sense to try to lump multiple frames together into a work unit when the dependencies don't work that way. Just dumping two frames arbitrarily in one thread turns this into a thread pool, which as mentioned previously probably wouldn't help significantly. If you meant working on two frames simultaneously in the same thread, that's even worse--it's going to be a cache thrashing disaster, since the scheduler can no longer move two threads to separate cores, and you now have two totally separate sets of processing trying to dump themselves into the same cache. Furthermore, that doesn't reduce the main limitation on threading: the vertical height of the frame. Also, another thing to note is that "fast thread creation" isn't the only problem here: the changes to the scheduler gave x264 enormous speed boosts even at *slower* encoding modes. 
One user reported a gain from 25fps -> 39fps, for example; that's dozens of milliseconds per thread, far longer than I would think would cause problems due to threads being too short lived. You should probably consider doing some testing with slower encoding as well, both in terms of fast settings and high-resolution inputs--and slow settings with low-resolution inputs, where the bottleneck is purely computational. Some resources for such testing: 1. http://media.xiph.org/video/derf/ has a lot of free test clips (HD ones at the bottom). 2. x264 --help lists a set of presets from "ultrafast" to "placebo" which can be used for testing purposes. "veryslow" and "placebo" are probably not very suitable as they often tend to be horrifically lookahead-bottlenecked. Jason ^ permalink raw reply [flat|nested] 30+ messages in thread
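The sliceless-threading scheme Jason describes (one short-lived thread per frame, with the motion search limited to the already-encoded part of the reference frame) can be modelled in a few lines. This is an illustrative simulation of the scheduling pattern, not x264's actual code; the frame heights, margins, and thread counts are made-up numbers:

```python
import threading

class Frame:
    """Encoding progress of one frame; later frames wait on our rows."""
    def __init__(self, height):
        self.height = height
        self.rows_done = 0
        self.cond = threading.Condition()

    def wait_for_rows(self, n):
        # Block until at least n rows (capped at frame height) are done.
        with self.cond:
            while self.rows_done < min(n, self.height):
                self.cond.wait()

    def finish_row(self):
        with self.cond:
            self.rows_done += 1
            self.cond.notify_all()

def encode_frame(frames, idx, sync_margin):
    ref = frames[idx - 1] if idx > 0 else None
    for row in range(frames[idx].height):
        if ref is not None:
            # Motion search may only reach the part of the reference
            # frame that is already encoded, plus a small sync margin.
            ref.wait_for_rows(row + sync_margin)
        frames[idx].finish_row()

def encode(n_frames, height, sync_margin=3, n_threads=3):
    frames = [Frame(height) for _ in range(n_frames)]
    in_flight = threading.Semaphore(n_threads)  # frames encoding at once
    workers = []
    for i in range(n_frames):
        in_flight.acquire()
        def run(i=i):
            encode_frame(frames, i, sync_margin)
            in_flight.release()
        t = threading.Thread(target=run)
        t.start()
        workers.append(t)
    for t in workers:
        t.join()
    return [f.rows_done for f in frames]
```

Note the pattern the thread fights over: one thread per frame, constantly blocking and waking against its neighbour. A thread pool would only remove the Thread() construction cost; the wait/wake pipeline that the scheduler has to balance stays exactly the same, which is Jason's point about why pooling gave no benefit.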
* Re: newidle balancing in NUMA domain? 2009-11-23 11:22 newidle balancing in NUMA domain? Nick Piggin 2009-11-23 11:36 ` Peter Zijlstra @ 2009-11-23 14:37 ` Mike Galbraith 2009-11-23 15:11 ` Nick Piggin 1 sibling, 1 reply; 30+ messages in thread From: Mike Galbraith @ 2009-11-23 14:37 UTC (permalink / raw) To: Nick Piggin; +Cc: Linux Kernel Mailing List, Ingo Molnar, Peter Zijlstra On Mon, 2009-11-23 at 12:22 +0100, Nick Piggin wrote: > Hi, > > I wonder why it was decided to do newidle balancing in the NUMA > domain? And with newidle_idx == 0 at that. > > This means that every time the CPU goes idle, every CPU in the > system gets a remote cacheline or two hit. Not very nice O(n^2) > behaviour on the interconnect. Not to mention trashing our > NUMA locality. Painful on little boxen too if left unchained. > And then I see some proposal to do ratelimiting of newidle > balancing :( Seems like hack upon hack making behaviour much more > complex. That's mine, and yeah, it is hackish. It just keeps newidle at bay for high speed switchers while keeping it available to kick start CPUs for fork/exec loads. Suggestions welcome. I have a threaded testcase (x264) where turning the thing off costs ~40% throughput. Take that same testcase (or ilk) to a big NUMA beast, and performance will very likely suck just as bad as it does on my little Q6600 box. Other than that, I'd be most happy to see the thing crawl back in its cave and _die_ despite the little gain it provides for a kbuild. It has been (is) very annoying. > One "symptom" of bad mutex contention can be that increasing the > balancing rate can help a bit to reduce idle time (because it > can get the woken thread which is holding a semaphore to run ASAP > after we run out of runnable tasks in the system due to them > hitting contention on that semaphore). Yes, when mysql+oltp starts jamming up, load balancing helps bust up the logjam somewhat, but that's not at all why newidle was activated.. 
> I really hope this change wasn't done in order to help -rt or > something sad like sysbench on MySQL. Newidle was activated to improve fork/exec CPU utilization. A nasty side effect is that it tries to rip other loads to tatters. > And btw, I'll stay out of mentioning anything about CFS development, > but it really sucks to be continually making significant changes to > domains balancing *and* per-runqueue scheduling at the same time :( > It makes it even difficult to bisect things. Yeah, balancing got jumbled up with desktop tweakage. Much fallout this round, and some things still to be fixed back up. -Mike ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: newidle balancing in NUMA domain? 2009-11-23 14:37 ` Mike Galbraith @ 2009-11-23 15:11 ` Nick Piggin 2009-11-23 15:21 ` Peter Zijlstra 0 siblings, 1 reply; 30+ messages in thread From: Nick Piggin @ 2009-11-23 15:11 UTC (permalink / raw) To: Mike Galbraith; +Cc: Linux Kernel Mailing List, Ingo Molnar, Peter Zijlstra On Mon, Nov 23, 2009 at 03:37:39PM +0100, Mike Galbraith wrote: > On Mon, 2009-11-23 at 12:22 +0100, Nick Piggin wrote: > > Hi, > > > > I wonder why it was decided to do newidle balancing in the NUMA > > domain? And with newidle_idx == 0 at that. > > > > This means that every time the CPU goes idle, every CPU in the > > system gets a remote cacheline or two hit. Not very nice O(n^2) > > behaviour on the interconnect. Not to mention trashing our > > NUMA locality. > > Painful on little boxen too if left unchained. Yep. It's an order of magnitude more expensive to go on the interconnect rather than stay in LLC. So even on little systems, new idle balancing can become an order of magnitude more expensive. On slightly larger systems, where you have an order of magnitude more cores on remote nodes than local, new idle balancing can now be two orders of magnitude more expensive. > > And then I see some proposal to do ratelimiting of newidle > > balancing :( Seems like hack upon hack making behaviour much more > > complex. > > That's mine, and yeah, it is hackish. It just keeps newidle at bay for > high speed switchers while keeping it available to kick start CPUs for > fork/exec loads. Suggestions welcome. I have a threaded testcase > (x264) where turning the think off costs ~40% throughput. Take that > same testcase (or ilk) to a big NUMA beast, and performance will very > likely suck just as bad as it does on my little Q6600 box. > > Other than that, I'd be most happy to see the thing crawl back in it's > cave and _die_ despite the little gain it provides for a kbuild. It has > been (is) very annoying. 
Wait, you say it was activated to improve fork/exec CPU utilization? For the x264 load? What do you mean by this? Do you mean it is doing a lot of fork/exec/exits and load is not being spread quickly enough? Or that NUMA allocations get screwed up because tasks don't get spread out quickly enough before running? In either case, I think newidle balancing is maybe not the right solution. newidle balancing only checks the system state when the destination CPU goes idle. fork events increase load at the source CPU. So for example if you find newidle helps to pick up forks, then if the newidle event happens to come in before the fork, we'll have to wait for the next rebalance event. So possibly making fork/exec balancing more aggressive might be a better approach. This can be done by reducing the damping idx, or perhaps some other conditions to reduce eg imbalance_pct or something for forkexec balancing. Probably needs some studying of the workload to work out why forkexec is failing. > > One "symptom" of bad mutex contention can be that increasing the > > balancing rate can help a bit to reduce idle time (because it > > can get the woken thread which is holding a semaphore to run ASAP > > after we run out of runnable tasks in the system due to them > > hitting contention on that semaphore). > > Yes, when mysql+oltp starts jamming up, load balancing helps bust up the > logjam somewhat, but that's not at all why newidle was activated.. OK good to know. > > I really hope this change wasn't done in order to help -rt or > > something sad like sysbench on MySQL. > > Newidle was activated to improve fork/exec CPU utilization. A nasty > side effect is that it tries to rip other loads to tatters. > > > And btw, I'll stay out of mentioning anything about CFS development, > > but it really sucks to be continually making significant changes to > > domains balancing *and* per-runqueue scheduling at the same time :( > > It makes it even difficult to bisect things. 
> > Yeah, balancing got jumbled up with desktop tweakage. Much fallout this > round, and some things still to be fixed back up. OK. This would be great if fixing up involves making things closer to what they were rather than adding more complex behaviour on top of other changes that broke stuff. And doing it in 2.6.32 would be kind of nice... ^ permalink raw reply [flat|nested] 30+ messages in thread
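Nick's alternative above — make fork/exec balancing more eager instead of enabling newidle in the NUMA domain — comes down to the threshold the balancer applies before placing a new task on a remote group. The sketch below is a simplified Python model of the shape of that decision (loosely mirroring find_idlest_group()'s comparison; the exact kernel expression differs, and the numbers here are made up):

```python
def stay_local(this_load, idlest_remote_load, imbalance_pct):
    """Keep the new task in the local group unless the idlest remote
    group is lighter by more than the imbalance threshold."""
    # Soften the configured percentage, roughly halfway between
    # 100% and imbalance_pct, as the kernel does for fork/exec.
    threshold = 100 + (imbalance_pct - 100) // 2
    return 100 * this_load < threshold * idlest_remote_load

# With an imbalance_pct of 125, a remote group must be noticeably
# lighter before a fork is placed there:
stay_local(1000, 900, 125)   # remote only slightly lighter -> stay
stay_local(1000, 500, 125)   # remote much lighter -> migrate
# Lowering imbalance_pct makes fork balancing spread work sooner:
stay_local(1000, 900, 105)
```

This is the knob Nick refers to: shrinking imbalance_pct (or the damping index feeding this_load) makes fork/exec placement spread earlier, pushing work out at creation time rather than pulling it back with newidle once a CPU has already gone empty.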
* Re: newidle balancing in NUMA domain? 2009-11-23 15:11 ` Nick Piggin @ 2009-11-23 15:21 ` Peter Zijlstra 2009-11-23 15:29 ` Nick Piggin 0 siblings, 1 reply; 30+ messages in thread From: Peter Zijlstra @ 2009-11-23 15:21 UTC (permalink / raw) To: Nick Piggin; +Cc: Mike Galbraith, Linux Kernel Mailing List, Ingo Molnar On Mon, 2009-11-23 at 16:11 +0100, Nick Piggin wrote: > Wait, you say it was activated to improve fork/exec CPU utilization? > For the x264 load? What do you mean by this? Do you mean it is doing > a lot of fork/exec/exits and load is not being spread quickly enough? > Or that NUMA allocations get screwed up because tasks don't get spread > out quickly enough before running? > > In either case, I think newidle balancing is maybe not the right solution. > newidle balancing only checks the system state when the destination > CPU goes idle. fork events increase load at the source CPU. So for > example if you find newidle helps to pick up forks, then if the > newidle event happens to come in before the fork, we'll have to wait > for the next rebalance event. > > So possibly making fork/exec balancing more aggressive might be a > better approach. This can be done by reducing the damping idx, or > perhaps some other conditions to reduce eg imbalance_pct or something > for forkexec balancing. Probably needs some studying of the workload > to work out why forkexec is failing. From what I can remember from that workload it basically spawns tons of very short lived threads, waits for a bunch to complete, goto 1. Fork balancing only works until all cpus are active. But once a core goes idle it's left idle until we hit a general load-balance cycle. Newidle helps because it picks up these threads from other cpus, completing the current batch sooner, allowing the program to continue with the next. There's just not much you can do from the fork() side of things once you've got them all running. > OK. 
This would be great if fixing up involves making things closer > to what they were rather than adding more complex behaviour on top > of other changes that broke stuff. And doing it in 2.6.32 would be > kind of nice... .32 is kind of closed, with us being at -rc8. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: newidle balancing in NUMA domain? 2009-11-23 15:21 ` Peter Zijlstra @ 2009-11-23 15:29 ` Nick Piggin 2009-11-23 15:37 ` Peter Zijlstra ` (2 more replies) 0 siblings, 3 replies; 30+ messages in thread From: Nick Piggin @ 2009-11-23 15:29 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Mike Galbraith, Linux Kernel Mailing List, Ingo Molnar On Mon, Nov 23, 2009 at 04:21:44PM +0100, Peter Zijlstra wrote: > On Mon, 2009-11-23 at 16:11 +0100, Nick Piggin wrote: > > > Wait, you say it was activated to improve fork/exec CPU utilization? > > For the x264 load? What do you mean by this? Do you mean it is doing > > a lot of fork/exec/exits and load is not being spread quickly enough? > > Or that NUMA allocations get screwed up because tasks don't get spread > > out quickly enough before running? > > > > In either case, I think newidle balancing is maybe not the right solution. > > newidle balancing only checks the system state when the destination > > CPU goes idle. fork events increase load at the source CPU. So for > > example if you find newidle helps to pick up forks, then if the > > newidle event happens to come in before the fork, we'll have to wait > > for the next rebalance event. > > > > So possibly making fork/exec balancing more aggressive might be a > > better approach. This can be done by reducing the damping idx, or > > perhaps some other conditions to reduce eg imbalance_pct or something > > for forkexec balancing. Probably needs some studying of the workload > > to work out why forkexec is failing. > > >From what I can remember from that workload it basically spawns tons of > very short lived threads, waits for a bunch to complete, goto 1. So basically about the least well performing or scalable possible software architecture. This is exactly the wrong thing to optimise for, guys. 
The fact that you have to coax the scheduler into touching heaps more remote cachelines and vastly increasing the amount of inter node task migration should have been kind of a hint. > Fork balancing only works until all cpus are active. But once a core > goes idle it's left idle until we hit a general load-balance cycle. > Newidle helps because it picks up these threads from other cpus, > completing the current batch sooner, allowing the program to continue > with the next. > > There's just not much you can do from the fork() side of things once > you've got them all running. It sounds like allowing fork balancing to be more aggressive could definitely help. > > OK. This would be great if fixing up involves making things closer > > to what they were rather than adding more complex behaviour on top > > of other changes that broke stuff. And doing it in 2.6.32 would be > > kind of nice... > > .32 is kind of closed, with us being at -rc8. It's a bad regression though. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: newidle balancing in NUMA domain? 2009-11-23 15:29 ` Nick Piggin @ 2009-11-23 15:37 ` Peter Zijlstra 2009-11-24 6:54 ` Nick Piggin 2009-11-23 15:53 ` Mike Galbraith 2009-11-23 17:04 ` Ingo Molnar 2 siblings, 1 reply; 30+ messages in thread From: Peter Zijlstra @ 2009-11-23 15:37 UTC (permalink / raw) To: Nick Piggin; +Cc: Mike Galbraith, Linux Kernel Mailing List, Ingo Molnar On Mon, 2009-11-23 at 16:29 +0100, Nick Piggin wrote: > > > .32 is kind of closed, with us being at -rc8. > > It's a bad regression though. Still haven't heard what workload, what machine, by how-much, nor any reproducer. And again, -rc8 is really late. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: newidle balancing in NUMA domain? 2009-11-23 15:37 ` Peter Zijlstra @ 2009-11-24 6:54 ` Nick Piggin 0 siblings, 0 replies; 30+ messages in thread From: Nick Piggin @ 2009-11-24 6:54 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Mike Galbraith, Linux Kernel Mailing List, Ingo Molnar On Mon, Nov 23, 2009 at 04:37:27PM +0100, Peter Zijlstra wrote: > On Mon, 2009-11-23 at 16:29 +0100, Nick Piggin wrote: > > > > > .32 is kind of closed, with us being at -rc8. > > > > It's a bad regression though. > > Still haven't heard what workload, what machine, by how-much, nor any > reproducer. http://patchwork.kernel.org/patch/58931/ > And again, -rc8 is really late. Exactly the right time to sort out regressions, however. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: newidle balancing in NUMA domain? 2009-11-23 15:29 ` Nick Piggin 2009-11-23 15:37 ` Peter Zijlstra @ 2009-11-23 15:53 ` Mike Galbraith 2009-11-24 6:53 ` Nick Piggin 2009-11-23 17:04 ` Ingo Molnar 2 siblings, 1 reply; 30+ messages in thread From: Mike Galbraith @ 2009-11-23 15:53 UTC (permalink / raw) To: Nick Piggin; +Cc: Peter Zijlstra, Linux Kernel Mailing List, Ingo Molnar On Mon, 2009-11-23 at 16:29 +0100, Nick Piggin wrote: > So basically about the least well performing or scalable possible > software architecture. This is exactly the wrong thing to optimise > for, guys. Hm. Isn't fork/exec our daily bread? > The fact that you have to coax the scheduler into touching heaps > more remote cachelines and vastly increasing the amount of inter > node task migration should have been kind of a hint. > > > > Fork balancing only works until all cpus are active. But once a core > > goes idle it's left idle until we hit a general load-balance cycle. > > Newidle helps because it picks up these threads from other cpus, > > completing the current batch sooner, allowing the program to continue > > with the next. > > > > There's just not much you can do from the fork() side of things once > > you've got them all running. > > It sounds like allowing fork balancing to be more aggressive could > definitely help. It doesn't. Task which is _already_ forked, placed and waiting over yonder can't do spit for getting this cpu active again without running so he can phone home. This isn't only observable with x264, it just rubs our noses in it. It is also quite observable in a kbuild. What if the waiter is your next fork? -Mike ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: newidle balancing in NUMA domain? 2009-11-23 15:53 ` Mike Galbraith @ 2009-11-24 6:53 ` Nick Piggin 2009-11-24 8:40 ` Mike Galbraith 0 siblings, 1 reply; 30+ messages in thread From: Nick Piggin @ 2009-11-24 6:53 UTC (permalink / raw) To: Mike Galbraith; +Cc: Peter Zijlstra, Linux Kernel Mailing List, Ingo Molnar On Mon, Nov 23, 2009 at 04:53:37PM +0100, Mike Galbraith wrote: > On Mon, 2009-11-23 at 16:29 +0100, Nick Piggin wrote: > > > So basically about the least well performing or scalable possible > > software architecture. This is exactly the wrong thing to optimise > > for, guys. > > Hm. Isn't fork/exec our daily bread? No. Not for handing out tiny chunks of work and attempting to do them in parallel. There is this thing called Amdahl's law, and if you write a parallel program that wantonly uses the heaviest possible primitives in its serial sections, then it doesn't deserve to go fast. That is what IPC or shared memory is for. Vastly faster, vastly more scalable, vastly easier for scheduler balancing (both via manual or automatic placement). > > The fact that you have to coax the scheduler into touching heaps > > more remote cachelines and vastly increasing the amount of inter > > node task migration should have been kind of a hint. > > > > > > > Fork balancing only works until all cpus are active. But once a core > > > goes idle it's left idle until we hit a general load-balance cycle. > > > Newidle helps because it picks up these threads from other cpus, > > > completing the current batch sooner, allowing the program to continue > > > with the next. > > > > > > There's just not much you can do from the fork() side of things once > > > you've got them all running. > > > > It sounds like allowing fork balancing to be more aggressive could > > definitely help. > > It doesn't. Task which is _already_ forked, placed and waiting over > yonder can't do spit for getting this cpu active again without running > so he can phone home. 
This isn't only observable with x264, it just > rubs our noses in it. It is also quite observable in a kbuild. What if > the waiter is your next fork? I'm not saying that vastly increasing task movement between NUMA nodes won't *help* some workloads. Indeed they tend to be ones that aren't very well parallelised (then it becomes critical to wake up any waiter if a CPU becomes free because it might be holding a heavily contended resource). But can you appreciate that these are at one side of the spectrum of workloads, and that others will much prefer to keep good affinity? No matter how "nice" your workload is, you can't keep traffic off the interconnect if the kernel screws up your numa placement. And also, I'm not saying that we were at _exactly_ the right place before and there was no room for improvement, but considering that we didn't have a lot of active _regressions_ in the balancer, we can really use that to our favour and concentrate changes in code that does have regressions. And be really conservative and careful with changes to the balancer. ^ permalink raw reply [flat|nested] 30+ messages in thread
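Amdahl's law, which Nick invokes in this exchange, is what caps the payoff of the workload styles being argued about: speedup = 1 / ((1 - p) + p / n), where p is the parallel fraction of the work and n the number of workers. A quick numeric illustration (standard formula; the fractions chosen are hypothetical):

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only parallel_fraction of the work
    can be spread across the given number of workers."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)

# Even with 95% of the work parallel, 8 workers give less than 6x,
# and the serial 5% caps the speedup below 20x no matter how many
# CPUs (or how much balancing effort) you add.
```

This is the substance of Nick's complaint: if the serial sections use heavyweight primitives like fork, the achievable p shrinks, and no amount of newidle balancing can recover the lost ceiling.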
* Re: newidle balancing in NUMA domain? 2009-11-24 6:53 ` Nick Piggin @ 2009-11-24 8:40 ` Mike Galbraith 2009-11-24 8:58 ` Mike Galbraith 0 siblings, 1 reply; 30+ messages in thread From: Mike Galbraith @ 2009-11-24 8:40 UTC (permalink / raw) To: Nick Piggin; +Cc: Peter Zijlstra, Linux Kernel Mailing List, Ingo Molnar On Tue, 2009-11-24 at 07:53 +0100, Nick Piggin wrote: > On Mon, Nov 23, 2009 at 04:53:37PM +0100, Mike Galbraith wrote: > > On Mon, 2009-11-23 at 16:29 +0100, Nick Piggin wrote: > > > > > So basically about the least well performing or scalable possible > > > software architecture. This is exactly the wrong thing to optimise > > > for, guys. > > > > Hm. Isn't fork/exec our daily bread? > > No. Not for handing out tiny chunks of work and attempting to do > them in parallel. There is this thing called Amdahl's law, and if > you write a parallel program that wantonly uses the heaviest > possible primitives in its serial sections, then it doesn't deserve > to go fast. OK by me. A bit of idle time for kbuild is easily cured with telling make to emit more jobs, so there's enough little jobs to go around. If x264 is declared dainbramaged, that's fine with me too. > That is what IPC or shared memory is for. Vastly faster, vastly more > scalable, vastly easier for scheduler balancing (both via manual or > automatic placement). All well and good for apps that use them. Again, _I_ don't care either way. As stated, I wouldn't cry if newidle died a gruesome death, it has irritated me more than you would ever like to hear about ;-) > > > The fact that you have to coax the scheduler into touching heaps > > > more remote cachelines and vastly increasing the amount of inter > > > node task migration should have been kind of a hint. > > > > > > > > > > Fork balancing only works until all cpus are active. But once a core > > > > goes idle it's left idle until we hit a general load-balance cycle. 
> > > > Newidle helps because it picks up these threads from other cpus, > > > > completing the current batch sooner, allowing the program to continue > > > > with the next. > > > > > > > > There's just not much you can do from the fork() side of things once > > > > you've got them all running. > > > > > > It sounds like allowing fork balancing to be more aggressive could > > > definitely help. > > > > It doesn't. Task which is _already_ forked, placed and waiting over > > yonder can't do spit for getting this cpu active again without running > > so he can phone home. This isn't only observable with x264, it just > > rubs our noses in it. It is also quite observable in a kbuild. What if > > the waiter is your next fork? > > I'm not saying that vastly increasing task movement between NUMA > nodes won't *help* some workloads. Indeed they tend to be ones that > aren't very well parallelised (then it becomes critical to wake up > any waiter if a CPU becomes free because it might be holding a > heavily contended resource). It absolutely will. Your counter arguments are also fully valid. > But can you appreciate that these are at one side of the spectrum of > workloads, and that others will much prefer to keep good affinity? Yup. Anything with cache footprint. That's why I thumped newidle on its pointy head. Trouble is, there's no way to make it perfect for both. As is, there's some pain for both. Maybe TOO much for big iron. > No matter how "nice" your workload is, you can't keep traffic off > the interconnect if the kernel screws up your numa placement. Don't matter if it's NUMA pain. Pain is pain. > And also, I'm not saying that we were at _exactly_ the right place > before and there was no room for improvement, but considering that > we didn't have a lot of active _regressions_ in the balancer, we > can really use that to our favour and concentrate changes in code > that does have regressions. And be really conservative and careful > with changes to the balancer.
No argument against caution. -Mike ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: newidle balancing in NUMA domain? 2009-11-24 8:40 ` Mike Galbraith @ 2009-11-24 8:58 ` Mike Galbraith 2009-11-24 9:11 ` Ingo Molnar 0 siblings, 1 reply; 30+ messages in thread From: Mike Galbraith @ 2009-11-24 8:58 UTC (permalink / raw) To: Nick Piggin; +Cc: Peter Zijlstra, Linux Kernel Mailing List, Ingo Molnar On Tue, 2009-11-24 at 09:40 +0100, Mike Galbraith wrote: > On Tue, 2009-11-24 at 07:53 +0100, Nick Piggin wrote: > > On Mon, Nov 23, 2009 at 04:53:37PM +0100, Mike Galbraith wrote: > > > On Mon, 2009-11-23 at 16:29 +0100, Nick Piggin wrote: > > > > > > > So basically about the least well performing or scalable possible > > > > software architecture. This is exactly the wrong thing to optimise > > > > for, guys. > > > > > > Hm. Isn't fork/exec our daily bread? > > > > No. Not for handing out tiny chunks of work and attempting to do > > them in parallel. There is this thing called Amdahl's law, and if > > you write a parallel program that wantonly uses the heaviest > > possible primitives in its serial sections, then it doesn't deserve > > to go fast. > > OK by me. A bit of idle time for kbuild is easily cured with telling > make to emit more jobs, so there's enough little jobs to go around. > > If x264 is declared dainbramaged, that's fine with me too. (P.S. I don't want to have to explain to users of any such thread happy applications why they suck rocks under Linux though) ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: newidle balancing in NUMA domain? 2009-11-24 8:58 ` Mike Galbraith @ 2009-11-24 9:11 ` Ingo Molnar 2009-11-30 8:27 ` Nick Piggin 0 siblings, 1 reply; 30+ messages in thread From: Ingo Molnar @ 2009-11-24 9:11 UTC (permalink / raw) To: Mike Galbraith; +Cc: Nick Piggin, Peter Zijlstra, Linux Kernel Mailing List * Mike Galbraith <efault@gmx.de> wrote: > On Tue, 2009-11-24 at 09:40 +0100, Mike Galbraith wrote: > > On Tue, 2009-11-24 at 07:53 +0100, Nick Piggin wrote: > > > On Mon, Nov 23, 2009 at 04:53:37PM +0100, Mike Galbraith wrote: > > > > On Mon, 2009-11-23 at 16:29 +0100, Nick Piggin wrote: > > > > > > > > > So basically about the least well performing or scalable possible > > > > > software architecture. This is exactly the wrong thing to optimise > > > > > for, guys. > > > > > > > > Hm. Isn't fork/exec our daily bread? > > > > > > No. Not for handing out tiny chunks of work and attempting to do > > > them in parallel. There is this thing called Amdahl's law, and if > > > you write a parallel program that wantonly uses the heaviest > > > possible primitives in its serial sections, then it doesn't deserve > > > to go fast. > > > > OK by me. A bit of idle time for kbuild is easily cured with telling > > make to emit more jobs, so there's enough little jobs to go around. > > > > If x264 is declared dainbramaged, that's fine with me too. > > (P.S. I don't want to have to explain to users of any such thread > happy applications why they suck rocks under Linux though) The 68% speedup is pretty valid for your change and the workload isn't particularly odd beyond the fast creation rate of threads - but i'd not blame the app for that, Linux creates/destroys threads very fast. Your followup load-balancing fix got queued up in the scheduler tree for v2.6.33 ~two weeks ago: 1b9508f: sched: Rate-limit newidle certainly handles some of the NUMA pain.
If not, numbers telling us the other story (and patches to extend 'perf bench sched' with matching testcases) would be helpful, i'm not seeing problems on my NUMA testbox. 1b9508f was certainly too late for 2.6.32, but since it appears to be fine it can be forwarded to stable@kernel.org so it makes it into 2.6.32.1. Thanks, Ingo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: newidle balancing in NUMA domain? 2009-11-24 9:11 ` Ingo Molnar @ 2009-11-30 8:27 ` Nick Piggin 0 siblings, 0 replies; 30+ messages in thread From: Nick Piggin @ 2009-11-30 8:27 UTC (permalink / raw) To: Ingo Molnar; +Cc: Mike Galbraith, Peter Zijlstra, Linux Kernel Mailing List On Tue, Nov 24, 2009 at 10:11:24AM +0100, Ingo Molnar wrote: > > * Mike Galbraith <efault@gmx.de> wrote: > > > On Tue, 2009-11-24 at 09:40 +0100, Mike Galbraith wrote: > > > On Tue, 2009-11-24 at 07:53 +0100, Nick Piggin wrote: > > > > On Mon, Nov 23, 2009 at 04:53:37PM +0100, Mike Galbraith wrote: > > > > > On Mon, 2009-11-23 at 16:29 +0100, Nick Piggin wrote: > > > > > > > > > > > So basically about the least well performing or scalable possible > > > > > > software architecture. This is exactly the wrong thing to optimise > > > > > > for, guys. > > > > > > > > > > Hm. Isn't fork/exec our daily bread? > > > > > > > > No. Not for handing out tiny chunks of work and attempting to do > > > > them in parallel. There is this thing called Amdahl's law, and if > > > > you write a parallel program that wantonly uses the heaviest > > > > possible primitives in its serial sections, then it doesn't deserve > > > > to go fast. > > > > > > OK by me. A bit of idle time for kbuild is easily cured with telling > > > make to emit more jobs, so there's enough little jobs to go around. > > > > > > If x264 is declared dainbramaged, that's fine with me too. > > > > (P.S. I don't want to have to explain to users of any such thread > > happy applications why they suck rocks under Linux though) > > The 68% speedup is pretty valid for your change and the workload isn't > particularly odd beyond the fast creation rate of threads - but i'd not > blame the app for that, Linux creates/destroys threads very fast. Your > followup load-balancing fix got queued up in the scheduler tree for > v2.6.33 ~two weeks ago: > > 1b9508f: sched: Rate-limit newidle > > certainly handles some of the NUMA pain.
> If not, numbers telling us the > other story (and patches to extend 'perf bench sched' with matching > testcases) would be helpful, i'm not seeing problems on my NUMA testbox. Yay for more complexity. And it doesn't fix all problems introduced because it doesn't improve the now inherently far weaker NUMA affinity. > 1b9508f was certainly too late for 2.6.32, but since it appears to be > fine it can be forwarded to stable@kernel.org so it makes it into > 2.6.32.1. I totally disagree with this development / release process of yours. In my opinion it is *not* ok to leave a wake of known regressions with features that help some (non-regression) case. If it was agreed that rate limiting newidle is the right fix for the regression, and it was agreed that it is too late for the next release, then IMO the correct way to approach the release is to revert the offending patch until the new, combined, approach is tested and validated. Doing it your way, you release a 2.6.32 kernel with a known regression and also a case that gets a big speedup. This makes it impossible now to revert that patch without causing a regression for someone. And then you think you have a way to fix it, but it has not had enough testing, so it could cause yet other regressions or fail to fix the problem well enough. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: newidle balancing in NUMA domain? 2009-11-23 15:29 ` Nick Piggin 2009-11-23 15:37 ` Peter Zijlstra 2009-11-23 15:53 ` Mike Galbraith @ 2009-11-23 17:04 ` Ingo Molnar 2009-11-24 6:59 ` Nick Piggin 2 siblings, 1 reply; 30+ messages in thread From: Ingo Molnar @ 2009-11-23 17:04 UTC (permalink / raw) To: Nick Piggin; +Cc: Peter Zijlstra, Mike Galbraith, Linux Kernel Mailing List * Nick Piggin <npiggin@suse.de> wrote: > > .32 is kind of closed, with us being at -rc8. > > It's a bad regression though. It's about 3 months too late for that. Ideally we want performance regressions to be looked for and reported when the patches go into the devel tree. Failing that, -rc1 would be a good time to re-test whatever workload you care about. If you cannot test it in a regular fashion you can offload the testing to us, by adding a similar/equivalent workload to 'perf bench sched'. We'll make sure it stays sane. Thanks, Ingo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: newidle balancing in NUMA domain? 2009-11-23 17:04 ` Ingo Molnar @ 2009-11-24 6:59 ` Nick Piggin 2009-11-24 9:16 ` Ingo Molnar 0 siblings, 1 reply; 30+ messages in thread From: Nick Piggin @ 2009-11-24 6:59 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Zijlstra, Mike Galbraith, Linux Kernel Mailing List, Andrew Morton On Mon, Nov 23, 2009 at 06:04:45PM +0100, Ingo Molnar wrote: > > * Nick Piggin <npiggin@suse.de> wrote: > > > > .32 is kind of closed, with us being at -rc8. > > > > It's a bad regression though. > > It's about 3 months too late for that. Ideally we want performance Too late for what? Reporting and reverting a regression? I don't think so. It is not my problem if patches aren't tested well enough before being merged. If we release a kernel with this known problematic scheduler behaviour then it gives userspace application writers far harder targets, and also it will give *some* 2.6.32 users regressions if we find it has to be fixed in 2.6.33. > regressions to be looked for and reported when the patches go into the > devel tree. Failing that, -rc1 would be a good time to re-test > whatever workload you care about. > > If you cannot test it in a regular fashion you can offload the testing > to us, by adding a similar/equivalent workload to 'perf bench sched'. > We'll make sure it stays sane. Exactly what tests *were* done on it? Anything that might possibly trigger its obvious potential for detrimental behaviour? Something like netperf? ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: newidle balancing in NUMA domain? 2009-11-24 6:59 ` Nick Piggin @ 2009-11-24 9:16 ` Ingo Molnar 0 siblings, 0 replies; 30+ messages in thread From: Ingo Molnar @ 2009-11-24 9:16 UTC (permalink / raw) To: Nick Piggin Cc: Peter Zijlstra, Mike Galbraith, Linux Kernel Mailing List, Andrew Morton * Nick Piggin <npiggin@suse.de> wrote: > On Mon, Nov 23, 2009 at 06:04:45PM +0100, Ingo Molnar wrote: > > > > * Nick Piggin <npiggin@suse.de> wrote: > > > > > > .32 is kind of closed, with us being at -rc8. > > > > > > It's a bad regression though. > > > > It's about 3 months too late for that. Ideally we want performance > > Too late for what? Reporting and reverting a regression? Yes, the revert would be too intrusive so late in the release cycle, sorry. For such types of scheduler reverts we rarely go beyond -rc5 - the risks of breaking something are too big. Note that we queued up a fix for that about two weeks ago for v2.6.33 - that fix could be forwarded to stable@kernel.org for 2.6.32.1 merging. > [...] I don't think so. It is not my problem if patches aren't tested > well enough before being merged. There's certainly wide performance testing being done for scheduler-relevant workloads all the time. If you don't like the end result you are welcome to help out testing and fixing things (and certainly both). I'd also welcome help from you to extend 'perf bench sched' with new testcases. (it has only two so far - the tool just got started so there's lots of low-hanging fruit.) 'perf bench sched numa-cost' would certainly be useful to have. Thanks, Ingo ^ permalink raw reply [flat|nested] 30+ messages in thread
end of thread, other threads:[~2009-12-01 8:18 UTC | newest]

Thread overview: 30+ messages
2009-11-23 11:22 newidle balancing in NUMA domain? Nick Piggin
2009-11-23 11:36 ` Peter Zijlstra
2009-11-23 11:43 ` Nick Piggin
2009-11-23 11:50 ` Peter Zijlstra
2009-11-23 12:16 ` Nick Piggin
2009-11-23 11:45 ` Ingo Molnar
2009-11-23 12:01 ` Nick Piggin
2009-11-23 12:08 ` Ingo Molnar
2009-11-23 12:27 ` Nick Piggin
2009-11-23 12:46 ` Ingo Molnar
2009-11-24 6:36 ` Nick Piggin
2009-11-24 17:24 ` Jason Garrett-Glaser
2009-11-24 18:09 ` Mike Galbraith
2009-11-30 8:19 ` Nick Piggin
2009-12-01 8:18 ` Jason Garrett-Glaser
2009-11-23 14:37 ` Mike Galbraith
2009-11-23 15:11 ` Nick Piggin
2009-11-23 15:21 ` Peter Zijlstra
2009-11-23 15:29 ` Nick Piggin
2009-11-23 15:37 ` Peter Zijlstra
2009-11-24 6:54 ` Nick Piggin
2009-11-23 15:53 ` Mike Galbraith
2009-11-24 6:53 ` Nick Piggin
2009-11-24 8:40 ` Mike Galbraith
2009-11-24 8:58 ` Mike Galbraith
2009-11-24 9:11 ` Ingo Molnar
2009-11-30 8:27 ` Nick Piggin
2009-11-23 17:04 ` Ingo Molnar
2009-11-24 6:59 ` Nick Piggin
2009-11-24 9:16 ` Ingo Molnar