* [PATCH v3] sched,fair: skip newidle_balance if a wakeup is pending @ 2021-04-20 16:07 Rik van Riel 2021-04-21 17:27 ` Vincent Guittot 2021-04-22 7:36 ` [tip: sched/core] " tip-bot2 for Rik van Riel 0 siblings, 2 replies; 6+ messages in thread From: Rik van Riel @ 2021-04-20 16:07 UTC (permalink / raw) To: linux-kernel Cc: Kernel Team, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, Vincent Guittot, Mel Gorman, Valentin Schneider The try_to_wake_up function has an optimization where it can queue a task for wakeup on its previous CPU, if the task is still in the middle of going to sleep inside schedule(). Once schedule() re-enables IRQs, the task will be woken up with an IPI, and placed back on the runqueue. If we have such a wakeup pending, there is no need to search other CPUs for runnable tasks. Just skip (or bail out early from) newidle balancing, and run the just woken up task. For a memcache like workload test, this reduces total CPU use by about 2%, proportionally split between user and system time, and p99 and p95 application response time by 10% on average. The schedstats run_delay number shows a similar improvement. Signed-off-by: Rik van Riel <riel@surriel.com> --- kernel/sched/fair.c | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 69680158963f..fd80175c3b3e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -10594,6 +10594,14 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) u64 curr_cost = 0; update_misfit_status(NULL, this_rq); + + /* + * There is a task waiting to run. No need to search for one. + * Return 0; the task will be enqueued when switching to idle. + */ + if (this_rq->ttwu_pending) + return 0; + /* * We must set idle_stamp _before_ calling idle_balance(), such that we * measure the duration of idle_balance() as idle time. @@ -10661,7 +10669,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) * Stop searching for tasks to pull if there are * now runnable tasks on this rq. */ - if (pulled_task || this_rq->nr_running > 0) + if (pulled_task || this_rq->nr_running > 0 || + this_rq->ttwu_pending) break; } rcu_read_unlock(); @@ -10688,7 +10697,12 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) if (this_rq->nr_running != this_rq->cfs.h_nr_running) pulled_task = -1; - if (pulled_task) + /* + * If we are no longer idle, do not let the time spent here pull + * down this_rq->avg_idle. That could lead to newidle_balance not + * doing enough work, and the CPU actually going idle. + */ + if (pulled_task || this_rq->ttwu_pending) this_rq->idle_stamp = 0; rq_repin_lock(this_rq, rf); -- 2.25.4 ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH v3] sched,fair: skip newidle_balance if a wakeup is pending 2021-04-20 16:07 [PATCH v3] sched,fair: skip newidle_balance if a wakeup is pending Rik van Riel @ 2021-04-21 17:27 ` Vincent Guittot 2021-04-22 8:37 ` Vincent Guittot 2021-04-22 7:36 ` [tip: sched/core] " tip-bot2 for Rik van Riel 1 sibling, 1 reply; 6+ messages in thread From: Vincent Guittot @ 2021-04-21 17:27 UTC (permalink / raw) To: Rik van Riel Cc: linux-kernel, Kernel Team, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, Mel Gorman, Valentin Schneider Hi Rik, On Tue, 20 Apr 2021 at 18:07, Rik van Riel <riel@surriel.com> wrote: > > The try_to_wake_up function has an optimization where it can queue > a task for wakeup on its previous CPU, if the task is still in the > middle of going to sleep inside schedule(). > > Once schedule() re-enables IRQs, the task will be woken up with an > IPI, and placed back on the runqueue. > > If we have such a wakeup pending, there is no need to search other > CPUs for runnable tasks. Just skip (or bail out early from) newidle > balancing, and run the just woken up task. > > For a memcache like workload test, this reduces total CPU use by > about 2%, proportionally split between user and system time, > and p99 and p95 application response time by 10% on average. > The schedstats run_delay number shows a similar improvement. > > Signed-off-by: Rik van Riel <riel@surriel.com> > --- > kernel/sched/fair.c | 18 ++++++++++++++++-- > 1 file changed, 16 insertions(+), 2 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 69680158963f..fd80175c3b3e 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -10594,6 +10594,14 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) > u64 curr_cost = 0; > > update_misfit_status(NULL, this_rq); > + > + /* > + * There is a task waiting to run. No need to search for one. > + * Return 0; the task will be enqueued when switching to idle. > + */ > + if (this_rq->ttwu_pending) > + return 0; > + > /* > * We must set idle_stamp _before_ calling idle_balance(), such that we > * measure the duration of idle_balance() as idle time. > @@ -10661,7 +10669,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) > * Stop searching for tasks to pull if there are > * now runnable tasks on this rq. > */ > - if (pulled_task || this_rq->nr_running > 0) > + if (pulled_task || this_rq->nr_running > 0 || > + this_rq->ttwu_pending) > break; > } > rcu_read_unlock(); > @@ -10688,7 +10697,12 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) > if (this_rq->nr_running != this_rq->cfs.h_nr_running) > pulled_task = -1; > > - if (pulled_task) > + /* > + * If we are no longer idle, do not let the time spent here pull > + * down this_rq->avg_idle. That could lead to newidle_balance not > + * doing enough work, and the CPU actually going idle. > + */ > + if (pulled_task || this_rq->ttwu_pending) I'm still running some benchmarks to evaluate the impact of your patch and more especially the line above which clears this_rq->idle_stamp and skips the time spent in newidle_balance from being accounted for in avg_idle. I have some results which show some regression because of this test especially with hackbench. On large system, the time spent in newidle_balance can be significant and we can't ignore it just because this_rq->ttwu_pending is set while looping the domains because without newidle_balance the idle time would have been large and we end up screwing up the metric > this_rq->idle_stamp = 0; > > rq_repin_lock(this_rq, rf); > -- > 2.25.4 > > ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v3] sched,fair: skip newidle_balance if a wakeup is pending 2021-04-21 17:27 ` Vincent Guittot @ 2021-04-22 8:37 ` Vincent Guittot 2021-04-22 16:47 ` Rik van Riel 0 siblings, 1 reply; 6+ messages in thread From: Vincent Guittot @ 2021-04-22 8:37 UTC (permalink / raw) To: Rik van Riel Cc: linux-kernel, Kernel Team, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, Mel Gorman, Valentin Schneider On Wed, 21 Apr 2021 at 19:27, Vincent Guittot <vincent.guittot@linaro.org> wrote: > > Hi Rik, > > On Tue, 20 Apr 2021 at 18:07, Rik van Riel <riel@surriel.com> wrote: > > > > The try_to_wake_up function has an optimization where it can queue > > a task for wakeup on its previous CPU, if the task is still in the > > middle of going to sleep inside schedule(). > > > > Once schedule() re-enables IRQs, the task will be woken up with an > > IPI, and placed back on the runqueue. > > > > If we have such a wakeup pending, there is no need to search other > > CPUs for runnable tasks. Just skip (or bail out early from) newidle > > balancing, and run the just woken up task. > > > > For a memcache like workload test, this reduces total CPU use by > > about 2%, proportionally split between user and system time, > > and p99 and p95 application response time by 10% on average. > > The schedstats run_delay number shows a similar improvement. > > > > Signed-off-by: Rik van Riel <riel@surriel.com> > > --- > > kernel/sched/fair.c | 18 ++++++++++++++++-- > > 1 file changed, 16 insertions(+), 2 deletions(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 69680158963f..fd80175c3b3e 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -10594,6 +10594,14 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) > > u64 curr_cost = 0; > > > > update_misfit_status(NULL, this_rq); > > + > > + /* > > + * There is a task waiting to run. No need to search for one. > > + * Return 0; the task will be enqueued when switching to idle. > > + */ > > + if (this_rq->ttwu_pending) > > + return 0; > > + > > /* > > * We must set idle_stamp _before_ calling idle_balance(), such that we > > * measure the duration of idle_balance() as idle time. > > @@ -10661,7 +10669,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) > > * Stop searching for tasks to pull if there are > > * now runnable tasks on this rq. > > */ > > - if (pulled_task || this_rq->nr_running > 0) > > + if (pulled_task || this_rq->nr_running > 0 || > > + this_rq->ttwu_pending) > > break; > > } > > rcu_read_unlock(); > > @@ -10688,7 +10697,12 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) > > if (this_rq->nr_running != this_rq->cfs.h_nr_running) > > pulled_task = -1; > > > > - if (pulled_task) > > + /* > > + * If we are no longer idle, do not let the time spent here pull > > + * down this_rq->avg_idle. That could lead to newidle_balance not > > + * doing enough work, and the CPU actually going idle. > > + */ > > + if (pulled_task || this_rq->ttwu_pending) > > I'm still running some benchmarks to evaluate the impact of your patch > and more especially the line above which clears this_rq->idle_stamp > and skips the time spent in newidle_balance from being accounted for > in avg_idle. I have some results which show some regression because > of this test especially with hackbench. > On large system, the time spent in newidle_balance can be significant > and we can't ignore it just because this_rq->ttwu_pending is set while > looping the domains because without newidle_balance the idle time > would have been large and we end up screwing up the metric I confirmed that the line above generate hackbench regression on my large arm64 system (2 * 112 CPUs) I'm testing hackbench with various number of group : 1, 2, 4, 16, 32, 64, 128, 256 but I have only put the 2 results which significantly regress. The other ones are in the +/-1% variation range hackbench -g $group group v5.12-rc8+tip w/ this patch w/ this patch without the line above 64 2.862(+/- 9%) 2.952(+/-11%) -3% 2.807(+/- 7%) +2% 128 3.334(+/-10%) 3.561-+/-13%) -7% 3.181(+/- 6%) +4% > > > this_rq->idle_stamp = 0; > > > > rq_repin_lock(this_rq, rf); > > -- > > 2.25.4 > > > > ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v3] sched,fair: skip newidle_balance if a wakeup is pending 2021-04-22 8:37 ` Vincent Guittot @ 2021-04-22 16:47 ` Rik van Riel 0 siblings, 0 replies; 6+ messages in thread From: Rik van Riel @ 2021-04-22 16:47 UTC (permalink / raw) To: Vincent Guittot Cc: linux-kernel, Kernel Team, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, Mel Gorman, Valentin Schneider [-- Attachment #1: Type: text/plain, Size: 1286 bytes --] On Thu, 2021-04-22 at 10:37 +0200, Vincent Guittot wrote: > On Wed, 21 Apr 2021 at 19:27, Vincent Guittot > <vincent.guittot@linaro.org> wrote: > > > > > - if (pulled_task) > > > + /* > > > + * If we are no longer idle, do not let the time spent > > > here pull > > > + * down this_rq->avg_idle. That could lead to > > > newidle_balance not > > > + * doing enough work, and the CPU actually going idle. > > > + */ > > > + if (pulled_task || this_rq->ttwu_pending) > > > I confirmed that the line above generate hackbench regression on my > large arm64 system (2 * 112 CPUs) > I'm testing hackbench with various number of group : 1, 2, 4, 16, 32, > 64, 128, 256 but I have only put the 2 results which significantly > regress. The other ones are in the +/-1% variation range > > hackbench -g $group > > group v5.12-rc8+tip w/ this patch w/ this patch > without > the line above > 64 2.862(+/- 9%) 2.952(+/-11%) -3% 2.807(+/- 7%) +2% > 128 3.334(+/-10%) 3.561-+/-13%) -7% 3.181(+/- 6%) +4% OK, I guess this part of the patch needs additional work. I'll send a v4 with just the first two changes. Thank you for running those tests. -- All Rights Reversed. [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 6+ messages in thread
* [tip: sched/core] sched,fair: skip newidle_balance if a wakeup is pending 2021-04-20 16:07 [PATCH v3] sched,fair: skip newidle_balance if a wakeup is pending Rik van Riel 2021-04-21 17:27 ` Vincent Guittot @ 2021-04-22 7:36 ` tip-bot2 for Rik van Riel 2021-04-22 8:15 ` Peter Zijlstra 1 sibling, 1 reply; 6+ messages in thread From: tip-bot2 for Rik van Riel @ 2021-04-22 7:36 UTC (permalink / raw) To: linux-tip-commits; +Cc: Rik van Riel, Peter Zijlstra (Intel), x86, linux-kernel The following commit has been merged into the sched/core branch of tip: Commit-ID: 9c9f520a14670ad59da2f700660f7601ec9e0b07 Gitweb: https://git.kernel.org/tip/9c9f520a14670ad59da2f700660f7601ec9e0b07 Author: Rik van Riel <riel@surriel.com> AuthorDate: Tue, 20 Apr 2021 12:07:05 -04:00 Committer: Peter Zijlstra <peterz@infradead.org> CommitterDate: Wed, 21 Apr 2021 13:55:43 +02:00 sched,fair: skip newidle_balance if a wakeup is pending The try_to_wake_up function has an optimization where it can queue a task for wakeup on its previous CPU, if the task is still in the middle of going to sleep inside schedule(). Once schedule() re-enables IRQs, the task will be woken up with an IPI, and placed back on the runqueue. If we have such a wakeup pending, there is no need to search other CPUs for runnable tasks. Just skip (or bail out early from) newidle balancing, and run the just woken up task. For a memcache like workload test, this reduces total CPU use by about 2%, proportionally split between user and system time, and p99 and p95 application response time by 10% on average. The schedstats run_delay number shows a similar improvement. Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210420120705.5c705d4b@imladris.surriel.com --- kernel/sched/fair.c | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1d75af1..83cd2bd 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -10592,6 +10592,14 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) u64 curr_cost = 0; update_misfit_status(NULL, this_rq); + + /* + * There is a task waiting to run. No need to search for one. + * Return 0; the task will be enqueued when switching to idle. + */ + if (this_rq->ttwu_pending) + return 0; + /* * We must set idle_stamp _before_ calling idle_balance(), such that we * measure the duration of idle_balance() as idle time. @@ -10657,7 +10665,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) * Stop searching for tasks to pull if there are * now runnable tasks on this rq. */ - if (pulled_task || this_rq->nr_running > 0) + if (pulled_task || this_rq->nr_running > 0 || + this_rq->ttwu_pending) break; } rcu_read_unlock(); @@ -10684,7 +10693,12 @@ out: if (time_after(this_rq->next_balance, next_balance)) this_rq->next_balance = next_balance; - if (pulled_task) + /* + * If we are no longer idle, do not let the time spent here pull + * down this_rq->avg_idle. That could lead to newidle_balance not + * doing enough work, and the CPU actually going idle. + */ + if (pulled_task || this_rq->ttwu_pending) this_rq->idle_stamp = 0; else nohz_newidle_balance(this_rq); ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [tip: sched/core] sched,fair: skip newidle_balance if a wakeup is pending 2021-04-22 7:36 ` [tip: sched/core] " tip-bot2 for Rik van Riel @ 2021-04-22 8:15 ` Peter Zijlstra 0 siblings, 0 replies; 6+ messages in thread From: Peter Zijlstra @ 2021-04-22 8:15 UTC (permalink / raw) To: linux-kernel; +Cc: linux-tip-commits, Rik van Riel, x86, Vincent Guittot On Thu, Apr 22, 2021 at 07:36:00AM -0000, tip-bot2 for Rik van Riel wrote: > @@ -10684,7 +10693,12 @@ out: > if (time_after(this_rq->next_balance, next_balance)) > this_rq->next_balance = next_balance; > > - if (pulled_task) > + /* > + * If we are no longer idle, do not let the time spent here pull > + * down this_rq->avg_idle. That could lead to newidle_balance not > + * doing enough work, and the CPU actually going idle. > + */ > + if (pulled_task || this_rq->ttwu_pending) > this_rq->idle_stamp = 0; I've un-committed this patch, because vingu was reporting increased idle time because of this hunk. I had mistakenly assumed that was sorted with v3, sorry for not keeping better track of things. (also, now that I look again, please also fix the Subject to have a capital after the :) ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2021-04-22 16:47 UTC | newest] Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-04-20 16:07 [PATCH v3] sched,fair: skip newidle_balance if a wakeup is pending Rik van Riel 2021-04-21 17:27 ` Vincent Guittot 2021-04-22 8:37 ` Vincent Guittot 2021-04-22 16:47 ` Rik van Riel 2021-04-22 7:36 ` [tip: sched/core] " tip-bot2 for Rik van Riel 2021-04-22 8:15 ` Peter Zijlstra
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).