* [PATCH v3] sched,fair: skip newidle_balance if a wakeup is pending
@ 2021-04-20 16:07 Rik van Riel
  2021-04-21 17:27 ` Vincent Guittot
  2021-04-22  7:36 ` [tip: sched/core] " tip-bot2 for Rik van Riel
  0 siblings, 2 replies; 7+ messages in thread
From: Rik van Riel @ 2021-04-20 16:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kernel Team, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	Vincent Guittot, Mel Gorman, Valentin Schneider

The try_to_wake_up function has an optimization where it can queue
a task for wakeup on its previous CPU, if the task is still in the
middle of going to sleep inside schedule().

Once schedule() re-enables IRQs, the task will be woken up with an
IPI, and placed back on the runqueue.

If we have such a wakeup pending, there is no need to search other
CPUs for runnable tasks. Just skip (or bail out early from) newidle
balancing, and run the just-woken-up task.

For a memcache-like workload test, this reduces total CPU use by
about 2%, proportionally split between user and system time,
and p99 and p95 application response time by 10% on average.
The schedstats run_delay number shows a similar improvement.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 kernel/sched/fair.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 69680158963f..fd80175c3b3e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10594,6 +10594,14 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	u64 curr_cost = 0;
 
 	update_misfit_status(NULL, this_rq);
+
+	/*
+	 * There is a task waiting to run. No need to search for one.
+	 * Return 0; the task will be enqueued when switching to idle.
+	 */
+	if (this_rq->ttwu_pending)
+		return 0;
+
 	/*
 	 * We must set idle_stamp _before_ calling idle_balance(), such that we
 	 * measure the duration of idle_balance() as idle time.
@@ -10661,7 +10669,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 		 * Stop searching for tasks to pull if there are
 		 * now runnable tasks on this rq.
 		 */
-		if (pulled_task || this_rq->nr_running > 0)
+		if (pulled_task || this_rq->nr_running > 0 ||
+		    this_rq->ttwu_pending)
 			break;
 	}
 	rcu_read_unlock();
@@ -10688,7 +10697,12 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	if (this_rq->nr_running != this_rq->cfs.h_nr_running)
 		pulled_task = -1;
 
-	if (pulled_task)
+	/*
+	 * If we are no longer idle, do not let the time spent here pull
+	 * down this_rq->avg_idle. That could lead to newidle_balance not
+	 * doing enough work, and the CPU actually going idle.
+	 */
+	if (pulled_task || this_rq->ttwu_pending)
 		this_rq->idle_stamp = 0;
 
 	rq_repin_lock(this_rq, rf);
-- 
2.25.4



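For reference, a minimal standalone sketch of the queued-wakeup mechanism
the patch builds on. The names mirror ttwu_queue_wakelist() and
sched_ttwu_pending() in kernel/sched/core.c, but the bodies here are
simplified toy stand-ins, not the real implementations:

#include <stdio.h>
#include <stdbool.h>

struct rq {
	int nr_running;
	bool ttwu_pending;	/* a remote waker queued a task for this CPU */
};

/* Waker side: rather than taking the remote runqueue lock, flag the
 * wakeup (the real kernel also sends an IPI to the target CPU). */
static void ttwu_queue_wakelist(struct rq *rq)
{
	rq->ttwu_pending = true;
}

/* Target side: runs from the IPI, or when the CPU is about to switch
 * to idle; drains the queued wakeup onto the runqueue. */
static void sched_ttwu_pending(struct rq *rq)
{
	if (rq->ttwu_pending) {
		rq->nr_running++;	/* enqueue the woken task */
		rq->ttwu_pending = false;
	}
}

/* The patch: if a wakeup is already queued, newidle balancing gains
 * nothing from searching other CPUs for work. */
static int newidle_balance(struct rq *this_rq)
{
	if (this_rq->ttwu_pending)
		return 0;
	/* ... expensive search across the sched domains elided ... */
	return 0;
}

int main(void)
{
	struct rq rq = { .nr_running = 0, .ttwu_pending = false };

	ttwu_queue_wakelist(&rq);	/* another CPU wakes our task */
	newidle_balance(&rq);		/* bails out immediately */
	sched_ttwu_pending(&rq);	/* the "IPI" lands, task is enqueued */
	printf("nr_running = %d\n", rq.nr_running);
	return 0;
}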

* Re: [PATCH v3] sched,fair: skip newidle_balance if a wakeup is pending
  2021-04-20 16:07 [PATCH v3] sched,fair: skip newidle_balance if a wakeup is pending Rik van Riel
@ 2021-04-21 17:27 ` Vincent Guittot
  2021-04-22  8:37   ` Vincent Guittot
  2021-04-22  7:36 ` [tip: sched/core] " tip-bot2 for Rik van Riel
  1 sibling, 1 reply; 7+ messages in thread
From: Vincent Guittot @ 2021-04-21 17:27 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Kernel Team, Peter Zijlstra, Ingo Molnar,
	Dietmar Eggemann, Mel Gorman, Valentin Schneider

Hi Rik,

On Tue, 20 Apr 2021 at 18:07, Rik van Riel <riel@surriel.com> wrote:
>
> The try_to_wake_up function has an optimization where it can queue
> a task for wakeup on its previous CPU, if the task is still in the
> middle of going to sleep inside schedule().
>
> Once schedule() re-enables IRQs, the task will be woken up with an
> IPI, and placed back on the runqueue.
>
> If we have such a wakeup pending, there is no need to search other
> CPUs for runnable tasks. Just skip (or bail out early from) newidle
> balancing, and run the just-woken-up task.
>
> For a memcache-like workload test, this reduces total CPU use by
> about 2%, proportionally split between user and system time,
> and p99 and p95 application response time by 10% on average.
> The schedstats run_delay number shows a similar improvement.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  kernel/sched/fair.c | 18 ++++++++++++++++--
>  1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 69680158963f..fd80175c3b3e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10594,6 +10594,14 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>         u64 curr_cost = 0;
>
>         update_misfit_status(NULL, this_rq);
> +
> +       /*
> +        * There is a task waiting to run. No need to search for one.
> +        * Return 0; the task will be enqueued when switching to idle.
> +        */
> +       if (this_rq->ttwu_pending)
> +               return 0;
> +
>         /*
>          * We must set idle_stamp _before_ calling idle_balance(), such that we
>          * measure the duration of idle_balance() as idle time.
> @@ -10661,7 +10669,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>                  * Stop searching for tasks to pull if there are
>                  * now runnable tasks on this rq.
>                  */
> -               if (pulled_task || this_rq->nr_running > 0)
> +               if (pulled_task || this_rq->nr_running > 0 ||
> +                   this_rq->ttwu_pending)
>                         break;
>         }
>         rcu_read_unlock();
> @@ -10688,7 +10697,12 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>         if (this_rq->nr_running != this_rq->cfs.h_nr_running)
>                 pulled_task = -1;
>
> -       if (pulled_task)
> +       /*
> +        * If we are no longer idle, do not let the time spent here pull
> +        * down this_rq->avg_idle. That could lead to newidle_balance not
> +        * doing enough work, and the CPU actually going idle.
> +        */
> +       if (pulled_task || this_rq->ttwu_pending)

I'm still running benchmarks to evaluate the impact of your patch,
and more especially of the line above, which clears this_rq->idle_stamp
and prevents the time spent in newidle_balance from being accounted
for in avg_idle. I have some results that show a regression caused by
this test, especially with hackbench.
On a large system, the time spent in newidle_balance can be
significant, and we can't ignore it just because this_rq->ttwu_pending
was set while looping over the domains: without newidle_balance the
idle time would have been large, and we end up screwing up the metric.

>                 this_rq->idle_stamp = 0;
>
>         rq_repin_lock(this_rq, rf);
> --
> 2.25.4
>
>

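To make the avg_idle concern concrete: update_avg() below matches the
EWMA helper in kernel/sched/core.c, while the scenario around it is a
simplified standalone illustration, with made-up numbers, of how dropped
idle samples keep avg_idle low and make later newidle_balance() calls
give up early:

#include <stdio.h>

/* Matches update_avg() in kernel/sched/core.c: each sample moves the
 * average one eighth of the way toward itself. */
static void update_avg(unsigned long long *avg, unsigned long long sample)
{
	long long diff = (long long)(sample - *avg);
	*avg += diff / 8;
}

int main(void)
{
	unsigned long long with_samples = 500000;	/* avg_idle, ns */
	unsigned long long without_samples = 500000;	/* avg_idle, ns */
	unsigned long long long_idle = 2000000;	/* one long idle spell, ns */
	int i;

	/* idle_stamp is set on entry to newidle_balance(); when the next
	 * wakeup arrives, delta = now - idle_stamp is fed to update_avg(). */
	for (i = 0; i < 4; i++) {
		/* Normal path: idle_stamp survives, the interval is sampled. */
		update_avg(&with_samples, long_idle);
		/* Patched path: idle_stamp was cleared on ttwu_pending, so
		 * the sample is dropped and without_samples never moves. */
	}
	printf("avg_idle with samples:    %llu ns\n", with_samples);
	printf("avg_idle without samples: %llu ns\n", without_samples);
	/* The low estimate then trips the early bailout in newidle_balance()
	 * (avg_idle < curr_cost + sd->max_newidle_lb_cost), which is
	 * consistent with the hackbench regression described above. */
	return 0;
}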

* [tip: sched/core] sched,fair: skip newidle_balance if a wakeup is pending
  2021-04-20 16:07 [PATCH v3] sched,fair: skip newidle_balance if a wakeup is pending Rik van Riel
  2021-04-21 17:27 ` Vincent Guittot
@ 2021-04-22  7:36 ` tip-bot2 for Rik van Riel
  2021-04-22  8:15   ` Peter Zijlstra
  1 sibling, 1 reply; 7+ messages in thread
From: tip-bot2 for Rik van Riel @ 2021-04-22  7:36 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Rik van Riel, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     9c9f520a14670ad59da2f700660f7601ec9e0b07
Gitweb:        https://git.kernel.org/tip/9c9f520a14670ad59da2f700660f7601ec9e0b07
Author:        Rik van Riel <riel@surriel.com>
AuthorDate:    Tue, 20 Apr 2021 12:07:05 -04:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 21 Apr 2021 13:55:43 +02:00

sched,fair: skip newidle_balance if a wakeup is pending

The try_to_wake_up function has an optimization where it can queue
a task for wakeup on its previous CPU, if the task is still in the
middle of going to sleep inside schedule().

Once schedule() re-enables IRQs, the task will be woken up with an
IPI, and placed back on the runqueue.

If we have such a wakeup pending, there is no need to search other
CPUs for runnable tasks. Just skip (or bail out early from) newidle
balancing, and run the just-woken-up task.

For a memcache-like workload test, this reduces total CPU use by
about 2%, proportionally split between user and system time,
and p99 and p95 application response time by 10% on average.
The schedstats run_delay number shows a similar improvement.

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210420120705.5c705d4b@imladris.surriel.com
---
 kernel/sched/fair.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d75af1..83cd2bd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10592,6 +10592,14 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	u64 curr_cost = 0;
 
 	update_misfit_status(NULL, this_rq);
+
+	/*
+	 * There is a task waiting to run. No need to search for one.
+	 * Return 0; the task will be enqueued when switching to idle.
+	 */
+	if (this_rq->ttwu_pending)
+		return 0;
+
 	/*
 	 * We must set idle_stamp _before_ calling idle_balance(), such that we
 	 * measure the duration of idle_balance() as idle time.
@@ -10657,7 +10665,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 		 * Stop searching for tasks to pull if there are
 		 * now runnable tasks on this rq.
 		 */
-		if (pulled_task || this_rq->nr_running > 0)
+		if (pulled_task || this_rq->nr_running > 0 ||
+		    this_rq->ttwu_pending)
 			break;
 	}
 	rcu_read_unlock();
@@ -10684,7 +10693,12 @@ out:
 	if (time_after(this_rq->next_balance, next_balance))
 		this_rq->next_balance = next_balance;
 
-	if (pulled_task)
+	/*
+	 * If we are no longer idle, do not let the time spent here pull
+	 * down this_rq->avg_idle. That could lead to newidle_balance not
+	 * doing enough work, and the CPU actually going idle.
+	 */
+	if (pulled_task || this_rq->ttwu_pending)
 		this_rq->idle_stamp = 0;
 	else
 		nohz_newidle_balance(this_rq);


* Re: [tip: sched/core] sched,fair: skip newidle_balance if a wakeup is pending
  2021-04-22  7:36 ` [tip: sched/core] " tip-bot2 for Rik van Riel
@ 2021-04-22  8:15   ` Peter Zijlstra
  0 siblings, 0 replies; 7+ messages in thread
From: Peter Zijlstra @ 2021-04-22  8:15 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-tip-commits, Rik van Riel, x86, Vincent Guittot

On Thu, Apr 22, 2021 at 07:36:00AM -0000, tip-bot2 for Rik van Riel wrote:
> @@ -10684,7 +10693,12 @@ out:
>  	if (time_after(this_rq->next_balance, next_balance))
>  		this_rq->next_balance = next_balance;
>  
> -	if (pulled_task)
> +	/*
> +	 * If we are no longer idle, do not let the time spent here pull
> +	 * down this_rq->avg_idle. That could lead to newidle_balance not
> +	 * doing enough work, and the CPU actually going idle.
> +	 */
> +	if (pulled_task || this_rq->ttwu_pending)
>  		this_rq->idle_stamp = 0;

I've un-committed this patch, because vingu reported increased idle
time caused by this hunk. I had mistakenly assumed that was sorted
with v3; sorry for not keeping better track of things.

(also, now that I look again, please also fix the Subject to have a
capital after the :)




* Re: [PATCH v3] sched,fair: skip newidle_balance if a wakeup is pending
  2021-04-21 17:27 ` Vincent Guittot
@ 2021-04-22  8:37   ` Vincent Guittot
  2021-04-22 16:47     ` Rik van Riel
  0 siblings, 1 reply; 7+ messages in thread
From: Vincent Guittot @ 2021-04-22  8:37 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Kernel Team, Peter Zijlstra, Ingo Molnar,
	Dietmar Eggemann, Mel Gorman, Valentin Schneider

On Wed, 21 Apr 2021 at 19:27, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> Hi Rik,
>
> On Tue, 20 Apr 2021 at 18:07, Rik van Riel <riel@surriel.com> wrote:
> >
> > The try_to_wake_up function has an optimization where it can queue
> > a task for wakeup on its previous CPU, if the task is still in the
> > middle of going to sleep inside schedule().
> >
> > Once schedule() re-enables IRQs, the task will be woken up with an
> > IPI, and placed back on the runqueue.
> >
> > If we have such a wakeup pending, there is no need to search other
> > CPUs for runnable tasks. Just skip (or bail out early from) newidle
> > balancing, and run the just-woken-up task.
> >
> > For a memcache-like workload test, this reduces total CPU use by
> > about 2%, proportionally split between user and system time,
> > and p99 and p95 application response time by 10% on average.
> > The schedstats run_delay number shows a similar improvement.
> >
> > Signed-off-by: Rik van Riel <riel@surriel.com>
> > ---
> >  kernel/sched/fair.c | 18 ++++++++++++++++--
> >  1 file changed, 16 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 69680158963f..fd80175c3b3e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10594,6 +10594,14 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >         u64 curr_cost = 0;
> >
> >         update_misfit_status(NULL, this_rq);
> > +
> > +       /*
> > +        * There is a task waiting to run. No need to search for one.
> > +        * Return 0; the task will be enqueued when switching to idle.
> > +        */
> > +       if (this_rq->ttwu_pending)
> > +               return 0;
> > +
> >         /*
> >          * We must set idle_stamp _before_ calling idle_balance(), such that we
> >          * measure the duration of idle_balance() as idle time.
> > @@ -10661,7 +10669,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >                  * Stop searching for tasks to pull if there are
> >                  * now runnable tasks on this rq.
> >                  */
> > -               if (pulled_task || this_rq->nr_running > 0)
> > +               if (pulled_task || this_rq->nr_running > 0 ||
> > +                   this_rq->ttwu_pending)
> >                         break;
> >         }
> >         rcu_read_unlock();
> > @@ -10688,7 +10697,12 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >         if (this_rq->nr_running != this_rq->cfs.h_nr_running)
> >                 pulled_task = -1;
> >
> > -       if (pulled_task)
> > +       /*
> > +        * If we are no longer idle, do not let the time spent here pull
> > +        * down this_rq->avg_idle. That could lead to newidle_balance not
> > +        * doing enough work, and the CPU actually going idle.
> > +        */
> > +       if (pulled_task || this_rq->ttwu_pending)
>
> I'm still running some benchmarks to evaluate the impact of your patch
> and more especially the line above which clears this_rq->idle_stamp
> and skips the time spent in newidle_balance from being accounted for
> in avg_idle. I have some results which  show some regression because
> of this test especially with hackbench.
> On large system, the time spent in newidle_balance can be significant
> and we can't ignore it just because this_rq->ttwu_pending is set while
> looping the domains because without newidle_balance the idle time
> would have been large and we end up screwing up the metric

I confirmed that the line above generates a hackbench regression on my
large arm64 system (2 * 112 CPUs).
I tested hackbench with various numbers of groups: 1, 2, 4, 16, 32,
64, 128, 256, but have only included the 2 results that significantly
regress; the others are within the +/-1% variation range.

hackbench -g $group

group    v5.12-rc8+tip    w/ this patch          w/ this patch, without the line above
64       2.862(+/- 9%)    2.952(+/-11%) -3%      2.807(+/- 7%) +2%
128      3.334(+/-10%)    3.561(+/-13%) -7%      3.181(+/- 6%) +4%



>
> >                 this_rq->idle_stamp = 0;
> >
> >         rq_repin_lock(this_rq, rf);
> > --
> > 2.25.4
> >
> >


* Re: [PATCH v3] sched,fair: skip newidle_balance if a wakeup is pending
  2021-04-22  8:37   ` Vincent Guittot
@ 2021-04-22 16:47     ` Rik van Riel
  0 siblings, 0 replies; 7+ messages in thread
From: Rik van Riel @ 2021-04-22 16:47 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Kernel Team, Peter Zijlstra, Ingo Molnar,
	Dietmar Eggemann, Mel Gorman, Valentin Schneider


On Thu, 2021-04-22 at 10:37 +0200, Vincent Guittot wrote:
> On Wed, 21 Apr 2021 at 19:27, Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> > 
> > > -       if (pulled_task)
> > > +       /*
> > > +        * If we are no longer idle, do not let the time spent
> > > here pull
> > > +        * down this_rq->avg_idle. That could lead to
> > > newidle_balance not
> > > +        * doing enough work, and the CPU actually going idle.
> > > +        */
> > > +       if (pulled_task || this_rq->ttwu_pending)
> > 
> I confirmed that the line above generates a hackbench regression on my
> large arm64 system (2 * 112 CPUs).
> I tested hackbench with various numbers of groups: 1, 2, 4, 16, 32,
> 64, 128, 256, but have only included the 2 results that significantly
> regress; the others are within the +/-1% variation range.
>
> hackbench -g $group
>
> group    v5.12-rc8+tip    w/ this patch          w/ this patch, without the line above
> 64       2.862(+/- 9%)    2.952(+/-11%) -3%      2.807(+/- 7%) +2%
> 128      3.334(+/-10%)    3.561(+/-13%) -7%      3.181(+/- 6%) +4%

OK, I guess this part of the patch needs additional work.

I'll send a v4 with just the first two changes.

Thank you for running those tests.

-- 
All Rights Reversed.



* [tip: sched/core] sched,fair: Skip newidle_balance if a wakeup is pending
  2021-04-22 17:02 [PATCH v4] sched,fair: Skip " Rik van Riel
@ 2021-05-12 10:28 ` tip-bot2 for Rik van Riel
  0 siblings, 0 replies; 7+ messages in thread
From: tip-bot2 for Rik van Riel @ 2021-05-12 10:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Rik van Riel, Peter Zijlstra (Intel),
	Vincent Guittot, Mel Gorman, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     e5e678e4fea26d73444f4427cbbaeab4fa79ecee
Gitweb:        https://git.kernel.org/tip/e5e678e4fea26d73444f4427cbbaeab4fa79ecee
Author:        Rik van Riel <riel@surriel.com>
AuthorDate:    Thu, 22 Apr 2021 13:02:36 -04:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 12 May 2021 11:43:23 +02:00

sched,fair: Skip newidle_balance if a wakeup is pending

The try_to_wake_up function has an optimization where it can queue
a task for wakeup on its previous CPU, if the task is still in the
middle of going to sleep inside schedule().

Once schedule() re-enables IRQs, the task will be woken up with an
IPI, and placed back on the runqueue.

If we have such a wakeup pending, there is no need to search other
CPUs for runnable tasks. Just skip (or bail out early from) newidle
balancing, and run the just-woken-up task.

For a memcache-like workload test, this reduces total CPU use by
about 2%, proportionally split between user and system time,
and p99 and p95 application response time by 10% on average.
The schedstats run_delay number shows a similar improvement.

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/20210422130236.0bb353df@imladris.surriel.com
---
 kernel/sched/fair.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3248e24..d10c6cc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10592,6 +10592,14 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	u64 curr_cost = 0;
 
 	update_misfit_status(NULL, this_rq);
+
+	/*
+	 * There is a task waiting to run. No need to search for one.
+	 * Return 0; the task will be enqueued when switching to idle.
+	 */
+	if (this_rq->ttwu_pending)
+		return 0;
+
 	/*
 	 * We must set idle_stamp _before_ calling idle_balance(), such that we
 	 * measure the duration of idle_balance() as idle time.
@@ -10657,7 +10665,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 		 * Stop searching for tasks to pull if there are
 		 * now runnable tasks on this rq.
 		 */
-		if (pulled_task || this_rq->nr_running > 0)
+		if (pulled_task || this_rq->nr_running > 0 ||
+		    this_rq->ttwu_pending)
 			break;
 	}
 	rcu_read_unlock();

