linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/4] sched/fair: Active balancer RT/DL preemption fix
@ 2019-08-15 14:51 Valentin Schneider
  2019-08-15 14:51 ` [PATCH v2 1/4] sched/fair: Make need_active_balance() return bools Valentin Schneider
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Valentin Schneider @ 2019-08-15 14:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, peterz, vincent.guittot, tglx, qais.yousef

Vincent's load balance rework [1] got me thinking about how and where we
use rq.nr_running vs rq.cfs.h_nr_running checks, and this led me to
stare intently at the active load balancer.

I haven't seen it happen (yet), but from reading the code it really looks
like we can have some scenarios where the cpu_stopper ends up preempting
a > CFS class task, since we never actually look at what the remote rq's
running task is.

This series shuffles things around the CFS active load balancer to prevent
this from happening.

- Patch 1 is a freebie cleanup
- Patch 2 is a preparatory code move
- Patch 3 adds h_nr_running checks
- Patch 4 adds a sched class check + detach_one_task() to the active balance

This is based on top of today's tip/sched/core:
  a46d14eca7b7 ("sched/fair: Use rq_lock/unlock in online_fair_sched_group")

v1 -> v2:
  - (new patch) Added need_active_balance() cleanup

  - Tweaked active balance code move to respect existing
    sd->nr_balance_failed modifications
  - Added explicit checks of active_load_balance()'s return value
  
  - Added an h_nr_running < 1 check before kicking the cpu_stopper

  - Added a detach_one_task() call in active_load_balance() when the remote
    rq's running task is > CFS

[1]: https://lore.kernel.org/lkml/1564670424-26023-1-git-send-email-vincent.guittot@linaro.org/

Valentin Schneider (4):
  sched/fair: Make need_active_balance() return bools
  sched/fair: Move active balance logic to its own function
  sched/fair: Check for CFS tasks before detach_one_task()
  sched/fair: Prevent active LB from preempting higher sched classes

 kernel/sched/fair.c | 151 ++++++++++++++++++++++++++++----------------
 1 file changed, 95 insertions(+), 56 deletions(-)

--
2.22.0


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 1/4] sched/fair: Make need_active_balance() return bools
  2019-08-15 14:51 [PATCH v2 0/4] sched/fair: Active balancer RT/DL preemption fix Valentin Schneider
@ 2019-08-15 14:51 ` Valentin Schneider
  2019-08-15 14:51 ` [PATCH v2 2/4] sched/fair: Move active balance logic to its own function Valentin Schneider
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Valentin Schneider @ 2019-08-15 14:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, peterz, vincent.guittot, tglx, qais.yousef

That function has no need to return ints, make it return bools.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1054d2cf6aaa..22be1ca8d117 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8734,12 +8734,12 @@ voluntary_active_balance(struct lb_env *env)
 	return 0;
 }
 
-static int need_active_balance(struct lb_env *env)
+static bool need_active_balance(struct lb_env *env)
 {
 	struct sched_domain *sd = env->sd;
 
 	if (voluntary_active_balance(env))
-		return 1;
+		return true;
 
 	return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
 }
-- 
2.22.0



* [PATCH v2 2/4] sched/fair: Move active balance logic to its own function
  2019-08-15 14:51 [PATCH v2 0/4] sched/fair: Active balancer RT/DL preemption fix Valentin Schneider
  2019-08-15 14:51 ` [PATCH v2 1/4] sched/fair: Make need_active_balance() return bools Valentin Schneider
@ 2019-08-15 14:51 ` Valentin Schneider
  2019-10-01 11:36   ` Srikar Dronamraju
  2019-08-15 14:51 ` [PATCH v2 3/4] sched/fair: Check for CFS tasks before detach_one_task() Valentin Schneider
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Valentin Schneider @ 2019-08-15 14:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, peterz, vincent.guittot, tglx, qais.yousef

The logic to trigger an active balance is already quite lengthy, and
as I'm about to add yet another unlock condition it's probably better
to give it its own corner.

The only annoying requirement is that we need to branch to
out_one_pinned when the active balance is cancelled due to the running
task's affinity. Something like < 0, == 0 and > 0 return values could
suffice, but I went for an enum for clarity.

No change in functionality intended.

Note that the final

  status != cancelled_affinity

check is there to maintain the existing behaviour, i.e. bump
sd->nr_balance_failed both when the cpu_stopper has been kicked and
when it hasn't, which doesn't sound entirely right to me.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 126 ++++++++++++++++++++++++++------------------
 1 file changed, 74 insertions(+), 52 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 22be1ca8d117..751f41085f47 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8746,6 +8746,72 @@ static bool need_active_balance(struct lb_env *env)
 
 static int active_load_balance_cpu_stop(void *data);
 
+/* Active load balance */
+enum alb_status {
+	cancelled = 0,
+	cancelled_affinity,
+	started
+};
+
+/* Attempt to move a remote rq's running task to env's dst_cpu */
+static inline enum alb_status active_load_balance(struct lb_env *env)
+{
+	enum alb_status status = cancelled;
+	struct sched_domain *sd = env->sd;
+	struct rq *busiest = env->src_rq;
+	unsigned long flags;
+
+	schedstat_inc(sd->lb_failed[env->idle]);
+	/*
+	 * Increment the failure counter only on periodic balance.
+	 * We do not want newidle balance, which can be very frequent, pollute
+	 * the failure counter causing excessive cache_hot migrations and
+	 * active balances.
+	 */
+	if (env->idle != CPU_NEWLY_IDLE)
+		sd->nr_balance_failed++;
+
+	if (!need_active_balance(env))
+		goto out;
+
+	raw_spin_lock_irqsave(&busiest->lock, flags);
+
+	/*
+	 * Don't kick the active_load_balance_cpu_stop, if the curr task on
+	 * busiest CPU can't be moved to dst_cpu:
+	 */
+	if (!cpumask_test_cpu(env->dst_cpu, busiest->curr->cpus_ptr)) {
+		env->flags |= LBF_ALL_PINNED;
+		status = cancelled_affinity;
+		goto unlock;
+	}
+
+	/*
+	 * ->active_balance synchronizes accesses to ->active_balance_work.
+	 * Once set, it's cleared only after active load balance is finished.
+	 */
+	if (!busiest->active_balance) {
+		busiest->active_balance = 1;
+		busiest->push_cpu = env->dst_cpu;
+		status = started;
+	}
+
+unlock:
+	raw_spin_unlock_irqrestore(&busiest->lock, flags);
+
+	if (status == started)
+		stop_one_cpu_nowait(cpu_of(busiest),
+				    active_load_balance_cpu_stop, busiest,
+				    &busiest->active_balance_work);
+
+	/* We've kicked active balancing, force task migration. */
+	if (status != cancelled_affinity)
+		sd->nr_balance_failed = sd->cache_nice_tries + 1;
+
+out:
+	return status;
+}
+
 static int should_we_balance(struct lb_env *env)
 {
 	struct sched_group *sg = env->sd->groups;
@@ -8792,7 +8858,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 			struct sched_domain *sd, enum cpu_idle_type idle,
 			int *continue_balancing)
 {
-	int ld_moved, cur_ld_moved, active_balance = 0;
+	enum alb_status active_balance = cancelled;
+	int ld_moved, cur_ld_moved;
 	struct sched_domain *sd_parent = sd->parent;
 	struct sched_group *group;
 	struct rq *busiest;
@@ -8950,61 +9017,16 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		}
 	}
 
-	if (!ld_moved) {
-		schedstat_inc(sd->lb_failed[idle]);
-		/*
-		 * Increment the failure counter only on periodic balance.
-		 * We do not want newidle balance, which can be very
-		 * frequent, pollute the failure counter causing
-		 * excessive cache_hot migrations and active balances.
-		 */
-		if (idle != CPU_NEWLY_IDLE)
-			sd->nr_balance_failed++;
-
-		if (need_active_balance(&env)) {
-			unsigned long flags;
-
-			raw_spin_lock_irqsave(&busiest->lock, flags);
-
-			/*
-			 * Don't kick the active_load_balance_cpu_stop,
-			 * if the curr task on busiest CPU can't be
-			 * moved to this_cpu:
-			 */
-			if (!cpumask_test_cpu(this_cpu, busiest->curr->cpus_ptr)) {
-				raw_spin_unlock_irqrestore(&busiest->lock,
-							    flags);
-				env.flags |= LBF_ALL_PINNED;
-				goto out_one_pinned;
-			}
-
-			/*
-			 * ->active_balance synchronizes accesses to
-			 * ->active_balance_work.  Once set, it's cleared
-			 * only after active load balance is finished.
-			 */
-			if (!busiest->active_balance) {
-				busiest->active_balance = 1;
-				busiest->push_cpu = this_cpu;
-				active_balance = 1;
-			}
-			raw_spin_unlock_irqrestore(&busiest->lock, flags);
-
-			if (active_balance) {
-				stop_one_cpu_nowait(cpu_of(busiest),
-					active_load_balance_cpu_stop, busiest,
-					&busiest->active_balance_work);
-			}
-
-			/* We've kicked active balancing, force task migration. */
-			sd->nr_balance_failed = sd->cache_nice_tries+1;
-		}
-	} else
+	if (!ld_moved)
+		active_balance = active_load_balance(&env);
+	else
 		sd->nr_balance_failed = 0;
 
-	if (likely(!active_balance) || voluntary_active_balance(&env)) {
+	if (likely(active_balance == cancelled) || voluntary_active_balance(&env)) {
 		/* We were unbalanced, so reset the balancing interval */
 		sd->balance_interval = sd->min_interval;
+	} else if (active_balance == cancelled_affinity) {
+		goto out_one_pinned;
 	} else {
 		/*
 		 * If we've begun active balancing, start to back off. This
-- 
2.22.0



* [PATCH v2 3/4] sched/fair: Check for CFS tasks before detach_one_task()
  2019-08-15 14:51 [PATCH v2 0/4] sched/fair: Active balancer RT/DL preemption fix Valentin Schneider
  2019-08-15 14:51 ` [PATCH v2 1/4] sched/fair: Make need_active_balance() return bools Valentin Schneider
  2019-08-15 14:51 ` [PATCH v2 2/4] sched/fair: Move active balance logic to its own function Valentin Schneider
@ 2019-08-15 14:51 ` Valentin Schneider
  2019-08-15 14:51 ` [PATCH v2 4/4] sched/fair: Prevent active LB from preempting higher sched classes Valentin Schneider
  2019-10-01 10:29 ` [PATCH v2 0/4] sched/fair: Active balancer RT/DL preemption fix Valentin Schneider
  4 siblings, 0 replies; 14+ messages in thread
From: Valentin Schneider @ 2019-08-15 14:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, peterz, vincent.guittot, tglx, qais.yousef

detach_one_task() (and the CFS load-balancer in general) can only ever
pull CFS tasks, so we should check if there's any before invoking the
cpu stopper.

Likewise, when the cpu stopper is already up, we can bail out slightly
earlier when we see there's no CFS task to detach.

Check for CFS tasks at the top of need_active_balance(), and add a
similar check in active_load_balance_cpu_stop().

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
Note that this only touches the active load balance paths. IMO
load_balance() could go through a similar treatment
(s/rq.nr_running/rq.cfs.h_nr_running/ before going to detach_tasks())
but that conflicts with Vincent's rework, so I've not included it
here.
---
 kernel/sched/fair.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 751f41085f47..8f5f6cad5008 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8776,6 +8776,10 @@ static inline enum alb_status active_load_balance(struct lb_env *env)
 
 	raw_spin_lock_irqsave(&busiest->lock, flags);
 
+	/* Is there actually anything to pull? */
+	if (busiest->cfs.h_nr_running < 1)
+		goto unlock;
+
 	/*
 	 * Don't kick the active_load_balance_cpu_stop, if the curr task on
 	 * busiest CPU can't be moved to dst_cpu:
@@ -9142,8 +9146,8 @@ static int active_load_balance_cpu_stop(void *data)
 		     !busiest_rq->active_balance))
 		goto out_unlock;
 
-	/* Is there any task to move? */
-	if (busiest_rq->nr_running <= 1)
+	/* Is there any CFS task to move? */
+	if (busiest_rq->cfs.h_nr_running < 1)
 		goto out_unlock;
 
 	/*
-- 
2.22.0



* [PATCH v2 4/4] sched/fair: Prevent active LB from preempting higher sched classes
  2019-08-15 14:51 [PATCH v2 0/4] sched/fair: Active balancer RT/DL preemption fix Valentin Schneider
                   ` (2 preceding siblings ...)
  2019-08-15 14:51 ` [PATCH v2 3/4] sched/fair: Check for CFS tasks before detach_one_task() Valentin Schneider
@ 2019-08-15 14:51 ` Valentin Schneider
  2019-08-27 12:28   ` Vincent Guittot
  2019-10-01 10:29 ` [PATCH v2 0/4] sched/fair: Active balancer RT/DL preemption fix Valentin Schneider
  4 siblings, 1 reply; 14+ messages in thread
From: Valentin Schneider @ 2019-08-15 14:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, peterz, vincent.guittot, tglx, qais.yousef

The CFS load balancer can cause the cpu_stopper to run a function to
try and steal a remote rq's running task. However, it so happens
that while only CFS tasks will ever be migrated by that function, we
can end up preempting higher sched class tasks, since it is executed
by the cpu_stopper.

This can potentially occur whenever a rq's running task is > CFS but
the rq has runnable CFS tasks.

Check the sched class of the remote rq's running task after we've
grabbed its lock. If it's CFS, carry on, otherwise run
detach_one_task() locally since we don't need the cpu_stopper (that
!CFS task is doing the exact same thing).

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8f5f6cad5008..bf4f7f43252f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8759,6 +8759,7 @@ static inline enum alb_status active_load_balance(struct lb_env *env)
 	enum alb_status status = cancelled;
 	struct sched_domain *sd = env->sd;
 	struct rq *busiest = env->src_rq;
+	struct task_struct *p = NULL;
 	unsigned long flags;
 
 	schedstat_inc(sd->lb_failed[env->idle]);
@@ -8780,6 +8781,16 @@ static inline enum alb_status active_load_balance(struct lb_env *env)
 	if (busiest->cfs.h_nr_running < 1)
 		goto unlock;
 
+	/*
+	 * If busiest->curr isn't CFS, then there's no point in running the
+	 * cpu_stopper. We can try to nab one CFS task though, since they're
+	 * all ready to be plucked.
+	 */
+	if (busiest->curr->sched_class != &fair_sched_class) {
+		p = detach_one_task(env);
+		goto unlock;
+	}
+
 	/*
 	 * Don't kick the active_load_balance_cpu_stop, if the curr task on
 	 * busiest CPU can't be moved to dst_cpu:
@@ -8803,7 +8814,9 @@ static inline enum alb_status active_load_balance(struct lb_env *env)
 unlock:
 	raw_spin_unlock_irqrestore(&busiest->lock, flags);
 
-	if (status == started)
+	if (p)
+		attach_one_task(env->dst_rq, p);
+	else if (status == started)
 		stop_one_cpu_nowait(cpu_of(busiest),
 				    active_load_balance_cpu_stop, busiest,
 				    &busiest->active_balance_work);
-- 
2.22.0



* Re: [PATCH v2 4/4] sched/fair: Prevent active LB from preempting higher sched classes
  2019-08-15 14:51 ` [PATCH v2 4/4] sched/fair: Prevent active LB from preempting higher sched classes Valentin Schneider
@ 2019-08-27 12:28   ` Vincent Guittot
  2019-08-28  9:46     ` Valentin Schneider
  0 siblings, 1 reply; 14+ messages in thread
From: Vincent Guittot @ 2019-08-27 12:28 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Qais Yousef

On Thu, 15 Aug 2019 at 16:52, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> The CFS load balancer can cause the cpu_stopper to run a function to
> try and steal a remote rq's running task. However, it so happens
> that while only CFS tasks will ever be migrated by that function, we
> can end up preempting higher sched class tasks, since it is executed
> by the cpu_stopper.
>
> This can potentially occur whenever a rq's running task is > CFS but
> the rq has runnable CFS tasks.
>
> Check the sched class of the remote rq's running task after we've
> grabbed its lock. If it's CFS, carry on, otherwise run
> detach_one_task() locally since we don't need the cpu_stopper (that
> !CFS task is doing the exact same thing).

AFAICT, this doesn't prevent preempting a !CFS task but only reduces
the window.
As soon as you unlock, a !CFS task can preempt CFS before you start the
stop thread.

testing  busiest->cfs.h_nr_running < 1 and/or
busiest->curr->sched_class != &fair_sched_class
in need_active_balance() will do almost the same and is much simpler
than this patchset IMO.

>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
>  kernel/sched/fair.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8f5f6cad5008..bf4f7f43252f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8759,6 +8759,7 @@ static inline enum alb_status active_load_balance(struct lb_env *env)
>         enum alb_status status = cancelled;
>         struct sched_domain *sd = env->sd;
>         struct rq *busiest = env->src_rq;
> +       struct task_struct *p = NULL;
>         unsigned long flags;
>
>         schedstat_inc(sd->lb_failed[env->idle]);
> @@ -8780,6 +8781,16 @@ static inline enum alb_status active_load_balance(struct lb_env *env)
>         if (busiest->cfs.h_nr_running < 1)
>                 goto unlock;
>
> +       /*
> +        * If busiest->curr isn't CFS, then there's no point in running the
> +        * cpu_stopper. We can try to nab one CFS task though, since they're
> +        * all ready to be plucked.
> +        */
> +       if (busiest->curr->sched_class != &fair_sched_class) {
> +               p = detach_one_task(env);
> +               goto unlock;
> +       }
> +
>         /*
>          * Don't kick the active_load_balance_cpu_stop, if the curr task on
>          * busiest CPU can't be moved to dst_cpu:
> @@ -8803,7 +8814,9 @@ static inline enum alb_status active_load_balance(struct lb_env *env)
>  unlock:
>         raw_spin_unlock_irqrestore(&busiest->lock, flags);
>
> -       if (status == started)
> +       if (p)
> +               attach_one_task(env->dst_rq, p);
> +       else if (status == started)
>                 stop_one_cpu_nowait(cpu_of(busiest),
>                                     active_load_balance_cpu_stop, busiest,
>                                     &busiest->active_balance_work);
> --
> 2.22.0
>


* Re: [PATCH v2 4/4] sched/fair: Prevent active LB from preempting higher sched classes
  2019-08-27 12:28   ` Vincent Guittot
@ 2019-08-28  9:46     ` Valentin Schneider
  2019-08-29 14:19       ` Vincent Guittot
  0 siblings, 1 reply; 14+ messages in thread
From: Valentin Schneider @ 2019-08-28  9:46 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Qais Yousef

On 27/08/2019 13:28, Vincent Guittot wrote:
> On Thu, 15 Aug 2019 at 16:52, Valentin Schneider
> <valentin.schneider@arm.com> wrote:
>>
>> The CFS load balancer can cause the cpu_stopper to run a function to
>> try and steal a remote rq's running task. However, it so happens
>> that while only CFS tasks will ever be migrated by that function, we
>> can end up preempting higher sched class tasks, since it is executed
>> by the cpu_stopper.
>>
>> This can potentially occur whenever a rq's running task is > CFS but
>> the rq has runnable CFS tasks.
>>
>> Check the sched class of the remote rq's running task after we've
>> grabbed its lock. If it's CFS, carry on, otherwise run
>> detach_one_task() locally since we don't need the cpu_stopper (that
>> !CFS task is doing the exact same thing).
> 
> AFAICT, this doesn't prevent from preempting !CFS task but only reduce
> the window.
> As soon as you unlock, !CFS task can preempt CFS before you start stop thread
> 

Right, if we end up kicking the cpu_stopper this can still happen (since
we drop the lock). Thing is, you can't detect it on the cpu_stopper side,
since the currently running task is obviously not going to be CFS (and it's
too late anyway, we already preempted whatever was running there). Though
I should probably change the name of the patch to reflect that it's not a
100% cure.

I tweaked the nr_running check of the cpu_stop callback in patch 3/4 to try
to bail out early, but AFAICT that's the best we can do without big changes
elsewhere.

If we wanted to prevent those preemptions at all cost, I suppose we'd want
the equivalent of a sched class sitting between CFS and RT: have the
callback only run when there's no runnable > CFS tasks. But then by the
time we execute it we may no longer need to balance anything...

At the very least, what I'm proposing here alleviates *some* of the
preemption cases without swinging the wrecking ball too hard (and without
delaying the balancing either).

> testing  busiest->cfs.h_nr_running < 1 and/or
> busiest->curr->sched_class != &fair_sched_class
> in need_active_balance() will do almost the same and is much simpler
> than this patchset IMO.
> 

I had this initially but convinced myself out of it: since we hold no
lock in need_active_balance(), the information we get on the current task
(and, arguably, on the h_nr_running) is too volatile to be of any use.

I do believe those checks have their place in active_load_balance()'s
critical section, as that's the most accurate we're going to get. On the
plus side, if we *do* detect the remote rq's current task isn't CFS, we
can run detach_one_task() locally, which is an improvement IMO.


* Re: [PATCH v2 4/4] sched/fair: Prevent active LB from preempting higher sched classes
  2019-08-28  9:46     ` Valentin Schneider
@ 2019-08-29 14:19       ` Vincent Guittot
  2019-08-30 15:44         ` Valentin Schneider
  0 siblings, 1 reply; 14+ messages in thread
From: Vincent Guittot @ 2019-08-29 14:19 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Qais Yousef

On Wed, 28 Aug 2019 at 11:46, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> On 27/08/2019 13:28, Vincent Guittot wrote:
> > On Thu, 15 Aug 2019 at 16:52, Valentin Schneider
> > <valentin.schneider@arm.com> wrote:
> >>
> >> The CFS load balancer can cause the cpu_stopper to run a function to
> >> try and steal a remote rq's running task. However, it so happens
> >> that while only CFS tasks will ever be migrated by that function, we
> >> can end up preempting higher sched class tasks, since it is executed
> >> by the cpu_stopper.
> >>
> >> This can potentially occur whenever a rq's running task is > CFS but
> >> the rq has runnable CFS tasks.
> >>
> >> Check the sched class of the remote rq's running task after we've
> >> grabbed its lock. If it's CFS, carry on, otherwise run
> >> detach_one_task() locally since we don't need the cpu_stopper (that
> >> !CFS task is doing the exact same thing).
> >
> > AFAICT, this doesn't prevent from preempting !CFS task but only reduce
> > the window.
> > As soon as you unlock, !CFS task can preempt CFS before you start stop thread
> >
>
> Right, if we end up kicking the cpu_stopper this can still happen (since
> we drop the lock). Thing is, you can't detect it on the cpu_stopper side,
> since the currently running is obviously not going to be CFS (and it's
> too late anyway, we already preempted whatever was running there). Though
> I should probably change the name of the patch to reflect that it's not a
> 100% cure.
>
> I tweaked the nr_running check of the cpu_stop callback in patch 3/4 to try
> to bail out early, but AFAICT that's the best we can do without big changes
> elsewhere.
>
> If we wanted to prevent those preemptions at all cost, I suppose we'd want

I'm not sure that it's worth the effort and the complexity

> the equivalent of a sched class sitting between CFS and RT: have the
> callback only run when there's no runnable > CFS tasks. But then by the
> time we execute it we may no longer need to balance anything...
>
> At the very least, what I'm proposing here alleviates *some* of the
> preemption cases without swinging the wrecking ball too hard (and without
> delaying the balancing either).
>
> > testing  busiest->cfs.h_nr_running < 1 and/or
> > busiest->curr->sched_class != &fair_sched_class
> > in need_active_balance() will do almost the same and is much simpler
> > than this patchset IMO.
> >
>
> I had this initially but convinced myself out of it: since we hold no
> lock in need_active_balance(), the information we get on the current task
> (and, arguably, on the h_nr_running) is too volatile to be of any use.

But since the lock is released anyway, everything will always be too
volatile in this case.

>
> I do believe those checks have their place in active_load_balance()'s
> critical section, as that's the most accurate we're going to get. On the
> plus side, if we *do* detect the remote rq's current task isn't CFS, we
> can run detach_one_task() locally, which is an improvement IMO.

This adds complexity to the code by adding another path to detach/attach task(s).
We could simply bail out and wait for the next load balance (which is
already the case sometimes), or if you really want to detach a task, jump
back to more_balance.


* Re: [PATCH v2 4/4] sched/fair: Prevent active LB from preempting higher sched classes
  2019-08-29 14:19       ` Vincent Guittot
@ 2019-08-30 15:44         ` Valentin Schneider
  0 siblings, 0 replies; 14+ messages in thread
From: Valentin Schneider @ 2019-08-30 15:44 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Qais Yousef

On 29/08/2019 15:19, Vincent Guittot wrote:
[...]
>> Right, if we end up kicking the cpu_stopper this can still happen (since
>> we drop the lock). Thing is, you can't detect it on the cpu_stopper side,
>> since the currently running is obviously not going to be CFS (and it's
>> too late anyway, we already preempted whatever was running there). Though
>> I should probably change the name of the patch to reflect that it's not a
>> 100% cure.
>>
>> I tweaked the nr_running check of the cpu_stop callback in patch 3/4 to try
>> to bail out early, but AFAICT that's the best we can do without big changes
>> elsewhere.
>>
>> If we wanted to prevent those preemptions at all cost, I suppose we'd want
> 
> I'm not sure that it's worth the effort and the complexity
> 

My point exactly :)

[...]
>> I had this initially but convinced myself out of it: since we hold no
>> lock in need_active_balance(), the information we get on the current task
>> (and, arguably, on the h_nr_running) is too volatile to be of any use.
> 
> But since the lock is released anyway, everything will always be too
> volatile in this case.

We do release the lock if we go kick the cpu_stopper, but can nevertheless
make a decision with the most up-to-date information. I'd say it's for
similar reasons that we check busiest->curr->cpus_ptr right before
kicking the cpu_stopper rather than in need_active_balance().

The majority of the checks in need_active_balance() (all but one) depend
on env/sd stats which aren't volatile.

>>
>> I do believe those checks have their place in active_load_balance()'s
>> critical section, as that's the most accurate we're going to get. On the
>> plus side, if we *do* detect the remote rq's current task isn't CFS, we
>> can run detach_one_task() locally, which is an improvement IMO.
> 
> This add complexity in the code by adding another path to detach attach task(s).

Note that it's not a new detach/attach per se, rather it's about doing it
in active_load_balance() rather than active_load_balance_cpu_stop() in some
cases.

> We could simply bail out and wait the next load balance (which is
> already the case sometime) or if you really want to detach a task jump
> back to more_balance
> 

A simple bail-out is what I had in v1, but following Qais' comments I
figured I could add the detach_one_tasks().

Jumping back to more_balance is quite different than doing a detach_one_task().


* Re: [PATCH v2 0/4] sched/fair: Active balancer RT/DL preemption fix
  2019-08-15 14:51 [PATCH v2 0/4] sched/fair: Active balancer RT/DL preemption fix Valentin Schneider
                   ` (3 preceding siblings ...)
  2019-08-15 14:51 ` [PATCH v2 4/4] sched/fair: Prevent active LB from preempting higher sched classes Valentin Schneider
@ 2019-10-01 10:29 ` Valentin Schneider
  2019-10-01 13:31   ` Juri Lelli
  4 siblings, 1 reply; 14+ messages in thread
From: Valentin Schneider @ 2019-10-01 10:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, peterz, vincent.guittot, tglx, qais.yousef, Juri Lelli,
	Steven Rostedt

(expanded the Cc list)
RT/DL folks, any thoughts on this?

On 15/08/2019 15:51, Valentin Schneider wrote:
> Vincent's load balance rework [1] got me thinking about how and where we
> use rq.nr_running vs rq.cfs.h_nr_running checks, and this led me to
> stare intently at the active load balancer.
> 
> I haven't seen it happen (yet), but from reading the code it really looks
> like we can have some scenarios where the cpu_stopper ends up preempting
> a > CFS class task, since we never actually look at what's the remote rq's
> running task.
> 
> This series shuffles things around the CFS active load balancer to prevent
> this from happening.
> 
> - Patch 1 is a freebie cleanup
> - Patch 2 is a preparatory code move
> - Patch 3 adds h_nr_running checks
> - Patch 4 adds a sched class check + detach_one_task() to the active balance
> 
> This is based on top of today's tip/sched/core:
>   a46d14eca7b7 ("sched/fair: Use rq_lock/unlock in online_fair_sched_group")
> 
> v1 -> v2:
>   - (new patch) Added need_active_balance() cleanup
> 
>   - Tweaked active balance code move to respect existing
>     sd->nr_balance_failed modifications
>   - Added explicit checks of active_load_balance()'s return value
>   
>   - Added an h_nr_running < 1 check before kicking the cpu_stopper
> 
>   - Added a detach_one_task() call in active_load_balance() when the remote
>     rq's running task is > CFS
> 
> [1]: https://lore.kernel.org/lkml/1564670424-26023-1-git-send-email-vincent.guittot@linaro.org/
> 
> Valentin Schneider (4):
>   sched/fair: Make need_active_balance() return bools
>   sched/fair: Move active balance logic to its own function
>   sched/fair: Check for CFS tasks before detach_one_task()
>   sched/fair: Prevent active LB from preempting higher sched classes
> 
>  kernel/sched/fair.c | 151 ++++++++++++++++++++++++++++----------------
>  1 file changed, 95 insertions(+), 56 deletions(-)
> 
> --
> 2.22.0
> 


* Re: [PATCH v2 2/4] sched/fair: Move active balance logic to its own function
  2019-08-15 14:51 ` [PATCH v2 2/4] sched/fair: Move active balance logic to its own function Valentin Schneider
@ 2019-10-01 11:36   ` Srikar Dronamraju
  2019-10-01 11:48     ` Valentin Schneider
  0 siblings, 1 reply; 14+ messages in thread
From: Srikar Dronamraju @ 2019-10-01 11:36 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, mingo, peterz, vincent.guittot, tglx, qais.yousef

> +unlock:
> +	raw_spin_unlock_irqrestore(&busiest->lock, flags);
> +
> +	if (status == started)
> +		stop_one_cpu_nowait(cpu_of(busiest),
> +				    active_load_balance_cpu_stop, busiest,
> +				    &busiest->active_balance_work);
> +
> +	/* We've kicked active balancing, force task migration. */
> +	if (status != cancelled_affinity)
> +		sd->nr_balance_failed = sd->cache_nice_tries + 1;

Should we really update nr_balance_failed if the status is cancelled?
I do understand this behaviour was present even before this change, but
I still don't understand why we need to update it if the current
operation didn't kick active_load_balance.

-- 
Thanks and Regards
Srikar Dronamraju



* Re: [PATCH v2 2/4] sched/fair: Move active balance logic to its own function
  2019-10-01 11:36   ` Srikar Dronamraju
@ 2019-10-01 11:48     ` Valentin Schneider
  0 siblings, 0 replies; 14+ messages in thread
From: Valentin Schneider @ 2019-10-01 11:48 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: linux-kernel, mingo, peterz, vincent.guittot, tglx, qais.yousef

On 01/10/2019 12:36, Srikar Dronamraju wrote:
>> +unlock:
>> +	raw_spin_unlock_irqrestore(&busiest->lock, flags);
>> +
>> +	if (status == started)
>> +		stop_one_cpu_nowait(cpu_of(busiest),
>> +				    active_load_balance_cpu_stop, busiest,
>> +				    &busiest->active_balance_work);
>> +
>> +	/* We've kicked active balancing, force task migration. */
>> +	if (status != cancelled_affinity)
>> +		sd->nr_balance_failed = sd->cache_nice_tries + 1;
> 
> Should we really update nr_balance_failed if the status is cancelled?
> I do understand this behaviour was present even before this change, but
> I still don't understand why we need to update it if the current
> operation didn't kick active_load_balance.
> 

Agreed, I kept it as is to keep this as pure a code movement as possible,
but I don't see why this wouldn't be valid
(PoV of the current code):

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1fac444a4831..59f9e3583482 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9023,10 +9023,10 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 				stop_one_cpu_nowait(cpu_of(busiest),
 					active_load_balance_cpu_stop, busiest,
 					&busiest->active_balance_work);
-			}
 
-			/* We've kicked active balancing, force task migration. */
-			sd->nr_balance_failed = sd->cache_nice_tries+1;
+				/* We've kicked active balancing, force task migration. */
+				sd->nr_balance_failed = sd->cache_nice_tries+1;
+			}
 		}
 	} else
 		sd->nr_balance_failed = 0;
---

Or even better, fold it in active_load_balance_cpu_stop(). I could add that
after the move.


* Re: [PATCH v2 0/4] sched/fair: Active balancer RT/DL preemption fix
  2019-10-01 10:29 ` [PATCH v2 0/4] sched/fair: Active balancer RT/DL preemption fix Valentin Schneider
@ 2019-10-01 13:31   ` Juri Lelli
  2019-10-01 14:15     ` Valentin Schneider
  0 siblings, 1 reply; 14+ messages in thread
From: Juri Lelli @ 2019-10-01 13:31 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, mingo, peterz, vincent.guittot, tglx, qais.yousef,
	Steven Rostedt

Hi Valentin,

On 01/10/19 11:29, Valentin Schneider wrote:
> (expanded the Cc list)
> RT/DL folks, any thought on the thing?

Even though I like your idea and it looks like the theoretically right
thing to do, I'm not sure we want it in practice if it adds complexity
to CFS.

I have personally never noticed this kind of interference from CFS, but,
at the same time, for RT we usually prefer to be safe rather than sorry.
However, since this doesn't seem to be bullet-proof (as you also say), I
guess it all boils down again to complexity vs. practical benefits.

Best,

Juri



* Re: [PATCH v2 0/4] sched/fair: Active balancer RT/DL preemption fix
  2019-10-01 13:31   ` Juri Lelli
@ 2019-10-01 14:15     ` Valentin Schneider
  0 siblings, 0 replies; 14+ messages in thread
From: Valentin Schneider @ 2019-10-01 14:15 UTC (permalink / raw)
  To: Juri Lelli
  Cc: linux-kernel, mingo, peterz, vincent.guittot, tglx, qais.yousef,
	Steven Rostedt

Hi Juri,

On 01/10/2019 14:31, Juri Lelli wrote:
> Hi Valentin,
> 
> On 01/10/19 11:29, Valentin Schneider wrote:
>> (expanded the Cc list)
>> RT/DL folks, any thought on the thing?
> 
> Even though I like your idea and it looks like the theoretically right
> thing to do, I'm not sure we want it in practice if it adds complexity
> to CFS.
> 
> I have personally never noticed this kind of interference from CFS, but,
> at the same time, for RT we usually prefer to be safe rather than sorry.
> However, since this doesn't seem to be bullet-proof (as you also say), I
> guess it all boils down again to complexity vs. practical benefits.
> 

Thanks for having a look.

IMO the worst part is the local detach_one_task() thing; I added that
after v1 following Qais' comments, but perhaps it doesn't gain us much.

I'll try to cook something up with rt-app and see if I can get sensible
numbers.

> Best,
> 
> Juri
> 


end of thread, other threads:[~2019-10-01 14:15 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-15 14:51 [PATCH v2 0/4] sched/fair: Active balancer RT/DL preemption fix Valentin Schneider
2019-08-15 14:51 ` [PATCH v2 1/4] sched/fair: Make need_active_balance() return bools Valentin Schneider
2019-08-15 14:51 ` [PATCH v2 2/4] sched/fair: Move active balance logic to its own function Valentin Schneider
2019-10-01 11:36   ` Srikar Dronamraju
2019-10-01 11:48     ` Valentin Schneider
2019-08-15 14:51 ` [PATCH v2 3/4] sched/fair: Check for CFS tasks before detach_one_task() Valentin Schneider
2019-08-15 14:51 ` [PATCH v2 4/4] sched/fair: Prevent active LB from preempting higher sched classes Valentin Schneider
2019-08-27 12:28   ` Vincent Guittot
2019-08-28  9:46     ` Valentin Schneider
2019-08-29 14:19       ` Vincent Guittot
2019-08-30 15:44         ` Valentin Schneider
2019-10-01 10:29 ` [PATCH v2 0/4] sched/fair: Active balancer RT/DL preemption fix Valentin Schneider
2019-10-01 13:31   ` Juri Lelli
2019-10-01 14:15     ` Valentin Schneider
