linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH v8] sched/fair: select idle cpu from idle cpumask for task wakeup
@ 2020-12-10  1:43 Aubrey Li
  2020-12-11 15:07 ` Vincent Guittot
  0 siblings, 1 reply; 10+ messages in thread
From: Aubrey Li @ 2020-12-10  1:43 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, mgorman,
	valentin.schneider, qais.yousef, dietmar.eggemann, rostedt,
	bsegall
  Cc: tim.c.chen, linux-kernel, Aubrey Li, Mel Gorman, Jiang Biao

Add idle cpumask to track idle cpus in sched domain. Every time
a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
target. And if the CPU is not in idle, the CPU is cleared in idle
cpumask during scheduler tick to ratelimit idle cpumask update.

When a task wakes up to select an idle cpu, scanning idle cpumask
has lower cost than scanning all the cpus in last level cache domain,
especially when the system is heavily loaded.

Benchmarks including hackbench, schbench, uperf, sysbench mysql and
kbuild have been tested on a x86 4 socket system with 24 cores per
socket and 2 hyperthreads per core, total 192 CPUs, no regression
found.

v7->v8:
- refine update_idle_cpumask, no functionality change
- fix a suspicious RCU usage warning with CONFIG_PROVE_RCU=y

v6->v7:
- place the whole idle cpumask mechanism under CONFIG_SMP

v5->v6:
- decouple idle cpumask update from stop_tick signal, set idle CPU
  in idle cpumask every time the CPU enters idle

v4->v5:
- add update_idle_cpumask for s2idle case
- keep the same ordering of tick_nohz_idle_stop_tick() and update_
  idle_cpumask() everywhere

v3->v4:
- change setting idle cpumask from every idle entry to tickless idle
  if cpu driver is available
- move clearing idle cpumask to scheduler_tick to decouple nohz mode

v2->v3:
- change setting idle cpumask to every idle entry, otherwise schbench
  has a regression of 99th percentile latency
- change clearing idle cpumask to nohz_balancer_kick(), so updating
  idle cpumask is ratelimited in the idle exiting path
- set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target

v1->v2:
- idle cpumask is updated in the nohz routines, by initializing idle
  cpumask with sched_domain_span(sd), nohz=off case remains the original
  behavior

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Qais Yousef <qais.yousef@arm.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Jiang Biao <benbjiang@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
---
 include/linux/sched/topology.h | 13 ++++++++++
 kernel/sched/core.c            |  2 ++
 kernel/sched/fair.c            | 45 +++++++++++++++++++++++++++++++++-
 kernel/sched/idle.c            |  5 ++++
 kernel/sched/sched.h           |  4 +++
 kernel/sched/topology.c        |  3 ++-
 6 files changed, 70 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 820511289857..b47b85163607 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,21 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+	/*
+	 * Span of all idle CPUs in this domain.
+	 *
+	 * NOTE: this field is variable length. (Allocated dynamically
+	 * by attaching extra space to the end of the structure,
+	 * depending on how many CPUs the kernel has booted up with)
+	 */
+	unsigned long	idle_cpus_span[];
 };
 
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+	return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
 	/* These fields must be setup */
 	struct sched_domain __rcu *parent;	/* top domain must be null terminated */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c4da7e17b906..b136e2440ea4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4011,6 +4011,7 @@ void scheduler_tick(void)
 
 #ifdef CONFIG_SMP
 	rq->idle_balance = idle_cpu(cpu);
+	update_idle_cpumask(cpu, rq->idle_balance);
 	trigger_load_balance(rq);
 #endif
 }
@@ -7186,6 +7187,7 @@ void __init sched_init(void)
 		rq->idle_stamp = 0;
 		rq->avg_idle = 2*sysctl_sched_migration_cost;
 		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
+		rq->last_idle_state = 1;
 
 		INIT_LIST_HEAD(&rq->cfs_tasks);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c0c4d9ad7da8..25f36ecfee54 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6146,7 +6146,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
 	time = cpu_clock(this);
 
-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+	/*
+	 * sched_domain_shared is set only at shared cache level,
+	 * this works only because select_idle_cpu is called with
+	 * sd_llc.
+	 */
+	cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
 	for_each_cpu_wrap(cpu, cpus, target) {
 		if (!--nr)
@@ -6806,6 +6811,44 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	return newidle_balance(rq, rf) != 0;
 }
+
+/*
+ * Update cpu idle state and record this information
+ * in sd_llc_shared->idle_cpus_span.
+ *
+ * This function is called with interrupts disabled.
+ */
+void update_idle_cpumask(int cpu, bool idle)
+{
+	struct sched_domain *sd;
+	struct rq *rq = cpu_rq(cpu);
+	int idle_state;
+
+	/*
+	 * Also set SCHED_IDLE cpu in idle cpumask to
+	 * allow SCHED_IDLE cpu as a wakeup target.
+	 */
+	idle_state = idle || sched_idle_cpu(cpu);
+	/*
+	 * No need to update idle cpumask if the state
+	 * does not change.
+	 */
+	if (rq->last_idle_state == idle_state)
+		return;
+	/*
+	 * Called with irq disabled, rcu protection is not needed.
+	 */
+	sd = per_cpu(sd_llc, cpu);
+	if (unlikely(!sd))
+		return;
+
+	if (idle_state)
+		cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
+	else
+		cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
+
+	rq->last_idle_state = idle_state;
+}
 #endif /* CONFIG_SMP */
 
 static unsigned long wakeup_gran(struct sched_entity *se)
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index f324dc36fc43..2c517d6a061a 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -257,6 +257,11 @@ static void do_idle(void)
 			cpuhp_report_idle_dead();
 			arch_cpu_idle_dead();
 		}
+		/*
+		 * The CPU is about to go idle, set it in idle cpumask
+		 * to be a wake up target.
+		 */
+		update_idle_cpumask(cpu, true);
 
 		arch_cpu_idle_enter();
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8d1ca65db3b0..4041d5a10de5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -976,6 +976,7 @@ struct rq {
 
 	unsigned char		nohz_idle_balance;
 	unsigned char		idle_balance;
+	unsigned char		last_idle_state;
 
 	unsigned long		misfit_task_load;
 
@@ -1516,6 +1517,8 @@ static inline unsigned int group_first_cpu(struct sched_group *group)
 
 extern int group_balance_cpu(struct sched_group *sg);
 
+void update_idle_cpumask(int cpu, bool idle);
+
 #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL)
 void register_sched_domain_sysctl(void);
 void dirty_sched_domain_sysctl(int cpu);
@@ -1536,6 +1539,7 @@ extern void flush_smp_call_function_from_idle(void);
 
 #else /* !CONFIG_SMP: */
 static inline void flush_smp_call_function_from_idle(void) { }
+static inline void update_idle_cpumask(int cpu, bool idle) { }
 #endif
 
 #include "stats.h"
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 1bd7e3af904f..541bd3a7de4d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1407,6 +1407,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
 		atomic_inc(&sd->shared->ref);
 		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+		cpumask_copy(sds_idle_cpus(sd->shared), sched_domain_span(sd));
 	}
 
 	sd->private = sdd;
@@ -1769,7 +1770,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 
 			*per_cpu_ptr(sdd->sd, j) = sd;
 
-			sds = kzalloc_node(sizeof(struct sched_domain_shared),
+			sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
 					GFP_KERNEL, cpu_to_node(j));
 			if (!sds)
 				return -ENOMEM;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH v8] sched/fair: select idle cpu from idle cpumask for task wakeup
  2020-12-10  1:43 [RFC PATCH v8] sched/fair: select idle cpu from idle cpumask for task wakeup Aubrey Li
@ 2020-12-11 15:07 ` Vincent Guittot
  2020-12-11 15:18   ` Li, Aubrey
  2021-03-04 13:51   ` Li, Aubrey
  0 siblings, 2 replies; 10+ messages in thread
From: Vincent Guittot @ 2020-12-11 15:07 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Mel Gorman,
	Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, linux-kernel, Mel Gorman,
	Jiang Biao

On Thu, 10 Dec 2020 at 02:44, Aubrey Li <aubrey.li@linux.intel.com> wrote:
>
> Add idle cpumask to track idle cpus in sched domain. Every time
> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
> target. And if the CPU is not in idle, the CPU is cleared in idle
> cpumask during scheduler tick to ratelimit idle cpumask update.
>
> When a task wakes up to select an idle cpu, scanning idle cpumask
> has lower cost than scanning all the cpus in last level cache domain,
> especially when the system is heavily loaded.
>
> Benchmarks including hackbench, schbench, uperf, sysbench mysql and
> kbuild have been tested on a x86 4 socket system with 24 cores per
> socket and 2 hyperthreads per core, total 192 CPUs, no regression
> found.
>
> v7->v8:
> - refine update_idle_cpumask, no functionality change
> - fix a suspicious RCU usage warning with CONFIG_PROVE_RCU=y
>
> v6->v7:
> - place the whole idle cpumask mechanism under CONFIG_SMP
>
> v5->v6:
> - decouple idle cpumask update from stop_tick signal, set idle CPU
>   in idle cpumask every time the CPU enters idle
>
> v4->v5:
> - add update_idle_cpumask for s2idle case
> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
>   idle_cpumask() everywhere
>
> v3->v4:
> - change setting idle cpumask from every idle entry to tickless idle
>   if cpu driver is available
> - move clearing idle cpumask to scheduler_tick to decouple nohz mode
>
> v2->v3:
> - change setting idle cpumask to every idle entry, otherwise schbench
>   has a regression of 99th percentile latency
> - change clearing idle cpumask to nohz_balancer_kick(), so updating
>   idle cpumask is ratelimited in the idle exiting path
> - set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target
>
> v1->v2:
> - idle cpumask is updated in the nohz routines, by initializing idle
>   cpumask with sched_domain_span(sd), nohz=off case remains the original
>   behavior
>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Qais Yousef <qais.yousef@arm.com>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Jiang Biao <benbjiang@gmail.com>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>

This version looks good to me. I don't see regressions of v5 anymore
and see some improvements on heavy cases

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---
>  include/linux/sched/topology.h | 13 ++++++++++
>  kernel/sched/core.c            |  2 ++
>  kernel/sched/fair.c            | 45 +++++++++++++++++++++++++++++++++-
>  kernel/sched/idle.c            |  5 ++++
>  kernel/sched/sched.h           |  4 +++
>  kernel/sched/topology.c        |  3 ++-
>  6 files changed, 70 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 820511289857..b47b85163607 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>         atomic_t        ref;
>         atomic_t        nr_busy_cpus;
>         int             has_idle_cores;
> +       /*
> +        * Span of all idle CPUs in this domain.
> +        *
> +        * NOTE: this field is variable length. (Allocated dynamically
> +        * by attaching extra space to the end of the structure,
> +        * depending on how many CPUs the kernel has booted up with)
> +        */
> +       unsigned long   idle_cpus_span[];
>  };
>
> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
> +{
> +       return to_cpumask(sds->idle_cpus_span);
> +}
> +
>  struct sched_domain {
>         /* These fields must be setup */
>         struct sched_domain __rcu *parent;      /* top domain must be null terminated */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c4da7e17b906..b136e2440ea4 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4011,6 +4011,7 @@ void scheduler_tick(void)
>
>  #ifdef CONFIG_SMP
>         rq->idle_balance = idle_cpu(cpu);
> +       update_idle_cpumask(cpu, rq->idle_balance);
>         trigger_load_balance(rq);
>  #endif
>  }
> @@ -7186,6 +7187,7 @@ void __init sched_init(void)
>                 rq->idle_stamp = 0;
>                 rq->avg_idle = 2*sysctl_sched_migration_cost;
>                 rq->max_idle_balance_cost = sysctl_sched_migration_cost;
> +               rq->last_idle_state = 1;
>
>                 INIT_LIST_HEAD(&rq->cfs_tasks);
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c0c4d9ad7da8..25f36ecfee54 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6146,7 +6146,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>
>         time = cpu_clock(this);
>
> -       cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +       /*
> +        * sched_domain_shared is set only at shared cache level,
> +        * this works only because select_idle_cpu is called with
> +        * sd_llc.
> +        */
> +       cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
>
>         for_each_cpu_wrap(cpu, cpus, target) {
>                 if (!--nr)
> @@ -6806,6 +6811,44 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>
>         return newidle_balance(rq, rf) != 0;
>  }
> +
> +/*
> + * Update cpu idle state and record this information
> + * in sd_llc_shared->idle_cpus_span.
> + *
> + * This function is called with interrupts disabled.
> + */
> +void update_idle_cpumask(int cpu, bool idle)
> +{
> +       struct sched_domain *sd;
> +       struct rq *rq = cpu_rq(cpu);
> +       int idle_state;
> +
> +       /*
> +        * Also set SCHED_IDLE cpu in idle cpumask to
> +        * allow SCHED_IDLE cpu as a wakeup target.
> +        */
> +       idle_state = idle || sched_idle_cpu(cpu);
> +       /*
> +        * No need to update idle cpumask if the state
> +        * does not change.
> +        */
> +       if (rq->last_idle_state == idle_state)
> +               return;
> +       /*
> +        * Called with irq disabled, rcu protection is not needed.
> +        */
> +       sd = per_cpu(sd_llc, cpu);
> +       if (unlikely(!sd))
> +               return;
> +
> +       if (idle_state)
> +               cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> +       else
> +               cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
> +
> +       rq->last_idle_state = idle_state;
> +}
>  #endif /* CONFIG_SMP */
>
>  static unsigned long wakeup_gran(struct sched_entity *se)
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index f324dc36fc43..2c517d6a061a 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -257,6 +257,11 @@ static void do_idle(void)
>                         cpuhp_report_idle_dead();
>                         arch_cpu_idle_dead();
>                 }
> +               /*
> +                * The CPU is about to go idle, set it in idle cpumask
> +                * to be a wake up target.
> +                */
> +               update_idle_cpumask(cpu, true);
>
>                 arch_cpu_idle_enter();
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 8d1ca65db3b0..4041d5a10de5 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -976,6 +976,7 @@ struct rq {
>
>         unsigned char           nohz_idle_balance;
>         unsigned char           idle_balance;
> +       unsigned char           last_idle_state;
>
>         unsigned long           misfit_task_load;
>
> @@ -1516,6 +1517,8 @@ static inline unsigned int group_first_cpu(struct sched_group *group)
>
>  extern int group_balance_cpu(struct sched_group *sg);
>
> +void update_idle_cpumask(int cpu, bool idle);
> +
>  #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL)
>  void register_sched_domain_sysctl(void);
>  void dirty_sched_domain_sysctl(int cpu);
> @@ -1536,6 +1539,7 @@ extern void flush_smp_call_function_from_idle(void);
>
>  #else /* !CONFIG_SMP: */
>  static inline void flush_smp_call_function_from_idle(void) { }
> +static inline void update_idle_cpumask(int cpu, bool idle) { }
>  #endif
>
>  #include "stats.h"
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 1bd7e3af904f..541bd3a7de4d 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1407,6 +1407,7 @@ sd_init(struct sched_domain_topology_level *tl,
>                 sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
>                 atomic_inc(&sd->shared->ref);
>                 atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> +               cpumask_copy(sds_idle_cpus(sd->shared), sched_domain_span(sd));
>         }
>
>         sd->private = sdd;
> @@ -1769,7 +1770,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
>
>                         *per_cpu_ptr(sdd->sd, j) = sd;
>
> -                       sds = kzalloc_node(sizeof(struct sched_domain_shared),
> +                       sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
>                                         GFP_KERNEL, cpu_to_node(j));
>                         if (!sds)
>                                 return -ENOMEM;
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH v8] sched/fair: select idle cpu from idle cpumask for task wakeup
  2020-12-11 15:07 ` Vincent Guittot
@ 2020-12-11 15:18   ` Li, Aubrey
  2020-12-11 15:22     ` Vincent Guittot
  2021-03-04 13:51   ` Li, Aubrey
  1 sibling, 1 reply; 10+ messages in thread
From: Li, Aubrey @ 2020-12-11 15:18 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Mel Gorman,
	Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, linux-kernel, Mel Gorman,
	Jiang Biao

On 2020/12/11 23:07, Vincent Guittot wrote:
> On Thu, 10 Dec 2020 at 02:44, Aubrey Li <aubrey.li@linux.intel.com> wrote:
>>
>> Add idle cpumask to track idle cpus in sched domain. Every time
>> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
>> target. And if the CPU is not in idle, the CPU is cleared in idle
>> cpumask during scheduler tick to ratelimit idle cpumask update.
>>
>> When a task wakes up to select an idle cpu, scanning idle cpumask
>> has lower cost than scanning all the cpus in last level cache domain,
>> especially when the system is heavily loaded.
>>
>> Benchmarks including hackbench, schbench, uperf, sysbench mysql and
>> kbuild have been tested on a x86 4 socket system with 24 cores per
>> socket and 2 hyperthreads per core, total 192 CPUs, no regression
>> found.
>>
>> v7->v8:
>> - refine update_idle_cpumask, no functionality change
>> - fix a suspicious RCU usage warning with CONFIG_PROVE_RCU=y
>>
>> v6->v7:
>> - place the whole idle cpumask mechanism under CONFIG_SMP
>>
>> v5->v6:
>> - decouple idle cpumask update from stop_tick signal, set idle CPU
>>   in idle cpumask every time the CPU enters idle
>>
>> v4->v5:
>> - add update_idle_cpumask for s2idle case
>> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
>>   idle_cpumask() everywhere
>>
>> v3->v4:
>> - change setting idle cpumask from every idle entry to tickless idle
>>   if cpu driver is available
>> - move clearing idle cpumask to scheduler_tick to decouple nohz mode
>>
>> v2->v3:
>> - change setting idle cpumask to every idle entry, otherwise schbench
>>   has a regression of 99th percentile latency
>> - change clearing idle cpumask to nohz_balancer_kick(), so updating
>>   idle cpumask is ratelimited in the idle exiting path
>> - set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target
>>
>> v1->v2:
>> - idle cpumask is updated in the nohz routines, by initializing idle
>>   cpumask with sched_domain_span(sd), nohz=off case remains the original
>>   behavior
>>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>> Cc: Qais Yousef <qais.yousef@arm.com>
>> Cc: Valentin Schneider <valentin.schneider@arm.com>
>> Cc: Jiang Biao <benbjiang@gmail.com>
>> Cc: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> 
> This version looks good to me. I don't see regressions of v5 anymore
> and see some improvements on heavy cases

v5 or v8?

> 
> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> 
>> ---
>>  include/linux/sched/topology.h | 13 ++++++++++
>>  kernel/sched/core.c            |  2 ++
>>  kernel/sched/fair.c            | 45 +++++++++++++++++++++++++++++++++-
>>  kernel/sched/idle.c            |  5 ++++
>>  kernel/sched/sched.h           |  4 +++
>>  kernel/sched/topology.c        |  3 ++-
>>  6 files changed, 70 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index 820511289857..b47b85163607 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>>         atomic_t        ref;
>>         atomic_t        nr_busy_cpus;
>>         int             has_idle_cores;
>> +       /*
>> +        * Span of all idle CPUs in this domain.
>> +        *
>> +        * NOTE: this field is variable length. (Allocated dynamically
>> +        * by attaching extra space to the end of the structure,
>> +        * depending on how many CPUs the kernel has booted up with)
>> +        */
>> +       unsigned long   idle_cpus_span[];
>>  };
>>
>> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
>> +{
>> +       return to_cpumask(sds->idle_cpus_span);
>> +}
>> +
>>  struct sched_domain {
>>         /* These fields must be setup */
>>         struct sched_domain __rcu *parent;      /* top domain must be null terminated */
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index c4da7e17b906..b136e2440ea4 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -4011,6 +4011,7 @@ void scheduler_tick(void)
>>
>>  #ifdef CONFIG_SMP
>>         rq->idle_balance = idle_cpu(cpu);
>> +       update_idle_cpumask(cpu, rq->idle_balance);
>>         trigger_load_balance(rq);
>>  #endif
>>  }
>> @@ -7186,6 +7187,7 @@ void __init sched_init(void)
>>                 rq->idle_stamp = 0;
>>                 rq->avg_idle = 2*sysctl_sched_migration_cost;
>>                 rq->max_idle_balance_cost = sysctl_sched_migration_cost;
>> +               rq->last_idle_state = 1;
>>
>>                 INIT_LIST_HEAD(&rq->cfs_tasks);
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index c0c4d9ad7da8..25f36ecfee54 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6146,7 +6146,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>
>>         time = cpu_clock(this);
>>
>> -       cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>> +       /*
>> +        * sched_domain_shared is set only at shared cache level,
>> +        * this works only because select_idle_cpu is called with
>> +        * sd_llc.
>> +        */
>> +       cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
>>
>>         for_each_cpu_wrap(cpu, cpus, target) {
>>                 if (!--nr)
>> @@ -6806,6 +6811,44 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>>
>>         return newidle_balance(rq, rf) != 0;
>>  }
>> +
>> +/*
>> + * Update cpu idle state and record this information
>> + * in sd_llc_shared->idle_cpus_span.
>> + *
>> + * This function is called with interrupts disabled.
>> + */
>> +void update_idle_cpumask(int cpu, bool idle)
>> +{
>> +       struct sched_domain *sd;
>> +       struct rq *rq = cpu_rq(cpu);
>> +       int idle_state;
>> +
>> +       /*
>> +        * Also set SCHED_IDLE cpu in idle cpumask to
>> +        * allow SCHED_IDLE cpu as a wakeup target.
>> +        */
>> +       idle_state = idle || sched_idle_cpu(cpu);
>> +       /*
>> +        * No need to update idle cpumask if the state
>> +        * does not change.
>> +        */
>> +       if (rq->last_idle_state == idle_state)
>> +               return;
>> +       /*
>> +        * Called with irq disabled, rcu protection is not needed.
>> +        */
>> +       sd = per_cpu(sd_llc, cpu);
>> +       if (unlikely(!sd))
>> +               return;
>> +
>> +       if (idle_state)
>> +               cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
>> +       else
>> +               cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
>> +
>> +       rq->last_idle_state = idle_state;
>> +}
>>  #endif /* CONFIG_SMP */
>>
>>  static unsigned long wakeup_gran(struct sched_entity *se)
>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>> index f324dc36fc43..2c517d6a061a 100644
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -257,6 +257,11 @@ static void do_idle(void)
>>                         cpuhp_report_idle_dead();
>>                         arch_cpu_idle_dead();
>>                 }
>> +               /*
>> +                * The CPU is about to go idle, set it in idle cpumask
>> +                * to be a wake up target.
>> +                */
>> +               update_idle_cpumask(cpu, true);
>>
>>                 arch_cpu_idle_enter();
>>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 8d1ca65db3b0..4041d5a10de5 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -976,6 +976,7 @@ struct rq {
>>
>>         unsigned char           nohz_idle_balance;
>>         unsigned char           idle_balance;
>> +       unsigned char           last_idle_state;
>>
>>         unsigned long           misfit_task_load;
>>
>> @@ -1516,6 +1517,8 @@ static inline unsigned int group_first_cpu(struct sched_group *group)
>>
>>  extern int group_balance_cpu(struct sched_group *sg);
>>
>> +void update_idle_cpumask(int cpu, bool idle);
>> +
>>  #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL)
>>  void register_sched_domain_sysctl(void);
>>  void dirty_sched_domain_sysctl(int cpu);
>> @@ -1536,6 +1539,7 @@ extern void flush_smp_call_function_from_idle(void);
>>
>>  #else /* !CONFIG_SMP: */
>>  static inline void flush_smp_call_function_from_idle(void) { }
>> +static inline void update_idle_cpumask(int cpu, bool idle) { }
>>  #endif
>>
>>  #include "stats.h"
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 1bd7e3af904f..541bd3a7de4d 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -1407,6 +1407,7 @@ sd_init(struct sched_domain_topology_level *tl,
>>                 sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
>>                 atomic_inc(&sd->shared->ref);
>>                 atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
>> +               cpumask_copy(sds_idle_cpus(sd->shared), sched_domain_span(sd));
>>         }
>>
>>         sd->private = sdd;
>> @@ -1769,7 +1770,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
>>
>>                         *per_cpu_ptr(sdd->sd, j) = sd;
>>
>> -                       sds = kzalloc_node(sizeof(struct sched_domain_shared),
>> +                       sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
>>                                         GFP_KERNEL, cpu_to_node(j));
>>                         if (!sds)
>>                                 return -ENOMEM;
>> --
>> 2.25.1
>>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH v8] sched/fair: select idle cpu from idle cpumask for task wakeup
  2020-12-11 15:18   ` Li, Aubrey
@ 2020-12-11 15:22     ` Vincent Guittot
  2020-12-11 15:24       ` Li, Aubrey
  0 siblings, 1 reply; 10+ messages in thread
From: Vincent Guittot @ 2020-12-11 15:22 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Mel Gorman,
	Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, linux-kernel, Mel Gorman,
	Jiang Biao

On Fri, 11 Dec 2020 at 16:19, Li, Aubrey <aubrey.li@linux.intel.com> wrote:
>
> On 2020/12/11 23:07, Vincent Guittot wrote:
> > On Thu, 10 Dec 2020 at 02:44, Aubrey Li <aubrey.li@linux.intel.com> wrote:
> >>
> >> Add idle cpumask to track idle cpus in sched domain. Every time
> >> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
> >> target. And if the CPU is not in idle, the CPU is cleared in idle
> >> cpumask during scheduler tick to ratelimit idle cpumask update.
> >>
> >> When a task wakes up to select an idle cpu, scanning idle cpumask
> >> has lower cost than scanning all the cpus in last level cache domain,
> >> especially when the system is heavily loaded.
> >>
> >> Benchmarks including hackbench, schbench, uperf, sysbench mysql and
> >> kbuild have been tested on a x86 4 socket system with 24 cores per
> >> socket and 2 hyperthreads per core, total 192 CPUs, no regression
> >> found.
> >>
> >> v7->v8:
> >> - refine update_idle_cpumask, no functionality change
> >> - fix a suspicious RCU usage warning with CONFIG_PROVE_RCU=y
> >>
> >> v6->v7:
> >> - place the whole idle cpumask mechanism under CONFIG_SMP
> >>
> >> v5->v6:
> >> - decouple idle cpumask update from stop_tick signal, set idle CPU
> >>   in idle cpumask every time the CPU enters idle
> >>
> >> v4->v5:
> >> - add update_idle_cpumask for s2idle case
> >> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
> >>   idle_cpumask() everywhere
> >>
> >> v3->v4:
> >> - change setting idle cpumask from every idle entry to tickless idle
> >>   if cpu driver is available
> >> - move clearing idle cpumask to scheduler_tick to decouple nohz mode
> >>
> >> v2->v3:
> >> - change setting idle cpumask to every idle entry, otherwise schbench
> >>   has a regression of 99th percentile latency
> >> - change clearing idle cpumask to nohz_balancer_kick(), so updating
> >>   idle cpumask is ratelimited in the idle exiting path
> >> - set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target
> >>
> >> v1->v2:
> >> - idle cpumask is updated in the nohz routines, by initializing idle
> >>   cpumask with sched_domain_span(sd), nohz=off case remains the original
> >>   behavior
> >>
> >> Cc: Peter Zijlstra <peterz@infradead.org>
> >> Cc: Mel Gorman <mgorman@suse.de>
> >> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> >> Cc: Qais Yousef <qais.yousef@arm.com>
> >> Cc: Valentin Schneider <valentin.schneider@arm.com>
> >> Cc: Jiang Biao <benbjiang@gmail.com>
> >> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> >> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> >
> > This version looks good to me. I don't see regressions of v5 anymore
> > and see some improvements on heavy cases
>
> v5 or v8?

the v8 looks good to me and I don't see the regressions that I have
seen with the v5 anymore


>
> >
> > Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> >
> >> ---
> >>  include/linux/sched/topology.h | 13 ++++++++++
> >>  kernel/sched/core.c            |  2 ++
> >>  kernel/sched/fair.c            | 45 +++++++++++++++++++++++++++++++++-
> >>  kernel/sched/idle.c            |  5 ++++
> >>  kernel/sched/sched.h           |  4 +++
> >>  kernel/sched/topology.c        |  3 ++-
> >>  6 files changed, 70 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> >> index 820511289857..b47b85163607 100644
> >> --- a/include/linux/sched/topology.h
> >> +++ b/include/linux/sched/topology.h
> >> @@ -65,8 +65,21 @@ struct sched_domain_shared {
> >>         atomic_t        ref;
> >>         atomic_t        nr_busy_cpus;
> >>         int             has_idle_cores;
> >> +       /*
> >> +        * Span of all idle CPUs in this domain.
> >> +        *
> >> +        * NOTE: this field is variable length. (Allocated dynamically
> >> +        * by attaching extra space to the end of the structure,
> >> +        * depending on how many CPUs the kernel has booted up with)
> >> +        */
> >> +       unsigned long   idle_cpus_span[];
> >>  };
> >>
> >> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
> >> +{
> >> +       return to_cpumask(sds->idle_cpus_span);
> >> +}
> >> +
> >>  struct sched_domain {
> >>         /* These fields must be setup */
> >>         struct sched_domain __rcu *parent;      /* top domain must be null terminated */
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index c4da7e17b906..b136e2440ea4 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -4011,6 +4011,7 @@ void scheduler_tick(void)
> >>
> >>  #ifdef CONFIG_SMP
> >>         rq->idle_balance = idle_cpu(cpu);
> >> +       update_idle_cpumask(cpu, rq->idle_balance);
> >>         trigger_load_balance(rq);
> >>  #endif
> >>  }
> >> @@ -7186,6 +7187,7 @@ void __init sched_init(void)
> >>                 rq->idle_stamp = 0;
> >>                 rq->avg_idle = 2*sysctl_sched_migration_cost;
> >>                 rq->max_idle_balance_cost = sysctl_sched_migration_cost;
> >> +               rq->last_idle_state = 1;
> >>
> >>                 INIT_LIST_HEAD(&rq->cfs_tasks);
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index c0c4d9ad7da8..25f36ecfee54 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -6146,7 +6146,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >>
> >>         time = cpu_clock(this);
> >>
> >> -       cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >> +       /*
> >> +        * sched_domain_shared is set only at shared cache level,
> >> +        * this works only because select_idle_cpu is called with
> >> +        * sd_llc.
> >> +        */
> >> +       cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
> >>
> >>         for_each_cpu_wrap(cpu, cpus, target) {
> >>                 if (!--nr)
> >> @@ -6806,6 +6811,44 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >>
> >>         return newidle_balance(rq, rf) != 0;
> >>  }
> >> +
> >> +/*
> >> + * Update cpu idle state and record this information
> >> + * in sd_llc_shared->idle_cpus_span.
> >> + *
> >> + * This function is called with interrupts disabled.
> >> + */
> >> +void update_idle_cpumask(int cpu, bool idle)
> >> +{
> >> +       struct sched_domain *sd;
> >> +       struct rq *rq = cpu_rq(cpu);
> >> +       int idle_state;
> >> +
> >> +       /*
> >> +        * Also set SCHED_IDLE cpu in idle cpumask to
> >> +        * allow SCHED_IDLE cpu as a wakeup target.
> >> +        */
> >> +       idle_state = idle || sched_idle_cpu(cpu);
> >> +       /*
> >> +        * No need to update idle cpumask if the state
> >> +        * does not change.
> >> +        */
> >> +       if (rq->last_idle_state == idle_state)
> >> +               return;
> >> +       /*
> >> +        * Called with irq disabled, rcu protection is not needed.
> >> +        */
> >> +       sd = per_cpu(sd_llc, cpu);
> >> +       if (unlikely(!sd))
> >> +               return;
> >> +
> >> +       if (idle_state)
> >> +               cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> >> +       else
> >> +               cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
> >> +
> >> +       rq->last_idle_state = idle_state;
> >> +}
> >>  #endif /* CONFIG_SMP */
> >>
> >>  static unsigned long wakeup_gran(struct sched_entity *se)
> >> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> >> index f324dc36fc43..2c517d6a061a 100644
> >> --- a/kernel/sched/idle.c
> >> +++ b/kernel/sched/idle.c
> >> @@ -257,6 +257,11 @@ static void do_idle(void)
> >>                         cpuhp_report_idle_dead();
> >>                         arch_cpu_idle_dead();
> >>                 }
> >> +               /*
> >> +                * The CPU is about to go idle, set it in idle cpumask
> >> +                * to be a wake up target.
> >> +                */
> >> +               update_idle_cpumask(cpu, true);
> >>
> >>                 arch_cpu_idle_enter();
> >>
> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >> index 8d1ca65db3b0..4041d5a10de5 100644
> >> --- a/kernel/sched/sched.h
> >> +++ b/kernel/sched/sched.h
> >> @@ -976,6 +976,7 @@ struct rq {
> >>
> >>         unsigned char           nohz_idle_balance;
> >>         unsigned char           idle_balance;
> >> +       unsigned char           last_idle_state;
> >>
> >>         unsigned long           misfit_task_load;
> >>
> >> @@ -1516,6 +1517,8 @@ static inline unsigned int group_first_cpu(struct sched_group *group)
> >>
> >>  extern int group_balance_cpu(struct sched_group *sg);
> >>
> >> +void update_idle_cpumask(int cpu, bool idle);
> >> +
> >>  #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL)
> >>  void register_sched_domain_sysctl(void);
> >>  void dirty_sched_domain_sysctl(int cpu);
> >> @@ -1536,6 +1539,7 @@ extern void flush_smp_call_function_from_idle(void);
> >>
> >>  #else /* !CONFIG_SMP: */
> >>  static inline void flush_smp_call_function_from_idle(void) { }
> >> +static inline void update_idle_cpumask(int cpu, bool idle) { }
> >>  #endif
> >>
> >>  #include "stats.h"
> >> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> >> index 1bd7e3af904f..541bd3a7de4d 100644
> >> --- a/kernel/sched/topology.c
> >> +++ b/kernel/sched/topology.c
> >> @@ -1407,6 +1407,7 @@ sd_init(struct sched_domain_topology_level *tl,
> >>                 sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> >>                 atomic_inc(&sd->shared->ref);
> >>                 atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> >> +               cpumask_copy(sds_idle_cpus(sd->shared), sched_domain_span(sd));
> >>         }
> >>
> >>         sd->private = sdd;
> >> @@ -1769,7 +1770,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
> >>
> >>                         *per_cpu_ptr(sdd->sd, j) = sd;
> >>
> >> -                       sds = kzalloc_node(sizeof(struct sched_domain_shared),
> >> +                       sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
> >>                                         GFP_KERNEL, cpu_to_node(j));
> >>                         if (!sds)
> >>                                 return -ENOMEM;
> >> --
> >> 2.25.1
> >>
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH v8] sched/fair: select idle cpu from idle cpumask for task wakeup
  2020-12-11 15:22     ` Vincent Guittot
@ 2020-12-11 15:24       ` Li, Aubrey
  2020-12-13 23:29         ` Song Bao Hua (Barry Song)
  0 siblings, 1 reply; 10+ messages in thread
From: Li, Aubrey @ 2020-12-11 15:24 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Mel Gorman,
	Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, linux-kernel, Mel Gorman,
	Jiang Biao

On 2020/12/11 23:22, Vincent Guittot wrote:
> On Fri, 11 Dec 2020 at 16:19, Li, Aubrey <aubrey.li@linux.intel.com> wrote:
>>
>> On 2020/12/11 23:07, Vincent Guittot wrote:
>>> On Thu, 10 Dec 2020 at 02:44, Aubrey Li <aubrey.li@linux.intel.com> wrote:
>>>>
>>>> Add idle cpumask to track idle cpus in sched domain. Every time
>>>> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
>>>> target. And if the CPU is not in idle, the CPU is cleared in idle
>>>> cpumask during scheduler tick to ratelimit idle cpumask update.
>>>>
>>>> When a task wakes up to select an idle cpu, scanning idle cpumask
>>>> has lower cost than scanning all the cpus in last level cache domain,
>>>> especially when the system is heavily loaded.
>>>>
>>>> Benchmarks including hackbench, schbench, uperf, sysbench mysql and
>>>> kbuild have been tested on a x86 4 socket system with 24 cores per
>>>> socket and 2 hyperthreads per core, total 192 CPUs, no regression
>>>> found.
>>>>
>>>> v7->v8:
>>>> - refine update_idle_cpumask, no functionality change
>>>> - fix a suspicious RCU usage warning with CONFIG_PROVE_RCU=y
>>>>
>>>> v6->v7:
>>>> - place the whole idle cpumask mechanism under CONFIG_SMP
>>>>
>>>> v5->v6:
>>>> - decouple idle cpumask update from stop_tick signal, set idle CPU
>>>>   in idle cpumask every time the CPU enters idle
>>>>
>>>> v4->v5:
>>>> - add update_idle_cpumask for s2idle case
>>>> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
>>>>   idle_cpumask() everywhere
>>>>
>>>> v3->v4:
>>>> - change setting idle cpumask from every idle entry to tickless idle
>>>>   if cpu driver is available
>>>> - move clearing idle cpumask to scheduler_tick to decouple nohz mode
>>>>
>>>> v2->v3:
>>>> - change setting idle cpumask to every idle entry, otherwise schbench
>>>>   has a regression of 99th percentile latency
>>>> - change clearing idle cpumask to nohz_balancer_kick(), so updating
>>>>   idle cpumask is ratelimited in the idle exiting path
>>>> - set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target
>>>>
>>>> v1->v2:
>>>> - idle cpumask is updated in the nohz routines, by initializing idle
>>>>   cpumask with sched_domain_span(sd), nohz=off case remains the original
>>>>   behavior
>>>>
>>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>>> Cc: Mel Gorman <mgorman@suse.de>
>>>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>>>> Cc: Qais Yousef <qais.yousef@arm.com>
>>>> Cc: Valentin Schneider <valentin.schneider@arm.com>
>>>> Cc: Jiang Biao <benbjiang@gmail.com>
>>>> Cc: Tim Chen <tim.c.chen@linux.intel.com>
>>>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
>>>
>>> This version looks good to me. I don't see regressions of v5 anymore
>>> and see some improvements on heavy cases
>>
>> v5 or v8?
> 
> the v8 looks good to me and I don't see the regressions that I have
> seen with the v5 anymore
> 
Sounds great, thanks, :)

> 
>>
>>>
>>> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
>>>
>>>> ---
>>>>  include/linux/sched/topology.h | 13 ++++++++++
>>>>  kernel/sched/core.c            |  2 ++
>>>>  kernel/sched/fair.c            | 45 +++++++++++++++++++++++++++++++++-
>>>>  kernel/sched/idle.c            |  5 ++++
>>>>  kernel/sched/sched.h           |  4 +++
>>>>  kernel/sched/topology.c        |  3 ++-
>>>>  6 files changed, 70 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>>>> index 820511289857..b47b85163607 100644
>>>> --- a/include/linux/sched/topology.h
>>>> +++ b/include/linux/sched/topology.h
>>>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>>>>         atomic_t        ref;
>>>>         atomic_t        nr_busy_cpus;
>>>>         int             has_idle_cores;
>>>> +       /*
>>>> +        * Span of all idle CPUs in this domain.
>>>> +        *
>>>> +        * NOTE: this field is variable length. (Allocated dynamically
>>>> +        * by attaching extra space to the end of the structure,
>>>> +        * depending on how many CPUs the kernel has booted up with)
>>>> +        */
>>>> +       unsigned long   idle_cpus_span[];
>>>>  };
>>>>
>>>> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
>>>> +{
>>>> +       return to_cpumask(sds->idle_cpus_span);
>>>> +}
>>>> +
>>>>  struct sched_domain {
>>>>         /* These fields must be setup */
>>>>         struct sched_domain __rcu *parent;      /* top domain must be null terminated */
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index c4da7e17b906..b136e2440ea4 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -4011,6 +4011,7 @@ void scheduler_tick(void)
>>>>
>>>>  #ifdef CONFIG_SMP
>>>>         rq->idle_balance = idle_cpu(cpu);
>>>> +       update_idle_cpumask(cpu, rq->idle_balance);
>>>>         trigger_load_balance(rq);
>>>>  #endif
>>>>  }
>>>> @@ -7186,6 +7187,7 @@ void __init sched_init(void)
>>>>                 rq->idle_stamp = 0;
>>>>                 rq->avg_idle = 2*sysctl_sched_migration_cost;
>>>>                 rq->max_idle_balance_cost = sysctl_sched_migration_cost;
>>>> +               rq->last_idle_state = 1;
>>>>
>>>>                 INIT_LIST_HEAD(&rq->cfs_tasks);
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index c0c4d9ad7da8..25f36ecfee54 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -6146,7 +6146,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>>>
>>>>         time = cpu_clock(this);
>>>>
>>>> -       cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>>> +       /*
>>>> +        * sched_domain_shared is set only at shared cache level,
>>>> +        * this works only because select_idle_cpu is called with
>>>> +        * sd_llc.
>>>> +        */
>>>> +       cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
>>>>
>>>>         for_each_cpu_wrap(cpu, cpus, target) {
>>>>                 if (!--nr)
>>>> @@ -6806,6 +6811,44 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>>>>
>>>>         return newidle_balance(rq, rf) != 0;
>>>>  }
>>>> +
>>>> +/*
>>>> + * Update cpu idle state and record this information
>>>> + * in sd_llc_shared->idle_cpus_span.
>>>> + *
>>>> + * This function is called with interrupts disabled.
>>>> + */
>>>> +void update_idle_cpumask(int cpu, bool idle)
>>>> +{
>>>> +       struct sched_domain *sd;
>>>> +       struct rq *rq = cpu_rq(cpu);
>>>> +       int idle_state;
>>>> +
>>>> +       /*
>>>> +        * Also set SCHED_IDLE cpu in idle cpumask to
>>>> +        * allow SCHED_IDLE cpu as a wakeup target.
>>>> +        */
>>>> +       idle_state = idle || sched_idle_cpu(cpu);
>>>> +       /*
>>>> +        * No need to update idle cpumask if the state
>>>> +        * does not change.
>>>> +        */
>>>> +       if (rq->last_idle_state == idle_state)
>>>> +               return;
>>>> +       /*
>>>> +        * Called with irq disabled, rcu protection is not needed.
>>>> +        */
>>>> +       sd = per_cpu(sd_llc, cpu);
>>>> +       if (unlikely(!sd))
>>>> +               return;
>>>> +
>>>> +       if (idle_state)
>>>> +               cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
>>>> +       else
>>>> +               cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
>>>> +
>>>> +       rq->last_idle_state = idle_state;
>>>> +}
>>>>  #endif /* CONFIG_SMP */
>>>>
>>>>  static unsigned long wakeup_gran(struct sched_entity *se)
>>>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>>>> index f324dc36fc43..2c517d6a061a 100644
>>>> --- a/kernel/sched/idle.c
>>>> +++ b/kernel/sched/idle.c
>>>> @@ -257,6 +257,11 @@ static void do_idle(void)
>>>>                         cpuhp_report_idle_dead();
>>>>                         arch_cpu_idle_dead();
>>>>                 }
>>>> +               /*
>>>> +                * The CPU is about to go idle, set it in idle cpumask
>>>> +                * to be a wake up target.
>>>> +                */
>>>> +               update_idle_cpumask(cpu, true);
>>>>
>>>>                 arch_cpu_idle_enter();
>>>>
>>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>>> index 8d1ca65db3b0..4041d5a10de5 100644
>>>> --- a/kernel/sched/sched.h
>>>> +++ b/kernel/sched/sched.h
>>>> @@ -976,6 +976,7 @@ struct rq {
>>>>
>>>>         unsigned char           nohz_idle_balance;
>>>>         unsigned char           idle_balance;
>>>> +       unsigned char           last_idle_state;
>>>>
>>>>         unsigned long           misfit_task_load;
>>>>
>>>> @@ -1516,6 +1517,8 @@ static inline unsigned int group_first_cpu(struct sched_group *group)
>>>>
>>>>  extern int group_balance_cpu(struct sched_group *sg);
>>>>
>>>> +void update_idle_cpumask(int cpu, bool idle);
>>>> +
>>>>  #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL)
>>>>  void register_sched_domain_sysctl(void);
>>>>  void dirty_sched_domain_sysctl(int cpu);
>>>> @@ -1536,6 +1539,7 @@ extern void flush_smp_call_function_from_idle(void);
>>>>
>>>>  #else /* !CONFIG_SMP: */
>>>>  static inline void flush_smp_call_function_from_idle(void) { }
>>>> +static inline void update_idle_cpumask(int cpu, bool idle) { }
>>>>  #endif
>>>>
>>>>  #include "stats.h"
>>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>>>> index 1bd7e3af904f..541bd3a7de4d 100644
>>>> --- a/kernel/sched/topology.c
>>>> +++ b/kernel/sched/topology.c
>>>> @@ -1407,6 +1407,7 @@ sd_init(struct sched_domain_topology_level *tl,
>>>>                 sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
>>>>                 atomic_inc(&sd->shared->ref);
>>>>                 atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
>>>> +               cpumask_copy(sds_idle_cpus(sd->shared), sched_domain_span(sd));
>>>>         }
>>>>
>>>>         sd->private = sdd;
>>>> @@ -1769,7 +1770,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
>>>>
>>>>                         *per_cpu_ptr(sdd->sd, j) = sd;
>>>>
>>>> -                       sds = kzalloc_node(sizeof(struct sched_domain_shared),
>>>> +                       sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
>>>>                                         GFP_KERNEL, cpu_to_node(j));
>>>>                         if (!sds)
>>>>                                 return -ENOMEM;
>>>> --
>>>> 2.25.1
>>>>
>>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: [RFC PATCH v8] sched/fair: select idle cpu from idle cpumask for task wakeup
  2020-12-11 15:24       ` Li, Aubrey
@ 2020-12-13 23:29         ` Song Bao Hua (Barry Song)
  2020-12-15 12:41           ` Li, Aubrey
  0 siblings, 1 reply; 10+ messages in thread
From: Song Bao Hua (Barry Song) @ 2020-12-13 23:29 UTC (permalink / raw)
  To: Li, Aubrey, Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Mel Gorman,
	Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, linux-kernel, Mel Gorman,
	Jiang Biao



> -----Original Message-----
> From: Li, Aubrey [mailto:aubrey.li@linux.intel.com]
> Sent: Saturday, December 12, 2020 4:25 AM
> To: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Ingo Molnar <mingo@redhat.com>; Peter Zijlstra <peterz@infradead.org>;
> Juri Lelli <juri.lelli@redhat.com>; Mel Gorman <mgorman@techsingularity.net>;
> Valentin Schneider <valentin.schneider@arm.com>; Qais Yousef
> <qais.yousef@arm.com>; Dietmar Eggemann <dietmar.eggemann@arm.com>; Steven
> Rostedt <rostedt@goodmis.org>; Ben Segall <bsegall@google.com>; Tim Chen
> <tim.c.chen@linux.intel.com>; linux-kernel <linux-kernel@vger.kernel.org>;
> Mel Gorman <mgorman@suse.de>; Jiang Biao <benbjiang@gmail.com>
> Subject: Re: [RFC PATCH v8] sched/fair: select idle cpu from idle cpumask for
> task wakeup
> 
> On 2020/12/11 23:22, Vincent Guittot wrote:
> > On Fri, 11 Dec 2020 at 16:19, Li, Aubrey <aubrey.li@linux.intel.com> wrote:
> >>
> >> On 2020/12/11 23:07, Vincent Guittot wrote:
> >>> On Thu, 10 Dec 2020 at 02:44, Aubrey Li <aubrey.li@linux.intel.com> wrote:
> >>>>
> >>>> Add idle cpumask to track idle cpus in sched domain. Every time
> >>>> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
> >>>> target. And if the CPU is not in idle, the CPU is cleared in idle
> >>>> cpumask during scheduler tick to ratelimit idle cpumask update.
> >>>>
> >>>> When a task wakes up to select an idle cpu, scanning idle cpumask
> >>>> has lower cost than scanning all the cpus in last level cache domain,
> >>>> especially when the system is heavily loaded.
> >>>>
> >>>> Benchmarks including hackbench, schbench, uperf, sysbench mysql and
> >>>> kbuild have been tested on a x86 4 socket system with 24 cores per
> >>>> socket and 2 hyperthreads per core, total 192 CPUs, no regression
> >>>> found.
> >>>>
> >>>> v7->v8:
> >>>> - refine update_idle_cpumask, no functionality change
> >>>> - fix a suspicious RCU usage warning with CONFIG_PROVE_RCU=y
> >>>>
> >>>> v6->v7:
> >>>> - place the whole idle cpumask mechanism under CONFIG_SMP
> >>>>
> >>>> v5->v6:
> >>>> - decouple idle cpumask update from stop_tick signal, set idle CPU
> >>>>   in idle cpumask every time the CPU enters idle
> >>>>
> >>>> v4->v5:
> >>>> - add update_idle_cpumask for s2idle case
> >>>> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
> >>>>   idle_cpumask() everywhere
> >>>>
> >>>> v3->v4:
> >>>> - change setting idle cpumask from every idle entry to tickless idle
> >>>>   if cpu driver is available
> >>>> - move clearing idle cpumask to scheduler_tick to decouple nohz mode
> >>>>
> >>>> v2->v3:
> >>>> - change setting idle cpumask to every idle entry, otherwise schbench
> >>>>   has a regression of 99th percentile latency
> >>>> - change clearing idle cpumask to nohz_balancer_kick(), so updating
> >>>>   idle cpumask is ratelimited in the idle exiting path
> >>>> - set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target
> >>>>
> >>>> v1->v2:
> >>>> - idle cpumask is updated in the nohz routines, by initializing idle
> >>>>   cpumask with sched_domain_span(sd), nohz=off case remains the original
> >>>>   behavior
> >>>>
> >>>> Cc: Peter Zijlstra <peterz@infradead.org>
> >>>> Cc: Mel Gorman <mgorman@suse.de>
> >>>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> >>>> Cc: Qais Yousef <qais.yousef@arm.com>
> >>>> Cc: Valentin Schneider <valentin.schneider@arm.com>
> >>>> Cc: Jiang Biao <benbjiang@gmail.com>
> >>>> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> >>>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> >>>
> >>> This version looks good to me. I don't see regressions of v5 anymore
> >>> and see some improvements on heavy cases
> >>
> >> v5 or v8?
> >
> > the v8 looks good to me and I don't see the regressions that I have
> > seen with the v5 anymore
> >
> Sounds great, thanks, :)


Hi Aubrey,

The patch looks great. But I didn't find any hackbench improvement
on kunpeng 920 which has 24 cores for each llc span. Llc span is also
one numa node. The topology is like:
# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 0 size: 128669 MB
node 0 free: 126995 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
43 44 45 46 47
node 1 size: 128997 MB
node 1 free: 127539 MB
node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66
67 68 69 70 71
node 2 size: 129021 MB
node 2 free: 127106 MB
node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
91 92 93 94 95
node 3 size: 127993 MB
node 3 free: 126739 MB
node distances:
node   0   1   2   3
  0:  10  12  20  22
  1:  12  10  22  24
  2:  20  22  10  12
  3:  22  24  12  10

Benchmark command:
numactl -N 0-1 hackbench -p -T -l 20000 -g $1

for each g, I ran 10 times to get the average time. And I tested
g from 1 to 10.

g     1      2      3      4      5      6       7     8        9       10
w/o   1.4733 1.5992 1.9353 2.1563 2.8448 3.3305 3.9616 4.4870 5.0786 5.6983
w/    1.4709 1.6152 1.9474 2.1512 2.8298 3.2998 3.9472 4.4803 5.0462 5.6505

Is it because the core number is small in llc span in my test?

Thanks
Barry

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH v8] sched/fair: select idle cpu from idle cpumask for task wakeup
  2020-12-13 23:29         ` Song Bao Hua (Barry Song)
@ 2020-12-15 12:41           ` Li, Aubrey
  0 siblings, 0 replies; 10+ messages in thread
From: Li, Aubrey @ 2020-12-15 12:41 UTC (permalink / raw)
  To: Song Bao Hua (Barry Song), Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Mel Gorman,
	Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, linux-kernel, Mel Gorman,
	Jiang Biao

Hi Bao Hua,

Sorry I almost missed this message, :(

On 2020/12/14 7:29, Song Bao Hua (Barry Song) wrote:
> 
> Hi Aubrey,
> 
> The patch looks great. But I didn't find any hackbench improvement
> on kunpeng 920 which has 24 cores for each llc span. Llc span is also
> one numa node. The topology is like:
> # numactl --hardware
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
> node 0 size: 128669 MB
> node 0 free: 126995 MB
> node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
> 43 44 45 46 47
> node 1 size: 128997 MB
> node 1 free: 127539 MB
> node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66
> 67 68 69 70 71
> node 2 size: 129021 MB
> node 2 free: 127106 MB
> node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
> 91 92 93 94 95
> node 3 size: 127993 MB
> node 3 free: 126739 MB
> node distances:
> node   0   1   2   3
>   0:  10  12  20  22
>   1:  12  10  22  24
>   2:  20  22  10  12
>   3:  22  24  12  10
> 
> Benchmark command:
> numactl -N 0-1 hackbench -p -T -l 20000 -g $1
> 
> for each g, I ran 10 times to get the average time. And I tested
> g from 1 to 10.
> 
> g     1      2      3      4      5      6       7     8        9       10
> w/o   1.4733 1.5992 1.9353 2.1563 2.8448 3.3305 3.9616 4.4870 5.0786 5.6983
> w/    1.4709 1.6152 1.9474 2.1512 2.8298 3.2998 3.9472 4.4803 5.0462 5.6505
> 
> Is it because the core number is small in llc span in my test?

I guess it is with SIS_PROP, when the system is very busy that idle cpu scan
loop is throttled by nr(4). The patch actually reduces 4 times scan so the
data change looks marginal. Vincent mentioned a notable change at here:
	
	https://lkml.org/lkml/2020/12/14/109

Maybe you can increase the group number to see if it can be reproduced on
your side.

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH v8] sched/fair: select idle cpu from idle cpumask for task wakeup
  2020-12-11 15:07 ` Vincent Guittot
  2020-12-11 15:18   ` Li, Aubrey
@ 2021-03-04 13:51   ` Li, Aubrey
  2021-03-08 11:30     ` Vincent Guittot
  1 sibling, 1 reply; 10+ messages in thread
From: Li, Aubrey @ 2021-03-04 13:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Ingo Molnar, Juri Lelli, Mel Gorman,
	Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, linux-kernel, Mel Gorman,
	Jiang Biao

Hi Peter,

On 2020/12/11 23:07, Vincent Guittot wrote:
> On Thu, 10 Dec 2020 at 02:44, Aubrey Li <aubrey.li@linux.intel.com> wrote:
>>
>> Add idle cpumask to track idle cpus in sched domain. Every time
>> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
>> target. And if the CPU is not in idle, the CPU is cleared in idle
>> cpumask during scheduler tick to ratelimit idle cpumask update.
>>
>> When a task wakes up to select an idle cpu, scanning idle cpumask
>> has lower cost than scanning all the cpus in last level cache domain,
>> especially when the system is heavily loaded.
>>
>> Benchmarks including hackbench, schbench, uperf, sysbench mysql and
>> kbuild have been tested on a x86 4 socket system with 24 cores per
>> socket and 2 hyperthreads per core, total 192 CPUs, no regression
>> found.
>>
----snip----
>>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>> Cc: Qais Yousef <qais.yousef@arm.com>
>> Cc: Valentin Schneider <valentin.schneider@arm.com>
>> Cc: Jiang Biao <benbjiang@gmail.com>
>> Cc: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> 
> This version looks good to me. I don't see regressions of v5 anymore
> and see some improvements on heavy cases
> 
> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

May I know your thoughts about this patch? 
Is it cpumask operation potentially too expensive to be here?

Thanks,
-Aubrey
> 
>> ---
>>  include/linux/sched/topology.h | 13 ++++++++++
>>  kernel/sched/core.c            |  2 ++
>>  kernel/sched/fair.c            | 45 +++++++++++++++++++++++++++++++++-
>>  kernel/sched/idle.c            |  5 ++++
>>  kernel/sched/sched.h           |  4 +++
>>  kernel/sched/topology.c        |  3 ++-
>>  6 files changed, 70 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index 820511289857..b47b85163607 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>>         atomic_t        ref;
>>         atomic_t        nr_busy_cpus;
>>         int             has_idle_cores;
>> +       /*
>> +        * Span of all idle CPUs in this domain.
>> +        *
>> +        * NOTE: this field is variable length. (Allocated dynamically
>> +        * by attaching extra space to the end of the structure,
>> +        * depending on how many CPUs the kernel has booted up with)
>> +        */
>> +       unsigned long   idle_cpus_span[];
>>  };
>>
>> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
>> +{
>> +       return to_cpumask(sds->idle_cpus_span);
>> +}
>> +
>>  struct sched_domain {
>>         /* These fields must be setup */
>>         struct sched_domain __rcu *parent;      /* top domain must be null terminated */
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index c4da7e17b906..b136e2440ea4 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -4011,6 +4011,7 @@ void scheduler_tick(void)
>>
>>  #ifdef CONFIG_SMP
>>         rq->idle_balance = idle_cpu(cpu);
>> +       update_idle_cpumask(cpu, rq->idle_balance);
>>         trigger_load_balance(rq);
>>  #endif
>>  }
>> @@ -7186,6 +7187,7 @@ void __init sched_init(void)
>>                 rq->idle_stamp = 0;
>>                 rq->avg_idle = 2*sysctl_sched_migration_cost;
>>                 rq->max_idle_balance_cost = sysctl_sched_migration_cost;
>> +               rq->last_idle_state = 1;
>>
>>                 INIT_LIST_HEAD(&rq->cfs_tasks);
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index c0c4d9ad7da8..25f36ecfee54 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6146,7 +6146,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>
>>         time = cpu_clock(this);
>>
>> -       cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>> +       /*
>> +        * sched_domain_shared is set only at shared cache level,
>> +        * this works only because select_idle_cpu is called with
>> +        * sd_llc.
>> +        */
>> +       cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
>>
>>         for_each_cpu_wrap(cpu, cpus, target) {
>>                 if (!--nr)
>> @@ -6806,6 +6811,44 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>>
>>         return newidle_balance(rq, rf) != 0;
>>  }
>> +
>> +/*
>> + * Update cpu idle state and record this information
>> + * in sd_llc_shared->idle_cpus_span.
>> + *
>> + * This function is called with interrupts disabled.
>> + */
>> +void update_idle_cpumask(int cpu, bool idle)
>> +{
>> +       struct sched_domain *sd;
>> +       struct rq *rq = cpu_rq(cpu);
>> +       int idle_state;
>> +
>> +       /*
>> +        * Also set SCHED_IDLE cpu in idle cpumask to
>> +        * allow SCHED_IDLE cpu as a wakeup target.
>> +        */
>> +       idle_state = idle || sched_idle_cpu(cpu);
>> +       /*
>> +        * No need to update idle cpumask if the state
>> +        * does not change.
>> +        */
>> +       if (rq->last_idle_state == idle_state)
>> +               return;
>> +       /*
>> +        * Called with irq disabled, rcu protection is not needed.
>> +        */
>> +       sd = per_cpu(sd_llc, cpu);
>> +       if (unlikely(!sd))
>> +               return;
>> +
>> +       if (idle_state)
>> +               cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
>> +       else
>> +               cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
>> +
>> +       rq->last_idle_state = idle_state;
>> +}
>>  #endif /* CONFIG_SMP */
>>
>>  static unsigned long wakeup_gran(struct sched_entity *se)
>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>> index f324dc36fc43..2c517d6a061a 100644
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -257,6 +257,11 @@ static void do_idle(void)
>>                         cpuhp_report_idle_dead();
>>                         arch_cpu_idle_dead();
>>                 }
>> +               /*
>> +                * The CPU is about to go idle, set it in idle cpumask
>> +                * to be a wake up target.
>> +                */
>> +               update_idle_cpumask(cpu, true);
>>
>>                 arch_cpu_idle_enter();
>>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 8d1ca65db3b0..4041d5a10de5 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -976,6 +976,7 @@ struct rq {
>>
>>         unsigned char           nohz_idle_balance;
>>         unsigned char           idle_balance;
>> +       unsigned char           last_idle_state;
>>
>>         unsigned long           misfit_task_load;
>>
>> @@ -1516,6 +1517,8 @@ static inline unsigned int group_first_cpu(struct sched_group *group)
>>
>>  extern int group_balance_cpu(struct sched_group *sg);
>>
>> +void update_idle_cpumask(int cpu, bool idle);
>> +
>>  #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL)
>>  void register_sched_domain_sysctl(void);
>>  void dirty_sched_domain_sysctl(int cpu);
>> @@ -1536,6 +1539,7 @@ extern void flush_smp_call_function_from_idle(void);
>>
>>  #else /* !CONFIG_SMP: */
>>  static inline void flush_smp_call_function_from_idle(void) { }
>> +static inline void update_idle_cpumask(int cpu, bool idle) { }
>>  #endif
>>
>>  #include "stats.h"
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 1bd7e3af904f..541bd3a7de4d 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -1407,6 +1407,7 @@ sd_init(struct sched_domain_topology_level *tl,
>>                 sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
>>                 atomic_inc(&sd->shared->ref);
>>                 atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
>> +               cpumask_copy(sds_idle_cpus(sd->shared), sched_domain_span(sd));
>>         }
>>
>>         sd->private = sdd;
>> @@ -1769,7 +1770,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
>>
>>                         *per_cpu_ptr(sdd->sd, j) = sd;
>>
>> -                       sds = kzalloc_node(sizeof(struct sched_domain_shared),
>> +                       sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
>>                                         GFP_KERNEL, cpu_to_node(j));
>>                         if (!sds)
>>                                 return -ENOMEM;
>> --
>> 2.25.1
>>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH v8] sched/fair: select idle cpu from idle cpumask for task wakeup
  2021-03-04 13:51   ` Li, Aubrey
@ 2021-03-08 11:30     ` Vincent Guittot
  2021-03-08 13:50       ` Li, Aubrey
  0 siblings, 1 reply; 10+ messages in thread
From: Vincent Guittot @ 2021-03-08 11:30 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Mel Gorman,
	Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, linux-kernel, Mel Gorman,
	Jiang Biao

Hi Aubrey,

On Thu, 4 Mar 2021 at 14:51, Li, Aubrey <aubrey.li@linux.intel.com> wrote:
>
> Hi Peter,
>
> On 2020/12/11 23:07, Vincent Guittot wrote:
> > On Thu, 10 Dec 2020 at 02:44, Aubrey Li <aubrey.li@linux.intel.com> wrote:
> >>
> >> Add idle cpumask to track idle cpus in sched domain. Every time
> >> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
> >> target. And if the CPU is not in idle, the CPU is cleared in idle
> >> cpumask during scheduler tick to ratelimit idle cpumask update.
> >>
> >> When a task wakes up to select an idle cpu, scanning idle cpumask
> >> has lower cost than scanning all the cpus in last level cache domain,
> >> especially when the system is heavily loaded.
> >>
> >> Benchmarks including hackbench, schbench, uperf, sysbench mysql and
> >> kbuild have been tested on a x86 4 socket system with 24 cores per
> >> socket and 2 hyperthreads per core, total 192 CPUs, no regression
> >> found.
> >>
> ----snip----
> >>
> >> Cc: Peter Zijlstra <peterz@infradead.org>
> >> Cc: Mel Gorman <mgorman@suse.de>
> >> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> >> Cc: Qais Yousef <qais.yousef@arm.com>
> >> Cc: Valentin Schneider <valentin.schneider@arm.com>
> >> Cc: Jiang Biao <benbjiang@gmail.com>
> >> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> >> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> >
> > This version looks good to me. I don't see regressions of v5 anymore
> > and see some improvements on heavy cases
> >
> > Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> May I know your thoughts about this patch?
> Is it cpumask operation potentially too expensive to be here?

Could you rebase your patch ? It doesn't apply anymore on
tip/sched/core was recent changes

>
> Thanks,
> -Aubrey
> >
> >> ---
> >>  include/linux/sched/topology.h | 13 ++++++++++
> >>  kernel/sched/core.c            |  2 ++
> >>  kernel/sched/fair.c            | 45 +++++++++++++++++++++++++++++++++-
> >>  kernel/sched/idle.c            |  5 ++++
> >>  kernel/sched/sched.h           |  4 +++
> >>  kernel/sched/topology.c        |  3 ++-
> >>  6 files changed, 70 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> >> index 820511289857..b47b85163607 100644
> >> --- a/include/linux/sched/topology.h
> >> +++ b/include/linux/sched/topology.h
> >> @@ -65,8 +65,21 @@ struct sched_domain_shared {
> >>         atomic_t        ref;
> >>         atomic_t        nr_busy_cpus;
> >>         int             has_idle_cores;
> >> +       /*
> >> +        * Span of all idle CPUs in this domain.
> >> +        *
> >> +        * NOTE: this field is variable length. (Allocated dynamically
> >> +        * by attaching extra space to the end of the structure,
> >> +        * depending on how many CPUs the kernel has booted up with)
> >> +        */
> >> +       unsigned long   idle_cpus_span[];
> >>  };
> >>
> >> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
> >> +{
> >> +       return to_cpumask(sds->idle_cpus_span);
> >> +}
> >> +
> >>  struct sched_domain {
> >>         /* These fields must be setup */
> >>         struct sched_domain __rcu *parent;      /* top domain must be null terminated */
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index c4da7e17b906..b136e2440ea4 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -4011,6 +4011,7 @@ void scheduler_tick(void)
> >>
> >>  #ifdef CONFIG_SMP
> >>         rq->idle_balance = idle_cpu(cpu);
> >> +       update_idle_cpumask(cpu, rq->idle_balance);
> >>         trigger_load_balance(rq);
> >>  #endif
> >>  }
> >> @@ -7186,6 +7187,7 @@ void __init sched_init(void)
> >>                 rq->idle_stamp = 0;
> >>                 rq->avg_idle = 2*sysctl_sched_migration_cost;
> >>                 rq->max_idle_balance_cost = sysctl_sched_migration_cost;
> >> +               rq->last_idle_state = 1;
> >>
> >>                 INIT_LIST_HEAD(&rq->cfs_tasks);
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index c0c4d9ad7da8..25f36ecfee54 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -6146,7 +6146,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >>
> >>         time = cpu_clock(this);
> >>
> >> -       cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >> +       /*
> >> +        * sched_domain_shared is set only at shared cache level,
> >> +        * this works only because select_idle_cpu is called with
> >> +        * sd_llc.
> >> +        */
> >> +       cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
> >>
> >>         for_each_cpu_wrap(cpu, cpus, target) {
> >>                 if (!--nr)
> >> @@ -6806,6 +6811,44 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >>
> >>         return newidle_balance(rq, rf) != 0;
> >>  }
> >> +
> >> +/*
> >> + * Update cpu idle state and record this information
> >> + * in sd_llc_shared->idle_cpus_span.
> >> + *
> >> + * This function is called with interrupts disabled.
> >> + */
> >> +void update_idle_cpumask(int cpu, bool idle)
> >> +{
> >> +       struct sched_domain *sd;
> >> +       struct rq *rq = cpu_rq(cpu);
> >> +       int idle_state;
> >> +
> >> +       /*
> >> +        * Also set SCHED_IDLE cpu in idle cpumask to
> >> +        * allow SCHED_IDLE cpu as a wakeup target.
> >> +        */
> >> +       idle_state = idle || sched_idle_cpu(cpu);
> >> +       /*
> >> +        * No need to update idle cpumask if the state
> >> +        * does not change.
> >> +        */
> >> +       if (rq->last_idle_state == idle_state)
> >> +               return;
> >> +       /*
> >> +        * Called with irq disabled, rcu protection is not needed.
> >> +        */
> >> +       sd = per_cpu(sd_llc, cpu);
> >> +       if (unlikely(!sd))
> >> +               return;
> >> +
> >> +       if (idle_state)
> >> +               cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> >> +       else
> >> +               cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
> >> +
> >> +       rq->last_idle_state = idle_state;
> >> +}
> >>  #endif /* CONFIG_SMP */
> >>
> >>  static unsigned long wakeup_gran(struct sched_entity *se)
> >> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> >> index f324dc36fc43..2c517d6a061a 100644
> >> --- a/kernel/sched/idle.c
> >> +++ b/kernel/sched/idle.c
> >> @@ -257,6 +257,11 @@ static void do_idle(void)
> >>                         cpuhp_report_idle_dead();
> >>                         arch_cpu_idle_dead();
> >>                 }
> >> +               /*
> >> +                * The CPU is about to go idle, set it in idle cpumask
> >> +                * to be a wake up target.
> >> +                */
> >> +               update_idle_cpumask(cpu, true);
> >>
> >>                 arch_cpu_idle_enter();
> >>
> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >> index 8d1ca65db3b0..4041d5a10de5 100644
> >> --- a/kernel/sched/sched.h
> >> +++ b/kernel/sched/sched.h
> >> @@ -976,6 +976,7 @@ struct rq {
> >>
> >>         unsigned char           nohz_idle_balance;
> >>         unsigned char           idle_balance;
> >> +       unsigned char           last_idle_state;
> >>
> >>         unsigned long           misfit_task_load;
> >>
> >> @@ -1516,6 +1517,8 @@ static inline unsigned int group_first_cpu(struct sched_group *group)
> >>
> >>  extern int group_balance_cpu(struct sched_group *sg);
> >>
> >> +void update_idle_cpumask(int cpu, bool idle);
> >> +
> >>  #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL)
> >>  void register_sched_domain_sysctl(void);
> >>  void dirty_sched_domain_sysctl(int cpu);
> >> @@ -1536,6 +1539,7 @@ extern void flush_smp_call_function_from_idle(void);
> >>
> >>  #else /* !CONFIG_SMP: */
> >>  static inline void flush_smp_call_function_from_idle(void) { }
> >> +static inline void update_idle_cpumask(int cpu, bool idle) { }
> >>  #endif
> >>
> >>  #include "stats.h"
> >> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> >> index 1bd7e3af904f..541bd3a7de4d 100644
> >> --- a/kernel/sched/topology.c
> >> +++ b/kernel/sched/topology.c
> >> @@ -1407,6 +1407,7 @@ sd_init(struct sched_domain_topology_level *tl,
> >>                 sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> >>                 atomic_inc(&sd->shared->ref);
> >>                 atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> >> +               cpumask_copy(sds_idle_cpus(sd->shared), sched_domain_span(sd));
> >>         }
> >>
> >>         sd->private = sdd;
> >> @@ -1769,7 +1770,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
> >>
> >>                         *per_cpu_ptr(sdd->sd, j) = sd;
> >>
> >> -                       sds = kzalloc_node(sizeof(struct sched_domain_shared),
> >> +                       sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
> >>                                         GFP_KERNEL, cpu_to_node(j));
> >>                         if (!sds)
> >>                                 return -ENOMEM;
> >> --
> >> 2.25.1
> >>
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH v8] sched/fair: select idle cpu from idle cpumask for task wakeup
  2021-03-08 11:30     ` Vincent Guittot
@ 2021-03-08 13:50       ` Li, Aubrey
  0 siblings, 0 replies; 10+ messages in thread
From: Li, Aubrey @ 2021-03-08 13:50 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Mel Gorman,
	Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, linux-kernel, Mel Gorman,
	Jiang Biao

On 2021/3/8 19:30, Vincent Guittot wrote:
> Hi Aubrey,
> 
> On Thu, 4 Mar 2021 at 14:51, Li, Aubrey <aubrey.li@linux.intel.com> wrote:
>>
>> Hi Peter,
>>
>> On 2020/12/11 23:07, Vincent Guittot wrote:
>>> On Thu, 10 Dec 2020 at 02:44, Aubrey Li <aubrey.li@linux.intel.com> wrote:
>>>>
>>>> Add idle cpumask to track idle cpus in sched domain. Every time
>>>> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
>>>> target. And if the CPU is not in idle, the CPU is cleared in idle
>>>> cpumask during scheduler tick to ratelimit idle cpumask update.
>>>>
>>>> When a task wakes up to select an idle cpu, scanning idle cpumask
>>>> has lower cost than scanning all the cpus in last level cache domain,
>>>> especially when the system is heavily loaded.
>>>>
>>>> Benchmarks including hackbench, schbench, uperf, sysbench mysql and
>>>> kbuild have been tested on a x86 4 socket system with 24 cores per
>>>> socket and 2 hyperthreads per core, total 192 CPUs, no regression
>>>> found.
>>>>
>> ----snip----
>>>>
>>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>>> Cc: Mel Gorman <mgorman@suse.de>
>>>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>>>> Cc: Qais Yousef <qais.yousef@arm.com>
>>>> Cc: Valentin Schneider <valentin.schneider@arm.com>
>>>> Cc: Jiang Biao <benbjiang@gmail.com>
>>>> Cc: Tim Chen <tim.c.chen@linux.intel.com>
>>>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
>>>
>>> This version looks good to me. I don't see regressions of v5 anymore
>>> and see some improvements on heavy cases
>>>
>>> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
>>
>> May I know your thoughts about this patch?
>> Is it cpumask operation potentially too expensive to be here?
> 
> Could you rebase your patch ? It doesn't apply anymore on
> tip/sched/core was recent changes

Okay, I'll rebase it and send a v9 out soon, thanks Vincent.

> 
>>
>> Thanks,
>> -Aubrey
>>>
>>>> ---
>>>>  include/linux/sched/topology.h | 13 ++++++++++
>>>>  kernel/sched/core.c            |  2 ++
>>>>  kernel/sched/fair.c            | 45 +++++++++++++++++++++++++++++++++-
>>>>  kernel/sched/idle.c            |  5 ++++
>>>>  kernel/sched/sched.h           |  4 +++
>>>>  kernel/sched/topology.c        |  3 ++-
>>>>  6 files changed, 70 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>>>> index 820511289857..b47b85163607 100644
>>>> --- a/include/linux/sched/topology.h
>>>> +++ b/include/linux/sched/topology.h
>>>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>>>>         atomic_t        ref;
>>>>         atomic_t        nr_busy_cpus;
>>>>         int             has_idle_cores;
>>>> +       /*
>>>> +        * Span of all idle CPUs in this domain.
>>>> +        *
>>>> +        * NOTE: this field is variable length. (Allocated dynamically
>>>> +        * by attaching extra space to the end of the structure,
>>>> +        * depending on how many CPUs the kernel has booted up with)
>>>> +        */
>>>> +       unsigned long   idle_cpus_span[];
>>>>  };
>>>>
>>>> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
>>>> +{
>>>> +       return to_cpumask(sds->idle_cpus_span);
>>>> +}
>>>> +
>>>>  struct sched_domain {
>>>>         /* These fields must be setup */
>>>>         struct sched_domain __rcu *parent;      /* top domain must be null terminated */
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index c4da7e17b906..b136e2440ea4 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -4011,6 +4011,7 @@ void scheduler_tick(void)
>>>>
>>>>  #ifdef CONFIG_SMP
>>>>         rq->idle_balance = idle_cpu(cpu);
>>>> +       update_idle_cpumask(cpu, rq->idle_balance);
>>>>         trigger_load_balance(rq);
>>>>  #endif
>>>>  }
>>>> @@ -7186,6 +7187,7 @@ void __init sched_init(void)
>>>>                 rq->idle_stamp = 0;
>>>>                 rq->avg_idle = 2*sysctl_sched_migration_cost;
>>>>                 rq->max_idle_balance_cost = sysctl_sched_migration_cost;
>>>> +               rq->last_idle_state = 1;
>>>>
>>>>                 INIT_LIST_HEAD(&rq->cfs_tasks);
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index c0c4d9ad7da8..25f36ecfee54 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -6146,7 +6146,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>>>
>>>>         time = cpu_clock(this);
>>>>
>>>> -       cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>>> +       /*
>>>> +        * sched_domain_shared is set only at shared cache level,
>>>> +        * this works only because select_idle_cpu is called with
>>>> +        * sd_llc.
>>>> +        */
>>>> +       cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
>>>>
>>>>         for_each_cpu_wrap(cpu, cpus, target) {
>>>>                 if (!--nr)
>>>> @@ -6806,6 +6811,44 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>>>>
>>>>         return newidle_balance(rq, rf) != 0;
>>>>  }
>>>> +
>>>> +/*
>>>> + * Update cpu idle state and record this information
>>>> + * in sd_llc_shared->idle_cpus_span.
>>>> + *
>>>> + * This function is called with interrupts disabled.
>>>> + */
>>>> +void update_idle_cpumask(int cpu, bool idle)
>>>> +{
>>>> +       struct sched_domain *sd;
>>>> +       struct rq *rq = cpu_rq(cpu);
>>>> +       int idle_state;
>>>> +
>>>> +       /*
>>>> +        * Also set SCHED_IDLE cpu in idle cpumask to
>>>> +        * allow SCHED_IDLE cpu as a wakeup target.
>>>> +        */
>>>> +       idle_state = idle || sched_idle_cpu(cpu);
>>>> +       /*
>>>> +        * No need to update idle cpumask if the state
>>>> +        * does not change.
>>>> +        */
>>>> +       if (rq->last_idle_state == idle_state)
>>>> +               return;
>>>> +       /*
>>>> +        * Called with irq disabled, rcu protection is not needed.
>>>> +        */
>>>> +       sd = per_cpu(sd_llc, cpu);
>>>> +       if (unlikely(!sd))
>>>> +               return;
>>>> +
>>>> +       if (idle_state)
>>>> +               cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
>>>> +       else
>>>> +               cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
>>>> +
>>>> +       rq->last_idle_state = idle_state;
>>>> +}
>>>>  #endif /* CONFIG_SMP */
>>>>
>>>>  static unsigned long wakeup_gran(struct sched_entity *se)
>>>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>>>> index f324dc36fc43..2c517d6a061a 100644
>>>> --- a/kernel/sched/idle.c
>>>> +++ b/kernel/sched/idle.c
>>>> @@ -257,6 +257,11 @@ static void do_idle(void)
>>>>                         cpuhp_report_idle_dead();
>>>>                         arch_cpu_idle_dead();
>>>>                 }
>>>> +               /*
>>>> +                * The CPU is about to go idle, set it in idle cpumask
>>>> +                * to be a wake up target.
>>>> +                */
>>>> +               update_idle_cpumask(cpu, true);
>>>>
>>>>                 arch_cpu_idle_enter();
>>>>
>>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>>> index 8d1ca65db3b0..4041d5a10de5 100644
>>>> --- a/kernel/sched/sched.h
>>>> +++ b/kernel/sched/sched.h
>>>> @@ -976,6 +976,7 @@ struct rq {
>>>>
>>>>         unsigned char           nohz_idle_balance;
>>>>         unsigned char           idle_balance;
>>>> +       unsigned char           last_idle_state;
>>>>
>>>>         unsigned long           misfit_task_load;
>>>>
>>>> @@ -1516,6 +1517,8 @@ static inline unsigned int group_first_cpu(struct sched_group *group)
>>>>
>>>>  extern int group_balance_cpu(struct sched_group *sg);
>>>>
>>>> +void update_idle_cpumask(int cpu, bool idle);
>>>> +
>>>>  #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL)
>>>>  void register_sched_domain_sysctl(void);
>>>>  void dirty_sched_domain_sysctl(int cpu);
>>>> @@ -1536,6 +1539,7 @@ extern void flush_smp_call_function_from_idle(void);
>>>>
>>>>  #else /* !CONFIG_SMP: */
>>>>  static inline void flush_smp_call_function_from_idle(void) { }
>>>> +static inline void update_idle_cpumask(int cpu, bool idle) { }
>>>>  #endif
>>>>
>>>>  #include "stats.h"
>>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>>>> index 1bd7e3af904f..541bd3a7de4d 100644
>>>> --- a/kernel/sched/topology.c
>>>> +++ b/kernel/sched/topology.c
>>>> @@ -1407,6 +1407,7 @@ sd_init(struct sched_domain_topology_level *tl,
>>>>                 sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
>>>>                 atomic_inc(&sd->shared->ref);
>>>>                 atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
>>>> +               cpumask_copy(sds_idle_cpus(sd->shared), sched_domain_span(sd));
>>>>         }
>>>>
>>>>         sd->private = sdd;
>>>> @@ -1769,7 +1770,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
>>>>
>>>>                         *per_cpu_ptr(sdd->sd, j) = sd;
>>>>
>>>> -                       sds = kzalloc_node(sizeof(struct sched_domain_shared),
>>>> +                       sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
>>>>                                         GFP_KERNEL, cpu_to_node(j));
>>>>                         if (!sds)
>>>>                                 return -ENOMEM;
>>>> --
>>>> 2.25.1
>>>>
>>


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2021-03-08 13:51 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-10  1:43 [RFC PATCH v8] sched/fair: select idle cpu from idle cpumask for task wakeup Aubrey Li
2020-12-11 15:07 ` Vincent Guittot
2020-12-11 15:18   ` Li, Aubrey
2020-12-11 15:22     ` Vincent Guittot
2020-12-11 15:24       ` Li, Aubrey
2020-12-13 23:29         ` Song Bao Hua (Barry Song)
2020-12-15 12:41           ` Li, Aubrey
2021-03-04 13:51   ` Li, Aubrey
2021-03-08 11:30     ` Vincent Guittot
2021-03-08 13:50       ` Li, Aubrey

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).