* [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
@ 2020-12-09  6:24 Aubrey Li
  2020-12-09  8:15 ` Vincent Guittot
  2020-12-09 14:36 ` Mel Gorman
  0 siblings, 2 replies; 22+ messages in thread
From: Aubrey Li @ 2020-12-09  6:24 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, mgorman,
	valentin.schneider, qais.yousef, dietmar.eggemann, rostedt,
	bsegall
  Cc: tim.c.chen, linux-kernel, Aubrey Li, Mel Gorman, Jiang Biao

Add an idle cpumask to track idle CPUs in the sched domain. Every time
a CPU enters idle, it is set in the idle cpumask to be a wakeup
target. If the CPU is not idle, it is cleared from the idle cpumask
during the scheduler tick, to rate-limit idle cpumask updates.

When a task wakes up and selects an idle CPU, scanning the idle cpumask
has a lower cost than scanning all the CPUs in the last-level-cache
domain, especially when the system is heavily loaded.

Benchmarks including hackbench, schbench, uperf, sysbench MySQL and
kbuild were run on an x86 4-socket system with 24 cores per socket and
2 hyperthreads per core (192 CPUs in total); no regression was found.
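
Conceptually, the change has three touch points (an editor's condensed
sketch of the diff below, using the names introduced by the patch):

	/* idle entry (do_idle): advertise the CPU as a wakeup target */
	update_idle_cpumask(cpu, true);    /* sets cpu in sd_llc_shared->idle_cpus_span */

	/* scheduler tick: remove a busy CPU, rate-limited to once per tick */
	update_idle_cpumask(cpu, false);   /* clears cpu if it is not idle */

	/* task wakeup (select_idle_cpu): scan only the advertised idle CPUs */
	cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
	for_each_cpu_wrap(cpu, cpus, target) {
		/* ... pick the first suitable idle CPU ... */
	}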

v6->v7:
- place the whole idle cpumask mechanism under CONFIG_SMP.

v5->v6:
- decouple idle cpumask update from stop_tick signal, set idle CPU
  in idle cpumask every time the CPU enters idle

v4->v5:
- add update_idle_cpumask for s2idle case
- keep the same ordering of tick_nohz_idle_stop_tick() and update_
  idle_cpumask() everywhere

v3->v4:
- change setting idle cpumask from every idle entry to tickless idle
  if cpu driver is available.
- move clearing idle cpumask to scheduler_tick to decouple nohz mode.

v2->v3:
- change setting idle cpumask to every idle entry, otherwise schbench
  has a regression of 99th percentile latency.
- change clearing idle cpumask to nohz_balancer_kick(), so updating
  idle cpumask is ratelimited in the idle exiting path.
- set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target.

v1->v2:
- idle cpumask is updated in the nohz routines, by initializing idle
  cpumask with sched_domain_span(sd), nohz=off case remains the original
  behavior.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Qais Yousef <qais.yousef@arm.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Jiang Biao <benbjiang@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
---
 include/linux/sched/topology.h | 13 +++++++++
 kernel/sched/core.c            |  2 ++
 kernel/sched/fair.c            | 51 +++++++++++++++++++++++++++++++++-
 kernel/sched/idle.c            |  5 ++++
 kernel/sched/sched.h           |  4 +++
 kernel/sched/topology.c        |  3 +-
 6 files changed, 76 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 820511289857..b47b85163607 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,21 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+	/*
+	 * Span of all idle CPUs in this domain.
+	 *
+	 * NOTE: this field is variable length. (Allocated dynamically
+	 * by attaching extra space to the end of the structure,
+	 * depending on how many CPUs the kernel has booted up with)
+	 */
+	unsigned long	idle_cpus_span[];
 };
 
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+	return to_cpumask(sds->idle_cpus_span);
+}
+
 struct sched_domain {
 	/* These fields must be setup */
 	struct sched_domain __rcu *parent;	/* top domain must be null terminated */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c4da7e17b906..c4c51ff3402a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4011,6 +4011,7 @@ void scheduler_tick(void)
 
 #ifdef CONFIG_SMP
 	rq->idle_balance = idle_cpu(cpu);
+	update_idle_cpumask(cpu, false);
 	trigger_load_balance(rq);
 #endif
 }
@@ -7186,6 +7187,7 @@ void __init sched_init(void)
 		rq->idle_stamp = 0;
 		rq->avg_idle = 2*sysctl_sched_migration_cost;
 		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
+		rq->last_idle_state = 1;
 
 		INIT_LIST_HEAD(&rq->cfs_tasks);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c0c4d9ad7da8..7306f8886120 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6146,7 +6146,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
 	time = cpu_clock(this);
 
-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+	/*
+	 * sched_domain_shared is set only at shared cache level,
+	 * this works only because select_idle_cpu is called with
+	 * sd_llc.
+	 */
+	cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
 
 	for_each_cpu_wrap(cpu, cpus, target) {
 		if (!--nr)
@@ -6806,6 +6811,50 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	return newidle_balance(rq, rf) != 0;
 }
+
+/*
+ * Update cpu idle state and record this information
+ * in sd_llc_shared->idle_cpus_span.
+ */
+void update_idle_cpumask(int cpu, bool set_idle)
+{
+	struct sched_domain *sd;
+	struct rq *rq = cpu_rq(cpu);
+	int idle_state;
+
+	/*
+	 * If called from scheduler tick, only update
+	 * idle cpumask if the CPU is busy, as idle
+	 * cpumask is also updated on idle entry.
+	 *
+	 */
+	if (!set_idle && rq->idle_balance)
+		return;
+	/*
+	 * Also set SCHED_IDLE cpu in idle cpumask to
+	 * allow SCHED_IDLE cpu as a wakeup target
+	 */
+	idle_state = set_idle || sched_idle_cpu(cpu);
+	/*
+	 * No need to update idle cpumask if the state
+	 * does not change.
+	 */
+	if (rq->last_idle_state == idle_state)
+		return;
+	/*
+	 * Called with irq disabled, rcu_read_lock() is not needed.
+	 */
+	sd = rcu_dereference(per_cpu(sd_llc, cpu));
+	if (unlikely(!sd))
+		return;
+
+	if (idle_state)
+		cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
+	else
+		cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
+
+	rq->last_idle_state = idle_state;
+}
 #endif /* CONFIG_SMP */
 
 static unsigned long wakeup_gran(struct sched_entity *se)
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index f324dc36fc43..2c517d6a061a 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -257,6 +257,11 @@ static void do_idle(void)
 			cpuhp_report_idle_dead();
 			arch_cpu_idle_dead();
 		}
+		/*
+		 * The CPU is about to go idle, set it in idle cpumask
+		 * to be a wake up target.
+		 */
+		update_idle_cpumask(cpu, true);
 
 		arch_cpu_idle_enter();
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8d1ca65db3b0..2167ca48f3aa 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -976,6 +976,7 @@ struct rq {
 
 	unsigned char		nohz_idle_balance;
 	unsigned char		idle_balance;
+	unsigned char		last_idle_state;
 
 	unsigned long		misfit_task_load;
 
@@ -1516,6 +1517,8 @@ static inline unsigned int group_first_cpu(struct sched_group *group)
 
 extern int group_balance_cpu(struct sched_group *sg);
 
+void update_idle_cpumask(int cpu, bool set_idle);
+
 #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL)
 void register_sched_domain_sysctl(void);
 void dirty_sched_domain_sysctl(int cpu);
@@ -1536,6 +1539,7 @@ extern void flush_smp_call_function_from_idle(void);
 
 #else /* !CONFIG_SMP: */
 static inline void flush_smp_call_function_from_idle(void) { }
+static inline void update_idle_cpumask(int cpu, bool set_idle) { }
 #endif
 
 #include "stats.h"
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 1bd7e3af904f..541bd3a7de4d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1407,6 +1407,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
 		atomic_inc(&sd->shared->ref);
 		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+		cpumask_copy(sds_idle_cpus(sd->shared), sched_domain_span(sd));
 	}
 
 	sd->private = sdd;
@@ -1769,7 +1770,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 
 			*per_cpu_ptr(sdd->sd, j) = sd;
 
-			sds = kzalloc_node(sizeof(struct sched_domain_shared),
+			sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
 					GFP_KERNEL, cpu_to_node(j));
 			if (!sds)
 				return -ENOMEM;
-- 
2.25.1



* Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
  2020-12-09  6:24 [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup Aubrey Li
@ 2020-12-09  8:15 ` Vincent Guittot
  2020-12-09 10:58   ` Li, Aubrey
  2020-12-09 14:36 ` Mel Gorman
  1 sibling, 1 reply; 22+ messages in thread
From: Vincent Guittot @ 2020-12-09  8:15 UTC (permalink / raw)
  To: Aubrey Li
  Cc: mingo, peterz, juri.lelli, mgorman, valentin.schneider,
	qais.yousef, dietmar.eggemann, rostedt, bsegall, tim.c.chen,
	linux-kernel, Mel Gorman, Jiang Biao

On Wednesday 09 Dec 2020 at 14:24:04 (+0800), Aubrey Li wrote:
> Add idle cpumask to track idle cpus in sched domain. Every time
> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
> target. And if the CPU is not in idle, the CPU is cleared in idle
> cpumask during scheduler tick to ratelimit idle cpumask update.
> 
> When a task wakes up to select an idle cpu, scanning idle cpumask
> has lower cost than scanning all the cpus in last level cache domain,
> especially when the system is heavily loaded.
> 
> Benchmarks including hackbench, schbench, uperf, sysbench mysql
> and kbuild were tested on a x86 4 socket system with 24 cores per
> socket and 2 hyperthreads per core, total 192 CPUs, no regression
> found.
> 
> v6->v7:
> - place the whole idle cpumask mechanism under CONFIG_SMP.
> 
> v5->v6:
> - decouple idle cpumask update from stop_tick signal, set idle CPU
>   in idle cpumask every time the CPU enters idle
> 
> v4->v5:
> - add update_idle_cpumask for s2idle case
> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
>   idle_cpumask() everywhere
> 
> v3->v4:
> - change setting idle cpumask from every idle entry to tickless idle
>   if cpu driver is available.
> - move clearing idle cpumask to scheduler_tick to decouple nohz mode.
> 
> v2->v3:
> - change setting idle cpumask to every idle entry, otherwise schbench
>   has a regression of 99th percentile latency.
> - change clearing idle cpumask to nohz_balancer_kick(), so updating
>   idle cpumask is ratelimited in the idle exiting path.
> - set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target.
> 
> v1->v2:
> - idle cpumask is updated in the nohz routines, by initializing idle
>   cpumask with sched_domain_span(sd), nohz=off case remains the original
>   behavior.
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Qais Yousef <qais.yousef@arm.com>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Jiang Biao <benbjiang@gmail.com>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> ---
>  include/linux/sched/topology.h | 13 +++++++++
>  kernel/sched/core.c            |  2 ++
>  kernel/sched/fair.c            | 51 +++++++++++++++++++++++++++++++++-
>  kernel/sched/idle.c            |  5 ++++
>  kernel/sched/sched.h           |  4 +++
>  kernel/sched/topology.c        |  3 +-
>  6 files changed, 76 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 820511289857..b47b85163607 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>  	atomic_t	ref;
>  	atomic_t	nr_busy_cpus;
>  	int		has_idle_cores;
> +	/*
> +	 * Span of all idle CPUs in this domain.
> +	 *
> +	 * NOTE: this field is variable length. (Allocated dynamically
> +	 * by attaching extra space to the end of the structure,
> +	 * depending on how many CPUs the kernel has booted up with)
> +	 */
> +	unsigned long	idle_cpus_span[];
>  };
>  
> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
> +{
> +	return to_cpumask(sds->idle_cpus_span);
> +}
> +
>  struct sched_domain {
>  	/* These fields must be setup */
>  	struct sched_domain __rcu *parent;	/* top domain must be null terminated */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c4da7e17b906..c4c51ff3402a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4011,6 +4011,7 @@ void scheduler_tick(void)
>  
>  #ifdef CONFIG_SMP
>  	rq->idle_balance = idle_cpu(cpu);
> +	update_idle_cpumask(cpu, false);

Test rq->idle_balance here instead of adding the test in update_idle_cpumask(), which is only
relevant for this situation.

if (!rq->idle_balance)
    update_idle_cpumask(cpu, false);

>  	trigger_load_balance(rq);
>  #endif
>  }
> @@ -7186,6 +7187,7 @@ void __init sched_init(void)
>  		rq->idle_stamp = 0;
>  		rq->avg_idle = 2*sysctl_sched_migration_cost;
>  		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
> +		rq->last_idle_state = 1;
>  
>  		INIT_LIST_HEAD(&rq->cfs_tasks);
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c0c4d9ad7da8..7306f8886120 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6146,7 +6146,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  
>  	time = cpu_clock(this);
>  
> -	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +	/*
> +	 * sched_domain_shared is set only at shared cache level,
> +	 * this works only because select_idle_cpu is called with
> +	 * sd_llc.
> +	 */
> +	cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
>  
>  	for_each_cpu_wrap(cpu, cpus, target) {
>  		if (!--nr)
> @@ -6806,6 +6811,50 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  
>  	return newidle_balance(rq, rf) != 0;
>  }
> +
> +/*
> + * Update cpu idle state and record this information
> + * in sd_llc_shared->idle_cpus_span.
> + */
> +void update_idle_cpumask(int cpu, bool set_idle)
> +{
> +	struct sched_domain *sd;
> +	struct rq *rq = cpu_rq(cpu);
> +	int idle_state;
> +
> +	/*
> +	 * If called from scheduler tick, only update
> +	 * idle cpumask if the CPU is busy, as idle
> +	 * cpumask is also updated on idle entry.
> +	 *
> +	 */
> +	if (!set_idle && rq->idle_balance)
> +		return;
> +	/*
> +	 * Also set SCHED_IDLE cpu in idle cpumask to
> +	 * allow SCHED_IDLE cpu as a wakeup target
> +	 */
> +	idle_state = set_idle || sched_idle_cpu(cpu);
> +	/*
> +	 * No need to update idle cpumask if the state
> +	 * does not change.
> +	 */
> +	if (rq->last_idle_state == idle_state)
> +		return;
> +	/*
> +	 * Called with irq disabled, rcu_read_lock() is not needed.
> +	 */
> +	sd = rcu_dereference(per_cpu(sd_llc, cpu));
> +	if (unlikely(!sd))
> +		return;
> +
> +	if (idle_state)
> +		cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> +	else
> +		cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
> +
> +	rq->last_idle_state = idle_state;
> +}
>  #endif /* CONFIG_SMP */
>  
>  static unsigned long wakeup_gran(struct sched_entity *se)
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index f324dc36fc43..2c517d6a061a 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -257,6 +257,11 @@ static void do_idle(void)
>  			cpuhp_report_idle_dead();
>  			arch_cpu_idle_dead();
>  		}
> +		/*
> +		 * The CPU is about to go idle, set it in idle cpumask
> +		 * to be a wake up target.
> +		 */
> +		update_idle_cpumask(cpu, true);
>  
>  		arch_cpu_idle_enter();
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 8d1ca65db3b0..2167ca48f3aa 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -976,6 +976,7 @@ struct rq {
>  
>  	unsigned char		nohz_idle_balance;
>  	unsigned char		idle_balance;
> +	unsigned char		last_idle_state;
>  
>  	unsigned long		misfit_task_load;
>  
> @@ -1516,6 +1517,8 @@ static inline unsigned int group_first_cpu(struct sched_group *group)
>  
>  extern int group_balance_cpu(struct sched_group *sg);
>  
> +void update_idle_cpumask(int cpu, bool set_idle);
> +
>  #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL)
>  void register_sched_domain_sysctl(void);
>  void dirty_sched_domain_sysctl(int cpu);
> @@ -1536,6 +1539,7 @@ extern void flush_smp_call_function_from_idle(void);
>  
>  #else /* !CONFIG_SMP: */
>  static inline void flush_smp_call_function_from_idle(void) { }
> +static inline void update_idle_cpumask(int cpu, bool set_idle) { }
>  #endif
>  
>  #include "stats.h"
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 1bd7e3af904f..541bd3a7de4d 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1407,6 +1407,7 @@ sd_init(struct sched_domain_topology_level *tl,
>  		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
>  		atomic_inc(&sd->shared->ref);
>  		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> +		cpumask_copy(sds_idle_cpus(sd->shared), sched_domain_span(sd));
>  	}
>  
>  	sd->private = sdd;
> @@ -1769,7 +1770,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
>  
>  			*per_cpu_ptr(sdd->sd, j) = sd;
>  
> -			sds = kzalloc_node(sizeof(struct sched_domain_shared),
> +			sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
>  					GFP_KERNEL, cpu_to_node(j));
>  			if (!sds)
>  				return -ENOMEM;
> -- 
> 2.25.1
> 


* Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
  2020-12-09  8:15 ` Vincent Guittot
@ 2020-12-09 10:58   ` Li, Aubrey
  2020-12-09 13:09     ` Vincent Guittot
  0 siblings, 1 reply; 22+ messages in thread
From: Li, Aubrey @ 2020-12-09 10:58 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, mgorman, valentin.schneider,
	qais.yousef, dietmar.eggemann, rostedt, bsegall, tim.c.chen,
	linux-kernel, Mel Gorman, Jiang Biao

On 2020/12/9 16:15, Vincent Guittot wrote:
> On Wednesday 09 Dec 2020 at 14:24:04 (+0800), Aubrey Li wrote:
>> Add idle cpumask to track idle cpus in sched domain. Every time
>> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
>> target. And if the CPU is not in idle, the CPU is cleared in idle
>> cpumask during scheduler tick to ratelimit idle cpumask update.
>>
>> When a task wakes up to select an idle cpu, scanning idle cpumask
>> has lower cost than scanning all the cpus in last level cache domain,
>> especially when the system is heavily loaded.
>>
>> Benchmarks including hackbench, schbench, uperf, sysbench mysql
>> and kbuild were tested on a x86 4 socket system with 24 cores per
>> socket and 2 hyperthreads per core, total 192 CPUs, no regression
>> found.
>>
>> v6->v7:
>> - place the whole idle cpumask mechanism under CONFIG_SMP.
>>
>> v5->v6:
>> - decouple idle cpumask update from stop_tick signal, set idle CPU
>>   in idle cpumask every time the CPU enters idle
>>
>> v4->v5:
>> - add update_idle_cpumask for s2idle case
>> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
>>   idle_cpumask() everywhere
>>
>> v3->v4:
>> - change setting idle cpumask from every idle entry to tickless idle
>>   if cpu driver is available.
>> - move clearing idle cpumask to scheduler_tick to decouple nohz mode.
>>
>> v2->v3:
>> - change setting idle cpumask to every idle entry, otherwise schbench
>>   has a regression of 99th percentile latency.
>> - change clearing idle cpumask to nohz_balancer_kick(), so updating
>>   idle cpumask is ratelimited in the idle exiting path.
>> - set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target.
>>
>> v1->v2:
>> - idle cpumask is updated in the nohz routines, by initializing idle
>>   cpumask with sched_domain_span(sd), nohz=off case remains the original
>>   behavior.
>>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>> Cc: Qais Yousef <qais.yousef@arm.com>
>> Cc: Valentin Schneider <valentin.schneider@arm.com>
>> Cc: Jiang Biao <benbjiang@gmail.com>
>> Cc: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
>> ---
>>  include/linux/sched/topology.h | 13 +++++++++
>>  kernel/sched/core.c            |  2 ++
>>  kernel/sched/fair.c            | 51 +++++++++++++++++++++++++++++++++-
>>  kernel/sched/idle.c            |  5 ++++
>>  kernel/sched/sched.h           |  4 +++
>>  kernel/sched/topology.c        |  3 +-
>>  6 files changed, 76 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index 820511289857..b47b85163607 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>>  	atomic_t	ref;
>>  	atomic_t	nr_busy_cpus;
>>  	int		has_idle_cores;
>> +	/*
>> +	 * Span of all idle CPUs in this domain.
>> +	 *
>> +	 * NOTE: this field is variable length. (Allocated dynamically
>> +	 * by attaching extra space to the end of the structure,
>> +	 * depending on how many CPUs the kernel has booted up with)
>> +	 */
>> +	unsigned long	idle_cpus_span[];
>>  };
>>  
>> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
>> +{
>> +	return to_cpumask(sds->idle_cpus_span);
>> +}
>> +
>>  struct sched_domain {
>>  	/* These fields must be setup */
>>  	struct sched_domain __rcu *parent;	/* top domain must be null terminated */
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index c4da7e17b906..c4c51ff3402a 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -4011,6 +4011,7 @@ void scheduler_tick(void)
>>  
>>  #ifdef CONFIG_SMP
>>  	rq->idle_balance = idle_cpu(cpu);
>> +	update_idle_cpumask(cpu, false);
> 
> Test rq->idle_balance here instead of adding the test in update_idle_cpumask which is only
> relevant for this situation.

If called from the idle path, set_idle is true, so !set_idle is false and rq->idle_balance won't actually be tested.

	if (!set_idle && rq->idle_balance)
		return;

So is it okay to leave it here to keep scheduler_tick a bit concise?

Thanks,
-Aubrey


* Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
  2020-12-09 10:58   ` Li, Aubrey
@ 2020-12-09 13:09     ` Vincent Guittot
  2020-12-09 14:53       ` Li, Aubrey
  0 siblings, 1 reply; 22+ messages in thread
From: Vincent Guittot @ 2020-12-09 13:09 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Mel Gorman,
	Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, linux-kernel, Mel Gorman,
	Jiang Biao

On Wed, 9 Dec 2020 at 11:58, Li, Aubrey <aubrey.li@linux.intel.com> wrote:
>
> On 2020/12/9 16:15, Vincent Guittot wrote:
> > On Wednesday 09 Dec 2020 at 14:24:04 (+0800), Aubrey Li wrote:
> >> Add idle cpumask to track idle cpus in sched domain. Every time
> >> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
> >> target. And if the CPU is not in idle, the CPU is cleared in idle
> >> cpumask during scheduler tick to ratelimit idle cpumask update.
> >>
> >> When a task wakes up to select an idle cpu, scanning idle cpumask
> >> has lower cost than scanning all the cpus in last level cache domain,
> >> especially when the system is heavily loaded.
> >>
> >> Benchmarks including hackbench, schbench, uperf, sysbench mysql
> >> and kbuild were tested on a x86 4 socket system with 24 cores per
> >> socket and 2 hyperthreads per core, total 192 CPUs, no regression
> >> found.
> >>
> >> v6->v7:
> >> - place the whole idle cpumask mechanism under CONFIG_SMP.
> >>
> >> v5->v6:
> >> - decouple idle cpumask update from stop_tick signal, set idle CPU
> >>   in idle cpumask every time the CPU enters idle
> >>
> >> v4->v5:
> >> - add update_idle_cpumask for s2idle case
> >> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
> >>   idle_cpumask() everywhere
> >>
> >> v3->v4:
> >> - change setting idle cpumask from every idle entry to tickless idle
> >>   if cpu driver is available.
> >> - move clearing idle cpumask to scheduler_tick to decouple nohz mode.
> >>
> >> v2->v3:
> >> - change setting idle cpumask to every idle entry, otherwise schbench
> >>   has a regression of 99th percentile latency.
> >> - change clearing idle cpumask to nohz_balancer_kick(), so updating
> >>   idle cpumask is ratelimited in the idle exiting path.
> >> - set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target.
> >>
> >> v1->v2:
> >> - idle cpumask is updated in the nohz routines, by initializing idle
> >>   cpumask with sched_domain_span(sd), nohz=off case remains the original
> >>   behavior.
> >>
> >> Cc: Peter Zijlstra <peterz@infradead.org>
> >> Cc: Mel Gorman <mgorman@suse.de>
> >> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> >> Cc: Qais Yousef <qais.yousef@arm.com>
> >> Cc: Valentin Schneider <valentin.schneider@arm.com>
> >> Cc: Jiang Biao <benbjiang@gmail.com>
> >> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> >> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> >> ---
> >>  include/linux/sched/topology.h | 13 +++++++++
> >>  kernel/sched/core.c            |  2 ++
> >>  kernel/sched/fair.c            | 51 +++++++++++++++++++++++++++++++++-
> >>  kernel/sched/idle.c            |  5 ++++
> >>  kernel/sched/sched.h           |  4 +++
> >>  kernel/sched/topology.c        |  3 +-
> >>  6 files changed, 76 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> >> index 820511289857..b47b85163607 100644
> >> --- a/include/linux/sched/topology.h
> >> +++ b/include/linux/sched/topology.h
> >> @@ -65,8 +65,21 @@ struct sched_domain_shared {
> >>      atomic_t        ref;
> >>      atomic_t        nr_busy_cpus;
> >>      int             has_idle_cores;
> >> +    /*
> >> +     * Span of all idle CPUs in this domain.
> >> +     *
> >> +     * NOTE: this field is variable length. (Allocated dynamically
> >> +     * by attaching extra space to the end of the structure,
> >> +     * depending on how many CPUs the kernel has booted up with)
> >> +     */
> >> +    unsigned long   idle_cpus_span[];
> >>  };
> >>
> >> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
> >> +{
> >> +    return to_cpumask(sds->idle_cpus_span);
> >> +}
> >> +
> >>  struct sched_domain {
> >>      /* These fields must be setup */
> >>      struct sched_domain __rcu *parent;      /* top domain must be null terminated */
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index c4da7e17b906..c4c51ff3402a 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -4011,6 +4011,7 @@ void scheduler_tick(void)
> >>
> >>  #ifdef CONFIG_SMP
> >>      rq->idle_balance = idle_cpu(cpu);
> >> +    update_idle_cpumask(cpu, false);
> >
> > Test rq->idle_balance here instead of adding the test in update_idle_cpumask which is only
> > relevant for this situation.
>
> If called from idle path, because !set_idle is false, rq->idle_balance won't be tested actually.
>
>         if (!set_idle && rq->idle_balance)
>                 return;
>
> So is it okay to leave it here to keep scheduler_tick a bit concise?

I don't like having a tick-specific condition in a generic function;
rq->idle_balance is only relevant in this case.

Calling update_idle_cpumask(rq->idle_balance) in scheduler_tick()
should do the job, and we can then remove the check of rq->idle_balance
in update_idle_cpumask().

In case scheduler_tick() is called while idle, we will only test
(rq->last_idle_state == idle_state) and return.
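
For reference, a rough sketch of what that could look like (an editor's
illustration based on the v7 code above, not the actual v8 patch):

	/* kernel/sched/core.c: scheduler_tick() */
	rq->idle_balance = idle_cpu(cpu);
	update_idle_cpumask(cpu, rq->idle_balance);

	/* kernel/sched/fair.c: drop the tick-specific rq->idle_balance check */
	void update_idle_cpumask(int cpu, bool set_idle)
	{
		struct sched_domain *sd;
		struct rq *rq = cpu_rq(cpu);
		int idle_state = set_idle || sched_idle_cpu(cpu);

		if (rq->last_idle_state == idle_state)
			return;

		sd = rcu_dereference(per_cpu(sd_llc, cpu));
		if (unlikely(!sd))
			return;

		if (idle_state)
			cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
		else
			cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));

		rq->last_idle_state = idle_state;
	}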

>
> Thanks,
> -Aubrey


* Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
  2020-12-09  6:24 [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup Aubrey Li
  2020-12-09  8:15 ` Vincent Guittot
@ 2020-12-09 14:36 ` Mel Gorman
  2020-12-10  8:23   ` Li, Aubrey
  1 sibling, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2020-12-09 14:36 UTC (permalink / raw)
  To: Aubrey Li
  Cc: mingo, peterz, juri.lelli, vincent.guittot, valentin.schneider,
	qais.yousef, dietmar.eggemann, rostedt, bsegall, tim.c.chen,
	linux-kernel, Mel Gorman, Jiang Biao

On Wed, Dec 09, 2020 at 02:24:04PM +0800, Aubrey Li wrote:
> Add idle cpumask to track idle cpus in sched domain. Every time
> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
> target. And if the CPU is not in idle, the CPU is cleared in idle
> cpumask during scheduler tick to ratelimit idle cpumask update.
> 
> When a task wakes up to select an idle cpu, scanning idle cpumask
> has lower cost than scanning all the cpus in last level cache domain,
> especially when the system is heavily loaded.
> 
> Benchmarks including hackbench, schbench, uperf, sysbench mysql
> and kbuild were tested on a x86 4 socket system with 24 cores per
> socket and 2 hyperthreads per core, total 192 CPUs, no regression
> found.
> 

I ran this patch with tbench on top of the schedstat patches that
track SIS efficiency. The tracking adds overhead, so it's not a perfect
performance comparison, but the expectation would be that the patch
reduces the number of runqueues that are scanned.

tbench4
                          5.10.0-rc6             5.10.0-rc6
                      schedstat-v1r1          idlemask-v7r1
Hmean     1        504.76 (   0.00%)      500.14 *  -0.91%*
Hmean     2       1001.22 (   0.00%)      970.37 *  -3.08%*
Hmean     4       1930.56 (   0.00%)     1880.96 *  -2.57%*
Hmean     8       3688.05 (   0.00%)     3537.72 *  -4.08%*
Hmean     16      6352.71 (   0.00%)     6439.53 *   1.37%*
Hmean     32     10066.37 (   0.00%)    10124.65 *   0.58%*
Hmean     64     12846.32 (   0.00%)    11627.27 *  -9.49%*
Hmean     128    22278.41 (   0.00%)    22304.33 *   0.12%*
Hmean     256    21455.52 (   0.00%)    20900.13 *  -2.59%*
Hmean     320    21802.38 (   0.00%)    21928.81 *   0.58%*

Not a very optimistic result. The schedstats indicate:

                                5.10.0-rc6     5.10.0-rc6
                            schedstat-v1r1  idlemask-v7r1
Ops TTWU Count               5599714302.00  5589495123.00
Ops TTWU Local               2687713250.00  2563662550.00
Ops SIS Search               5596677950.00  5586381168.00
Ops SIS Domain Search        3268344934.00  3229088045.00
Ops SIS Scanned             15909069113.00 16568899405.00
Ops SIS Domain Scanned      13580736097.00 14211606282.00
Ops SIS Failures             2944874939.00  2843113421.00
Ops SIS Core Search           262853975.00   311781774.00
Ops SIS Core Hit              185189656.00   216097102.00
Ops SIS Core Miss              77664319.00    95684672.00
Ops SIS Recent Used Hit       124265515.00   146021086.00
Ops SIS Recent Used Miss      338142547.00   403547579.00
Ops SIS Recent Attempts       462408062.00   549568665.00
Ops SIS Search Efficiency            35.18          33.72
Ops SIS Domain Search Eff            24.07          22.72
Ops SIS Fast Success Rate            41.60          42.20
Ops SIS Success Rate                 47.38          49.11
Ops SIS Recent Success Rate          26.87          26.57

The field I would expect to decrease is SIS Domain Scanned -- the number
of runqueues that were examined -- but it's actually worse, and graphing
over time shows it's worse across the client thread counts.
select_idle_cpu() is definitely being called, because "Domain Search" is
10 times higher than "Core Search" and "Core Miss" is non-zero there.

I suspect the issue is that the mask is only marked busy from the tick
context, which is a very wide window. If select_idle_cpu() picks an idle
CPU from the mask, it's still marked as idle in the mask.
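
To illustrate the window Mel describes (an editor's sketch, assuming the
4 ms tick mentioned later in the thread):

	/*
	 * t0: CPU4 enters do_idle()         -> bit 4 set in idle_cpus_span
	 * t1: wakeup on CPU0, select_idle_cpu() picks CPU4 from the mask
	 * t2: CPU4 runs the woken task, but bit 4 is still set
	 * t3: further wakeups keep seeing CPU4 as "idle" in the mask
	 * t4: next scheduler_tick() on CPU4  -> update_idle_cpumask(cpu, false)
	 *     finally clears bit 4
	 *
	 * With a 4 ms tick, the t2..t4 window can cover on the order of a
	 * thousand 0.004 ms wakeup round trips of a communicating workload.
	 */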

-- 
Mel Gorman
SUSE Labs


* Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
  2020-12-09 13:09     ` Vincent Guittot
@ 2020-12-09 14:53       ` Li, Aubrey
  0 siblings, 0 replies; 22+ messages in thread
From: Li, Aubrey @ 2020-12-09 14:53 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Mel Gorman,
	Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, linux-kernel, Mel Gorman,
	Jiang Biao

On 2020/12/9 21:09, Vincent Guittot wrote:
> On Wed, 9 Dec 2020 at 11:58, Li, Aubrey <aubrey.li@linux.intel.com> wrote:
>>
>> On 2020/12/9 16:15, Vincent Guittot wrote:
>>> On Wednesday 09 Dec 2020 at 14:24:04 (+0800), Aubrey Li wrote:
>>>> Add idle cpumask to track idle cpus in sched domain. Every time
>>>> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
>>>> target. And if the CPU is not in idle, the CPU is cleared in idle
>>>> cpumask during scheduler tick to ratelimit idle cpumask update.
>>>>
>>>> When a task wakes up to select an idle cpu, scanning idle cpumask
>>>> has lower cost than scanning all the cpus in last level cache domain,
>>>> especially when the system is heavily loaded.
>>>>
>>>> Benchmarks including hackbench, schbench, uperf, sysbench mysql
>>>> and kbuild were tested on a x86 4 socket system with 24 cores per
>>>> socket and 2 hyperthreads per core, total 192 CPUs, no regression
>>>> found.
>>>>
>>>> v6->v7:
>>>> - place the whole idle cpumask mechanism under CONFIG_SMP.
>>>>
>>>> v5->v6:
>>>> - decouple idle cpumask update from stop_tick signal, set idle CPU
>>>>   in idle cpumask every time the CPU enters idle
>>>>
>>>> v4->v5:
>>>> - add update_idle_cpumask for s2idle case
>>>> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
>>>>   idle_cpumask() everywhere
>>>>
>>>> v3->v4:
>>>> - change setting idle cpumask from every idle entry to tickless idle
>>>>   if cpu driver is available.
>>>> - move clearing idle cpumask to scheduler_tick to decouple nohz mode.
>>>>
>>>> v2->v3:
>>>> - change setting idle cpumask to every idle entry, otherwise schbench
>>>>   has a regression of 99th percentile latency.
>>>> - change clearing idle cpumask to nohz_balancer_kick(), so updating
>>>>   idle cpumask is ratelimited in the idle exiting path.
>>>> - set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target.
>>>>
>>>> v1->v2:
>>>> - idle cpumask is updated in the nohz routines, by initializing idle
>>>>   cpumask with sched_domain_span(sd), nohz=off case remains the original
>>>>   behavior.
>>>>
>>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>>> Cc: Mel Gorman <mgorman@suse.de>
>>>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>>>> Cc: Qais Yousef <qais.yousef@arm.com>
>>>> Cc: Valentin Schneider <valentin.schneider@arm.com>
>>>> Cc: Jiang Biao <benbjiang@gmail.com>
>>>> Cc: Tim Chen <tim.c.chen@linux.intel.com>
>>>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
>>>> ---
>>>>  include/linux/sched/topology.h | 13 +++++++++
>>>>  kernel/sched/core.c            |  2 ++
>>>>  kernel/sched/fair.c            | 51 +++++++++++++++++++++++++++++++++-
>>>>  kernel/sched/idle.c            |  5 ++++
>>>>  kernel/sched/sched.h           |  4 +++
>>>>  kernel/sched/topology.c        |  3 +-
>>>>  6 files changed, 76 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>>>> index 820511289857..b47b85163607 100644
>>>> --- a/include/linux/sched/topology.h
>>>> +++ b/include/linux/sched/topology.h
>>>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>>>>      atomic_t        ref;
>>>>      atomic_t        nr_busy_cpus;
>>>>      int             has_idle_cores;
>>>> +    /*
>>>> +     * Span of all idle CPUs in this domain.
>>>> +     *
>>>> +     * NOTE: this field is variable length. (Allocated dynamically
>>>> +     * by attaching extra space to the end of the structure,
>>>> +     * depending on how many CPUs the kernel has booted up with)
>>>> +     */
>>>> +    unsigned long   idle_cpus_span[];
>>>>  };
>>>>
>>>> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
>>>> +{
>>>> +    return to_cpumask(sds->idle_cpus_span);
>>>> +}
>>>> +
>>>>  struct sched_domain {
>>>>      /* These fields must be setup */
>>>>      struct sched_domain __rcu *parent;      /* top domain must be null terminated */
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index c4da7e17b906..c4c51ff3402a 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -4011,6 +4011,7 @@ void scheduler_tick(void)
>>>>
>>>>  #ifdef CONFIG_SMP
>>>>      rq->idle_balance = idle_cpu(cpu);
>>>> +    update_idle_cpumask(cpu, false);
>>>
>>> Test rq->idle_balance here instead of adding the test in update_idle_cpumask which is only
>>> relevant for this situation.
>>
>> If called from idle path, because !set_idle is false, rq->idle_balance won't be tested actually.
>>
>>         if (!set_idle && rq->idle_balance)
>>                 return;
>>
>> So is it okay to leave it here to keep scheduler_tick a bit concise?
> 
> I don't like having a tick specific condition in a generic function.
> rq->idle_balance is only relevant in this case
> 
> calling update_idle_cpumask(rq->idle_balance) in scheduler_tick()
> should do the job and we can remove the check of rq->idle_balance in
> update_idle_cpumask()
> 
> In case of scheduler_tick() called when idle , we will only test if
> (rq->last_idle_state == idle_state) and return
> 

I see, will come up with a v8 soon.

Thanks,
-Aubrey




* Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
  2020-12-09 14:36 ` Mel Gorman
@ 2020-12-10  8:23   ` Li, Aubrey
  2020-12-10 11:34     ` Mel Gorman
  0 siblings, 1 reply; 22+ messages in thread
From: Li, Aubrey @ 2020-12-10  8:23 UTC (permalink / raw)
  To: Mel Gorman
  Cc: mingo, peterz, juri.lelli, vincent.guittot, valentin.schneider,
	qais.yousef, dietmar.eggemann, rostedt, bsegall, tim.c.chen,
	linux-kernel, Mel Gorman, Jiang Biao

Hi Mel,

On 2020/12/9 22:36, Mel Gorman wrote:
> On Wed, Dec 09, 2020 at 02:24:04PM +0800, Aubrey Li wrote:
>> Add idle cpumask to track idle cpus in sched domain. Every time
>> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
>> target. And if the CPU is not in idle, the CPU is cleared in idle
>> cpumask during scheduler tick to ratelimit idle cpumask update.
>>
>> When a task wakes up to select an idle cpu, scanning idle cpumask
>> has lower cost than scanning all the cpus in last level cache domain,
>> especially when the system is heavily loaded.
>>
>> Benchmarks including hackbench, schbench, uperf, sysbench mysql
>> and kbuild were tested on a x86 4 socket system with 24 cores per
>> socket and 2 hyperthreads per core, total 192 CPUs, no regression
>> found.
>>
> 
> I ran this patch with tbench on top of of the schedstat patches that
> track SIS efficiency. The tracking adds overhead so it's not a perfect
> performance comparison but the expectation would be that the patch reduces
> the number of runqueues that are scanned

Thanks for the measurement! I don't usually play with tbench, so I may
need a while to digest the data.

> 
> tbench4
>                           5.10.0-rc6             5.10.0-rc6
>                       schedstat-v1r1          idlemask-v7r1
> Hmean     1        504.76 (   0.00%)      500.14 *  -0.91%*
> Hmean     2       1001.22 (   0.00%)      970.37 *  -3.08%*
> Hmean     4       1930.56 (   0.00%)     1880.96 *  -2.57%*
> Hmean     8       3688.05 (   0.00%)     3537.72 *  -4.08%*
> Hmean     16      6352.71 (   0.00%)     6439.53 *   1.37%*
> Hmean     32     10066.37 (   0.00%)    10124.65 *   0.58%*
> Hmean     64     12846.32 (   0.00%)    11627.27 *  -9.49%*
> Hmean     128    22278.41 (   0.00%)    22304.33 *   0.12%*
> Hmean     256    21455.52 (   0.00%)    20900.13 *  -2.59%*
> Hmean     320    21802.38 (   0.00%)    21928.81 *   0.58%*
> 
> Not very optimistic result. The schedstats indicate;

With how many client threads were the following schedstats collected?

> 
>                                 5.10.0-rc6     5.10.0-rc6
>                             schedstat-v1r1  idlemask-v7r1
> Ops TTWU Count               5599714302.00  5589495123.00
> Ops TTWU Local               2687713250.00  2563662550.00
> Ops SIS Search               5596677950.00  5586381168.00
> Ops SIS Domain Search        3268344934.00  3229088045.00
> Ops SIS Scanned             15909069113.00 16568899405.00
> Ops SIS Domain Scanned      13580736097.00 14211606282.00
> Ops SIS Failures             2944874939.00  2843113421.00
> Ops SIS Core Search           262853975.00   311781774.00
> Ops SIS Core Hit              185189656.00   216097102.00
> Ops SIS Core Miss              77664319.00    95684672.00
> Ops SIS Recent Used Hit       124265515.00   146021086.00
> Ops SIS Recent Used Miss      338142547.00   403547579.00
> Ops SIS Recent Attempts       462408062.00   549568665.00
> Ops SIS Search Efficiency            35.18          33.72
> Ops SIS Domain Search Eff            24.07          22.72
> Ops SIS Fast Success Rate            41.60          42.20
> Ops SIS Success Rate                 47.38          49.11
> Ops SIS Recent Success Rate          26.87          26.57
> 
> The field I would expect to decrease is SIS Domain Scanned -- the number
> of runqueues that were examined but it's actually worse and graphing over
> time shows it's worse for the client thread counts.  select_idle_cpu()
> is definitely being called because "Domain Search" is 10 times higher than
> "Core Search" and there "Core Miss" is non-zero.

Why would SIS Domain Scanned be decreased?

I thought SIS Scanned was supposed to decrease, but it seems it did not on your side.

I printed some trace logs on my side with a uperf workload, and it looks correct.
To make the log easy to read, I started a 4-vCPU VM and ran uperf with 8 threads for 2 seconds.

stage 1: system idle, update_idle_cpumask is called from idle thread, set cpumask to 0-3
========================================================================================
          <idle>-0       [002] d..1   137.408681: update_idle_cpumask: set_idle-1, cpumask: 2
          <idle>-0       [000] d..1   137.408713: update_idle_cpumask: set_idle-1, cpumask: 0,2
          <idle>-0       [003] d..1   137.408924: update_idle_cpumask: set_idle-1, cpumask: 0,2-3
          <idle>-0       [001] d..1   137.409035: update_idle_cpumask: set_idle-1, cpumask: 0-3

stage 2: uperf ramp up, cpumask changes back and forth
========================================================
           uperf-561     [003] d..3   137.410620: select_task_rq_fair: scanning: 0-3
           uperf-560     [000] d..5   137.411384: select_task_rq_fair: scanning: 0-3
    kworker/u8:3-110     [000] d..4   137.411436: select_task_rq_fair: scanning: 0-3
           uperf-560     [000] d.h1   137.412562: update_idle_cpumask: set_idle-0, cpumask: 1-3
           uperf-570     [002] d.h2   137.412580: update_idle_cpumask: set_idle-0, cpumask: 1,3
          <idle>-0       [002] d..1   137.412917: update_idle_cpumask: set_idle-1, cpumask: 1-3
          <idle>-0       [000] d..1   137.413004: update_idle_cpumask: set_idle-1, cpumask: 0-3
           uperf-560     [000] d..5   137.415856: select_task_rq_fair: scanning: 0-3
    kworker/u8:3-110     [001] d..4   137.415956: select_task_rq_fair: scanning: 0-3
            sshd-504     [003] d.h1   137.416562: update_idle_cpumask: set_idle-0, cpumask: 0-2
           uperf-560     [000] d.h1   137.416598: update_idle_cpumask: set_idle-0, cpumask: 1-2
          <idle>-0       [003] d..1   137.416638: update_idle_cpumask: set_idle-1, cpumask: 1-3
          <idle>-0       [000] d..1   137.417076: update_idle_cpumask: set_idle-1, cpumask: 0-3
    tmux: server-528     [001] d.h.   137.448566: update_idle_cpumask: set_idle-0, cpumask: 0,2-3
          <idle>-0       [001] d..1   137.448980: update_idle_cpumask: set_idle-1, cpumask: 0-3

stage 3: uperf running, select_idle_cpu scan all the CPUs in the scheduler domain at the beginning
===================================================================================================
           uperf-560     [000] d..3   138.418494: select_task_rq_fair: scanning: 0-3
           uperf-560     [000] d..3   138.418506: select_task_rq_fair: scanning: 0-3
           uperf-560     [000] d..3   138.418514: select_task_rq_fair: scanning: 0-3
           uperf-560     [000] dN.3   138.418534: select_task_rq_fair: scanning: 0-3
           uperf-560     [000] dN.3   138.418543: select_task_rq_fair: scanning: 0-3
           uperf-560     [000] dN.3   138.418551: select_task_rq_fair: scanning: 0-3
           uperf-561     [003] d..3   138.418577: select_task_rq_fair: scanning: 0-3
           uperf-561     [003] d..3   138.418600: select_task_rq_fair: scanning: 0-3
           uperf-561     [003] d..3   138.418617: select_task_rq_fair: scanning: 0-3
           uperf-561     [003] d..3   138.418640: select_task_rq_fair: scanning: 0-3
           uperf-561     [003] d..3   138.418652: select_task_rq_fair: scanning: 0-3
           uperf-561     [003] d..3   138.418662: select_task_rq_fair: scanning: 0-3
           uperf-561     [003] d..3   138.418672: select_task_rq_fair: scanning: 0-3
           uperf-560     [000] d..5   138.418676: select_task_rq_fair: scanning: 0-3
           uperf-561     [003] d..3   138.418693: select_task_rq_fair: scanning: 0-3
    kworker/u8:3-110     [002] d..4   138.418746: select_task_rq_fair: scanning: 0-3

stage 4: scheduler tick comes, update idle cpumask to EMPTY
============================================================
           uperf-572     [002] d.h.   139.420568: update_idle_cpumask: set_idle-0, cpumask: 1,3
           uperf-574     [000] d.H2   139.420568: update_idle_cpumask: set_idle-0, cpumask: 1,3
           uperf-565     [003] d.H6   139.420569: update_idle_cpumask: set_idle-0, cpumask: 1
    tmux: server-528     [001] d.h2   139.420572: update_idle_cpumask: set_idle-0, cpumask: 


stage 5: uperf continue running, select_idle_cpu does not scan idle cpu
========================================================================
<I only run 2 seconds uperf, during this two seconds, no idle cpu in cpumask to scan>

           uperf-565     [003] d.sa   139.420587: select_task_rq_fair: scanning: 
           uperf-572     [002] d.sa   139.420670: select_task_rq_fair: scanning: 
	   ............
           uperf-561     [003] d..5   141.421620: select_task_rq_fair: scanning: 
           uperf-571     [001] d.sa   141.421630: select_task_rq_fair: scanning: 

stage 6: uperf benchmark finished, idle thread switches in
=========================================================

          <idle>-0       [002] d..1   141.421631: update_idle_cpumask: set_idle-1, cpumask: 2
          <idle>-0       [000] d..1   141.421654: update_idle_cpumask: set_idle-1, cpumask: 0,2
          <idle>-0       [001] d..1   141.421665: update_idle_cpumask: set_idle-1, cpumask: 0-2
           uperf-561     [003] d..5   141.421712: select_task_rq_fair: scanning: 0-2
          <idle>-0       [003] d..1   141.421807: update_idle_cpumask: set_idle-1, cpumask: 0-3


stage 7: uperf ramp down
==========================
           uperf-560     [000] d..5   141.423075: select_task_rq_fair: scanning: 0-3
           uperf-560     [000] d..5   141.423107: select_task_rq_fair: scanning: 0-3
           uperf-560     [000] d..5   141.423259: select_task_rq_fair: scanning: 0-3
    tmux: server-528     [002] d..5   141.730489: select_task_rq_fair: scanning: 1-3
    kworker/u8:1-96      [003] d..4   141.731924: select_task_rq_fair: scanning: 1-3
          <idle>-0       [000] d..1   141.734560: update_idle_cpumask: set_idle-1, cpumask: 0-3
    tmux: server-528     [002] d.h1   141.736568: update_idle_cpumask: set_idle-0, cpumask: 0-1
        uperf.sh-558     [003] d.h.   141.736570: update_idle_cpumask: set_idle-0, cpumask: 0-1
          <idle>-0       [002] d..1   141.736718: update_idle_cpumask: set_idle-1, cpumask: 0-2
          <idle>-0       [003] d..1   141.738179: update_idle_cpumask: set_idle-1, cpumask: 0-3
           pkill-578     [001] d.h1   141.740569: update_idle_cpumask: set_idle-0, cpumask: 0,2-3
          <idle>-0       [001] d..1   141.741875: update_idle_cpumask: set_idle-1, cpumask: 0-3
           pkill-578     [001] d.h.   141.744570: update_idle_cpumask: set_idle-0, cpumask: 0,2-3
           pkill-578     [001] d..6   141.770012: select_task_rq_fair: scanning: 0,2-3
          <idle>-0       [001] d..1   141.770938: update_idle_cpumask: set_idle-1, cpumask: 0-3

In this case, SIS Scanned should decrease. I'll apply your schedstat patch to see if
the data matches.

> 
> I suspect the issue is that the mask is only marked busy from the tick
> context which is a very wide window. If select_idle_cpu() picks an idle
> CPU from the mask, it's still marked as idle in the mask.
> 
That should be fine because we still check available_idle_cpu() and sched_idle_cpu() for the selected
CPU. And even if that CPU is marked as idle in the mask, that mask should not be worse (bigger) than
the default sched_domain_span(sd).
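
For reference, the check Aubrey refers to is the candidate validation in
the select_idle_cpu() scan loop, roughly (abbreviated from the v5.10-era
code; not part of the patch):

	for_each_cpu_wrap(cpu, cpus, target) {
		if (!--nr)
			return -1;
		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
			break;
	}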

Thanks,
-Aubrey


* Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
  2020-12-10  8:23   ` Li, Aubrey
@ 2020-12-10 11:34     ` Mel Gorman
  2020-12-10 12:21       ` Li, Aubrey
  2020-12-14  7:53       ` Li, Aubrey
  0 siblings, 2 replies; 22+ messages in thread
From: Mel Gorman @ 2020-12-10 11:34 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: mingo, peterz, juri.lelli, vincent.guittot, valentin.schneider,
	qais.yousef, dietmar.eggemann, rostedt, bsegall, tim.c.chen,
	linux-kernel, Mel Gorman, Jiang Biao

On Thu, Dec 10, 2020 at 04:23:47PM +0800, Li, Aubrey wrote:
> > I ran this patch with tbench on top of of the schedstat patches that
> > track SIS efficiency. The tracking adds overhead so it's not a perfect
> > performance comparison but the expectation would be that the patch reduces
> > the number of runqueues that are scanned
> 
> Thanks for the measurement! I don't play with tbench so may need a while
> to digest the data.
> 

The key point is that it appears the idle mask was mostly equivalent to
the full domain mask, at least for this test.

> > 
> > tbench4
> >                           5.10.0-rc6             5.10.0-rc6
> >                       schedstat-v1r1          idlemask-v7r1
> > Hmean     1        504.76 (   0.00%)      500.14 *  -0.91%*
> > Hmean     2       1001.22 (   0.00%)      970.37 *  -3.08%*
> > Hmean     4       1930.56 (   0.00%)     1880.96 *  -2.57%*
> > Hmean     8       3688.05 (   0.00%)     3537.72 *  -4.08%*
> > Hmean     16      6352.71 (   0.00%)     6439.53 *   1.37%*
> > Hmean     32     10066.37 (   0.00%)    10124.65 *   0.58%*
> > Hmean     64     12846.32 (   0.00%)    11627.27 *  -9.49%*
> > Hmean     128    22278.41 (   0.00%)    22304.33 *   0.12%*
> > Hmean     256    21455.52 (   0.00%)    20900.13 *  -2.59%*
> > Hmean     320    21802.38 (   0.00%)    21928.81 *   0.58%*
> > 
> > Not very optimistic result. The schedstats indicate;
> 
> How many client threads was the following schedstats collected?
> 

That's the overall summary for all client counts. While proc-schedstat
was measured every few seconds over all client counts, presenting that
in text format is not easy to parse. However, looking at the graphs over
time, it did not appear that scan rates were consistently lower for any
client count for tbench.

> > 
> >                                 5.10.0-rc6     5.10.0-rc6
> >                             schedstat-v1r1  idlemask-v7r1
> > Ops TTWU Count               5599714302.00  5589495123.00
> > Ops TTWU Local               2687713250.00  2563662550.00
> > Ops SIS Search               5596677950.00  5586381168.00
> > Ops SIS Domain Search        3268344934.00  3229088045.00
> > Ops SIS Scanned             15909069113.00 16568899405.00
> > Ops SIS Domain Scanned      13580736097.00 14211606282.00
> > Ops SIS Failures             2944874939.00  2843113421.00
> > Ops SIS Core Search           262853975.00   311781774.00
> > Ops SIS Core Hit              185189656.00   216097102.00
> > Ops SIS Core Miss              77664319.00    95684672.00
> > Ops SIS Recent Used Hit       124265515.00   146021086.00
> > Ops SIS Recent Used Miss      338142547.00   403547579.00
> > Ops SIS Recent Attempts       462408062.00   549568665.00
> > Ops SIS Search Efficiency            35.18          33.72
> > Ops SIS Domain Search Eff            24.07          22.72
> > Ops SIS Fast Success Rate            41.60          42.20
> > Ops SIS Success Rate                 47.38          49.11
> > Ops SIS Recent Success Rate          26.87          26.57
> > 
> > The field I would expect to decrease is SIS Domain Scanned -- the number
> > of runqueues that were examined but it's actually worse and graphing over
> > time shows it's worse for the client thread counts.  select_idle_cpu()
> > is definitely being called because "Domain Search" is 10 times higher than
> > "Core Search" and there "Core Miss" is non-zero.
> 
> Why SIS Domain Scanned can be decreased?
> 

Because if idle CPUs are being targeted and they are a subset of the entire
domain, then it follows that fewer runqueues should be examined when
scanning the domain.

> I thought SIS Scanned was supposed to be decreased but it seems not on your side.
> 

It *should* have decreased, but it's indicating that more runqueues
were scanned with the patch. It should be noted that the scan count is
naturally variable because the depth of each individual search is variable.

> I printed some trace log on my side by uperf workload, and it looks properly.
> To make the log easy to read, I started a 4 VCPU VM to run 2-second uperf 8 threads.
> 
> stage 1: system idle, update_idle_cpumask is called from idle thread, set cpumask to 0-3
> ========================================================================================
>           <idle>-0       [002] d..1   137.408681: update_idle_cpumask: set_idle-1, cpumask: 2
>           <idle>-0       [000] d..1   137.408713: update_idle_cpumask: set_idle-1, cpumask: 0,2
>           <idle>-0       [003] d..1   137.408924: update_idle_cpumask: set_idle-1, cpumask: 0,2-3
>           <idle>-0       [001] d..1   137.409035: update_idle_cpumask: set_idle-1, cpumask: 0-3
> 

What's the size of the LLC domain on this machine? If it's 4, then this
indicates that there is little difference between scanning the full
domain and targeting idle CPUs via the idle cpumask.

> stage 3: uperf running, select_idle_cpu scan all the CPUs in the scheduler domain at the beginning
> ===================================================================================================
>            uperf-560     [000] d..3   138.418494: select_task_rq_fair: scanning: 0-3
>            uperf-560     [000] d..3   138.418506: select_task_rq_fair: scanning: 0-3
>            uperf-560     [000] d..3   138.418514: select_task_rq_fair: scanning: 0-3
>            uperf-560     [000] dN.3   138.418534: select_task_rq_fair: scanning: 0-3
>            uperf-560     [000] dN.3   138.418543: select_task_rq_fair: scanning: 0-3
>            uperf-560     [000] dN.3   138.418551: select_task_rq_fair: scanning: 0-3
>            uperf-561     [003] d..3   138.418577: select_task_rq_fair: scanning: 0-3
>            uperf-561     [003] d..3   138.418600: select_task_rq_fair: scanning: 0-3
>            uperf-561     [003] d..3   138.418617: select_task_rq_fair: scanning: 0-3
>            uperf-561     [003] d..3   138.418640: select_task_rq_fair: scanning: 0-3
>            uperf-561     [003] d..3   138.418652: select_task_rq_fair: scanning: 0-3
>            uperf-561     [003] d..3   138.418662: select_task_rq_fair: scanning: 0-3
>            uperf-561     [003] d..3   138.418672: select_task_rq_fair: scanning: 0-3
>            uperf-560     [000] d..5   138.418676: select_task_rq_fair: scanning: 0-3
>            uperf-561     [003] d..3   138.418693: select_task_rq_fair: scanning: 0-3
>     kworker/u8:3-110     [002] d..4   138.418746: select_task_rq_fair: scanning: 0-3
> 
> stage 4: scheduler tick comes, update idle cpumask to EMPTY
> ============================================================
>            uperf-572     [002] d.h.   139.420568: update_idle_cpumask: set_idle-0, cpumask: 1,3
>            uperf-574     [000] d.H2   139.420568: update_idle_cpumask: set_idle-0, cpumask: 1,3
>            uperf-565     [003] d.H6   139.420569: update_idle_cpumask: set_idle-0, cpumask: 1
>     tmux: server-528     [001] d.h2   139.420572: update_idle_cpumask: set_idle-0, cpumask: 
> 

It's the timing of the clear that may be problematic. For the
configurations I use, the tick is every 4 milliseconds which is a very
long time in the scheduler for communicating workloads. A simple
round-trip of perf pipe where tasks will be entering/exiting rapidly is
just 0.004 milliseconds on my desktop.

> > 
> > I suspect the issue is that the mask is only marked busy from the tick
> > context which is a very wide window. If select_idle_cpu() picks an idle
> > CPU from the mask, it's still marked as idle in the mask.
> > 
> That should be fine because we still check available_idle_cpu() and sched_idle_cpu for the selected
> CPU. And even if that CPU is marked as idle in the mask, that mask should not be worse(bigger) than
> the default sched_domain_span(sd).
> 

I agree that it should never be worse/bigger but if the idle cpu mask is
often the same bits as the sched_domain_span then it's not going to be
any better either.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
  2020-12-10 11:34     ` Mel Gorman
@ 2020-12-10 12:21       ` Li, Aubrey
  2020-12-10 12:58         ` Mel Gorman
  2020-12-14  7:53       ` Li, Aubrey
  1 sibling, 1 reply; 22+ messages in thread
From: Li, Aubrey @ 2020-12-10 12:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: mingo, peterz, juri.lelli, vincent.guittot, valentin.schneider,
	qais.yousef, dietmar.eggemann, rostedt, bsegall, tim.c.chen,
	linux-kernel, Mel Gorman, Jiang Biao

On 2020/12/10 19:34, Mel Gorman wrote:
> On Thu, Dec 10, 2020 at 04:23:47PM +0800, Li, Aubrey wrote:
>>> I ran this patch with tbench on top of the schedstat patches that
>>> track SIS efficiency. The tracking adds overhead so it's not a perfect
>>> performance comparison but the expectation would be that the patch reduces
>>> the number of runqueues that are scanned
>>
>> Thanks for the measurement! I don't play with tbench so may need a while
>> to digest the data.
>>
> 
> The key point is that it appears the idle mask was mostly equivalent to
> the full domain mask, at least for this test.

I'm more interested in how tbench with heavy load behaves.
If the load is heavy enough and the idle thread has no chance to switch in,
the idle cpumask will be empty at the first scheduler tick and remain empty
until the load comes down. During this period of heavy load:
- the default select_idle_cpu still scans the entire sched domain (or is
  throttled to 4 CPUs by SIS_PROP, see the snippet below) every time
- the patched select_idle_cpu does not scan at all
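
For reference, the throttle mentioned above is the SIS_PROP logic at the
top of select_idle_cpu(); roughly, from memory of the v5.10 code, so treat
it as a sketch rather than an exact quote:

        if (sched_feat(SIS_PROP)) {
                u64 avg_cost, avg_idle, span_avg;

                avg_idle = this_rq()->avg_idle / 512;
                avg_cost = this_sd->avg_scan_cost + 1;

                span_avg = sd->span_weight * avg_idle;
                if (span_avg > 4 * avg_cost)
                        nr = div_u64(span_avg, avg_cost);
                else
                        nr = 4;         /* never scan fewer than 4 CPUs */
        }
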
> 
>>>
>>> tbench4
>>>                           5.10.0-rc6             5.10.0-rc6
>>>                       schedstat-v1r1          idlemask-v7r1
>>> Hmean     1        504.76 (   0.00%)      500.14 *  -0.91%*
>>> Hmean     2       1001.22 (   0.00%)      970.37 *  -3.08%*
>>> Hmean     4       1930.56 (   0.00%)     1880.96 *  -2.57%*
>>> Hmean     8       3688.05 (   0.00%)     3537.72 *  -4.08%*
>>> Hmean     16      6352.71 (   0.00%)     6439.53 *   1.37%*
>>> Hmean     32     10066.37 (   0.00%)    10124.65 *   0.58%*
>>> Hmean     64     12846.32 (   0.00%)    11627.27 *  -9.49%*
>>> Hmean     128    22278.41 (   0.00%)    22304.33 *   0.12%*
>>> Hmean     256    21455.52 (   0.00%)    20900.13 *  -2.59%*
>>> Hmean     320    21802.38 (   0.00%)    21928.81 *   0.58%*
>>>
>>> Not very optimistic result. The schedstats indicate;
>>
>> How many client threads was the following schedstats collected?
>>
> 
> That's the overall summary for all client counts. While proc-schedstat
> was measured every few seconds over all client counts, presenting that
> in text format is not easy to parse. However, looking at the graphs over
> time, it did not appear that scan rates were consistently lower for any
> client count for tbench.
> 
>>>
>>>                                 5.10.0-rc6     5.10.0-rc6
>>>                             schedstat-v1r1  idlemask-v7r1
>>> Ops TTWU Count               5599714302.00  5589495123.00
>>> Ops TTWU Local               2687713250.00  2563662550.00
>>> Ops SIS Search               5596677950.00  5586381168.00
>>> Ops SIS Domain Search        3268344934.00  3229088045.00
>>> Ops SIS Scanned             15909069113.00 16568899405.00
>>> Ops SIS Domain Scanned      13580736097.00 14211606282.00
>>> Ops SIS Failures             2944874939.00  2843113421.00
>>> Ops SIS Core Search           262853975.00   311781774.00
>>> Ops SIS Core Hit              185189656.00   216097102.00
>>> Ops SIS Core Miss              77664319.00    95684672.00
>>> Ops SIS Recent Used Hit       124265515.00   146021086.00
>>> Ops SIS Recent Used Miss      338142547.00   403547579.00
>>> Ops SIS Recent Attempts       462408062.00   549568665.00
>>> Ops SIS Search Efficiency            35.18          33.72
>>> Ops SIS Domain Search Eff            24.07          22.72
>>> Ops SIS Fast Success Rate            41.60          42.20
>>> Ops SIS Success Rate                 47.38          49.11
>>> Ops SIS Recent Success Rate          26.87          26.57
>>>
>>> The field I would expect to decrease is SIS Domain Scanned -- the number
>>> of runqueues that were examined but it's actually worse and graphing over
>>> time shows it's worse for the client thread counts.  select_idle_cpu()
>>> is definitely being called because "Domain Search" is 10 times higher than
>>> "Core Search" and there "Core Miss" is non-zero.
>>
>> Why would SIS Domain Scanned decrease?
>>
> 
> Because if idle CPUs are being targeted and they are a subset of the entire
> domain, then it follows that fewer runqueues should be examined when
> scanning the domain.

Sorry, I probably messed up "SIS Domain Scanned" and "SIS Domain Search".
How is "SIS Domain Scanned" calculated?

> 
>> I thought SIS Scanned was supposed to be decreased but it seems not on your side.
>>
> 
> It *should* have been decreased but it's indicating that more runqueues
> were scanned with the patch. It should be noted that the scan count is
> naturally variable because the depth of each individual search is variable.

I'm confused here: I see 4 places that increase sis_scanned, so what's the difference
between "SIS Domain Scanned" and "SIS Scanned"?

select_idle_sibling(struct task_struct *p, int prev, int target)
---->schedstat_inc(this_rq()->sis_scanned);

select_idle_core(struct task_struct *p, struct sched_domain *sd, int
---->schedstat_inc(this_rq()->sis_scanned);

static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
---->schedstat_inc(this_rq()->sis_scanned);

select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
---->schedstat_inc(this_rq()->sis_scanned);

> 
>> I printed some trace logs on my side with a uperf workload, and they look correct.
>> To make the log easy to read, I started a 4-VCPU VM to run a 2-second uperf with 8 threads.
>>
>> stage 1: system idle, update_idle_cpumask is called from idle thread, set cpumask to 0-3
>> ========================================================================================
>>           <idle>-0       [002] d..1   137.408681: update_idle_cpumask: set_idle-1, cpumask: 2
>>           <idle>-0       [000] d..1   137.408713: update_idle_cpumask: set_idle-1, cpumask: 0,2
>>           <idle>-0       [003] d..1   137.408924: update_idle_cpumask: set_idle-1, cpumask: 0,2-3
>>           <idle>-0       [001] d..1   137.409035: update_idle_cpumask: set_idle-1, cpumask: 0-3
>>
> 
> What's the size of the LLC domain on this machine? If it's 4 then this
> is indicating that there is little difference between scanning the full
> domain and targeting idle CPUs via the idle cpumask.

But the idle cpumask saves 4 loops of runqueue scanning every time select_idle_cpu() runs.

> 
>> stage 3: uperf running, select_idle_cpu scan all the CPUs in the scheduler domain at the beginning
>> ===================================================================================================
>>            uperf-560     [000] d..3   138.418494: select_task_rq_fair: scanning: 0-3
>>            uperf-560     [000] d..3   138.418506: select_task_rq_fair: scanning: 0-3
>>            uperf-560     [000] d..3   138.418514: select_task_rq_fair: scanning: 0-3
>>            uperf-560     [000] dN.3   138.418534: select_task_rq_fair: scanning: 0-3
>>            uperf-560     [000] dN.3   138.418543: select_task_rq_fair: scanning: 0-3
>>            uperf-560     [000] dN.3   138.418551: select_task_rq_fair: scanning: 0-3
>>            uperf-561     [003] d..3   138.418577: select_task_rq_fair: scanning: 0-3
>>            uperf-561     [003] d..3   138.418600: select_task_rq_fair: scanning: 0-3
>>            uperf-561     [003] d..3   138.418617: select_task_rq_fair: scanning: 0-3
>>            uperf-561     [003] d..3   138.418640: select_task_rq_fair: scanning: 0-3
>>            uperf-561     [003] d..3   138.418652: select_task_rq_fair: scanning: 0-3
>>            uperf-561     [003] d..3   138.418662: select_task_rq_fair: scanning: 0-3
>>            uperf-561     [003] d..3   138.418672: select_task_rq_fair: scanning: 0-3
>>            uperf-560     [000] d..5   138.418676: select_task_rq_fair: scanning: 0-3
>>            uperf-561     [003] d..3   138.418693: select_task_rq_fair: scanning: 0-3
>>     kworker/u8:3-110     [002] d..4   138.418746: select_task_rq_fair: scanning: 0-3
>>
>> stage 4: scheduler tick comes, update idle cpumask to EMPTY
>> ============================================================
>>            uperf-572     [002] d.h.   139.420568: update_idle_cpumask: set_idle-0, cpumask: 1,3
>>            uperf-574     [000] d.H2   139.420568: update_idle_cpumask: set_idle-0, cpumask: 1,3
>>            uperf-565     [003] d.H6   139.420569: update_idle_cpumask: set_idle-0, cpumask: 1
>>     tmux: server-528     [001] d.h2   139.420572: update_idle_cpumask: set_idle-0, cpumask: 
>>
> 
> It's the timing of the clear that may be problematic. For the
> configurations I use, the tick is every 4 milliseconds which is a very
> long time in the scheduler for communicating workloads. A simple
> round-trip of perf pipe where tasks will be entering/exiting rapidly is
> just 0.004 milliseconds on my desktop.

Agreed, but that's not the case this patch targets. The target of this
patch is when the system load is heavy enough that the idle thread has no
chance to be scheduled.

> 
>>>
>>> I suspect the issue is that the mask is only marked busy from the tick
>>> context which is a very wide window. If select_idle_cpu() picks an idle
>>> CPU from the mask, it's still marked as idle in the mask.
>>>
>> That should be fine because we still check available_idle_cpu() and sched_idle_cpu for the selected
>> CPU. And even if that CPU is marked as idle in the mask, that mask should not be worse(bigger) than
>> the default sched_domain_span(sd).
>>
> 
> I agree that it should never be worse/bigger but if the idle cpu mask is
> often the same bits as the sched_domain_span then it's not going to be
> any better either.
> 

At least on my side, the benefit is easier to see with v5: rapidly entering
and exiting idle makes the idle governor retain the tick, and v5 uses the
stop_tick signal from the idle governor to decide if this CPU is added to the idle cpumask.
https://lore.kernel.org/lkml/20201118043113.53128-1-aubrey.li@linux.intel.com/

But as Vincent and Peter were concerned, v5 may break latency sensitive workloads.
Also the short-idle definition needs more work.

Thanks,
-Aubrey


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
  2020-12-10 12:21       ` Li, Aubrey
@ 2020-12-10 12:58         ` Mel Gorman
  2020-12-11 17:44           ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2020-12-10 12:58 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: mingo, peterz, juri.lelli, vincent.guittot, valentin.schneider,
	qais.yousef, dietmar.eggemann, rostedt, bsegall, tim.c.chen,
	linux-kernel, Mel Gorman, Jiang Biao

On Thu, Dec 10, 2020 at 08:21:14PM +0800, Li, Aubrey wrote:
> >>>
> >>> The field I would expect to decrease is SIS Domain Scanned -- the number
> >>> of runqueues that were examined but it's actually worse and graphing over
> >>> time shows it's worse for the client thread counts.  select_idle_cpu()
> >>> is definitely being called because "Domain Search" is 10 times higher than
> >>> "Core Search" and there "Core Miss" is non-zero.
> >>
> >> Why would SIS Domain Scanned decrease?
> >>
> > 
> > Because if idle CPUs are being targeted and they are a subset of the entire
> > domain, then it follows that fewer runqueues should be examined when
> > scanning the domain.
> 
> Sorry, I probably messed up "SIS Domain Scanned" and "SIS Domain Search".
> How is "SIS Domain Scanned" calculated?
> 

        my $fast_search = $sis_search - $sis_domain_search;
        my $domain_scanned = $sis_scanned - $fast_search;
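
Plugging the idlemask-v7r1 numbers from the earlier tbench schedstats into
those two lines as a sanity check:

        fast_search    = 5586381168 - 3229088045  = 2357293123
        domain_scanned = 16568899405 - 2357293123 = 14211606282

which is exactly the "Ops SIS Domain Scanned" figure reported there.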

The difference is marginal. The "fast" search is the checking of
target, prev and recent as described in this hunk

@@ -6246,6 +6249,15 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
        unsigned long task_util;
        int i, recent_used_cpu;

+       schedstat_inc(this_rq()->sis_search);
+
+       /*
+        * Checking if prev, target and recent is treated as one scan. A
+        * perfect hit on one of those is considered 100% efficiency.
+        * Further scanning impairs efficiency.
+        */
+       schedstat_inc(this_rq()->sis_scanned);
+
        /*
         * On asymmetric system, update task utilization because we will check
         * that the task fits with cpu's capacity.

For your patch, the fast path is irrelevant. I cared at one point
because I was looking at the hit rate for recent_cpu_used so any
improvement there would reduce the number of runqueues scanned in
select_idle_[core|cpu|smt].


> >> I thought SIS Scanned was supposed to be decreased but it seems not on your side.
> >>
> > 
> > It *should* have been decreased but it's indicating that more runqueues
> > were scanned with the patch. It should be noted that the scan count is
> > naturally variable because the depth of each individual search is variable.
> 
> I'm confused here: I see 4 places that increase sis_scanned, so what's the difference
> between "SIS Domain Scanned" and "SIS Scanned"?
> 

Domain scanned is the scans within select_idle_[core|cpu|smt]. SIS scanned
accounts for the checking of prev, target and recent. Your patch does
not care about the fast path which is perfectly fine. The fast path is
only really relevant when modifying how p->recent_used_cpu is used.

However, there should be some evidence that the patch reduces the amount
of scanning even if it's workload-specific.

> >> I printed some trace log on my side by uperf workload, and it looks properly.
> >> To make the log easy to read, I started a 4 VCPU VM to run 2-second uperf 8 threads.
> >>
> >> stage 1: system idle, update_idle_cpumask is called from idle thread, set cpumask to 0-3
> >> ========================================================================================
> >>           <idle>-0       [002] d..1   137.408681: update_idle_cpumask: set_idle-1, cpumask: 2
> >>           <idle>-0       [000] d..1   137.408713: update_idle_cpumask: set_idle-1, cpumask: 0,2
> >>           <idle>-0       [003] d..1   137.408924: update_idle_cpumask: set_idle-1, cpumask: 0,2-3
> >>           <idle>-0       [001] d..1   137.409035: update_idle_cpumask: set_idle-1, cpumask: 0-3
> >>
> > 
> > What's the size of the LLC domain on this machine? If it's 4 then this
> > is indicating that there is little difference between scanning the full
> > domain and targeting idle CPUs via the idle cpumask.
> 
> But the idle cpumask saves 4 loops of runqueue scanning every time select_idle_cpu() runs.
> 

I don't follow. If sds_idle_cpus(sd->shared) == sched_domain_span(sd)
then there are no savings.
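
If you want to verify that on your setup, a trivial debug hook would be
something like the below (hypothetical, not part of any posted patch;
sis_mask_full is a made-up counter):

        /* count how often the idle mask degenerates to the full LLC span */
        if (cpumask_equal(sds_idle_cpus(sd->shared), sched_domain_span(sd)))
                schedstat_inc(this_rq()->sis_mask_full);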

> > It's the timing of the clear that may be problematic. For the
> > configurations I use, the tick is every 4 milliseconds which is a very
> > long time in the scheduler for communicating workloads. A simple
> > round-trip of perf pipe where tasks will be entering/exiting rapidly is
> > just 0.004 milliseconds on my desktop.
> 
> Agreed, but that's not the case this patch targets. The target of this
> patch is when the system load is heavy enough that the idle thread has no
> chance to be scheduled.
> 

That's somewhat more marginal as a benefit but fair enough. I took a
closer look at the scan rates for 320 clients which loads the machine
fairly heavily. mpstat reported 0% idle time but the domain scanned
with or without the patch is almost identical which hints that scans were
not avoided. As I write this, hackbench has not executed yet to see if
it makes a difference but it's in the queue.

> > I agree that it should never be worse/bigger but if the idle cpu mask is
> > often the same bits as the sched_domain_span then it's not going to be
> > any better either.
> > 
> 
> At least on my side, v5 is easier to see the benefit, as rapidly entering
> and exiting idle will make idle governor to retain the tick, v5 uses stop_tick
> signal from idle governor to decide if this CPU is added to idle cpumask.
> https://lore.kernel.org/lkml/20201118043113.53128-1-aubrey.li@linux.intel.com/
> 
> But as Vincent and Peter concerned, v5 may break the latency sensitive workload.
> Also the short idle definition need more work.
> 

Indeed, this is why I was advocating an approach that would use the idle
CPU mask as a hint to quickly find idle CPUs but still fall through to
scanning the rest of the domain if necessary (sketched below). That would cover the
"latency sensitive" workload situation because a deeper scan still
happens if necessary. While I'm not 100% certain, they were probably
thinking about schbench which favours deep searches for idle CPUs and is
latency sensitive.
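
A rough sketch of the shape I mean, reusing the local variables of
select_idle_cpu() (an illustration of the two-phase idea only, not the
rejected prerequisite patch):

        /* Phase 1: cheap pass over CPUs currently marked idle */
        cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
        for_each_cpu_wrap(cpu, cpus, target) {
                if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
                        return cpu;
        }

        /* Phase 2: hint was empty or stale, fall back to the rest of the domain */
        cpumask_andnot(cpus, sched_domain_span(sd), cpus);
        cpumask_and(cpus, cpus, p->cpus_ptr);
        for_each_cpu_wrap(cpu, cpus, target) {
                if (!--nr)
                        return -1;
                if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
                        return cpu;
        }

        return -1;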

The prerequisite patch to make that approach work was rejected though
as on its own, it's not very helpful and Vincent didn't like that the
load_balance_mask was abused to make it effective.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
  2020-12-10 12:58         ` Mel Gorman
@ 2020-12-11 17:44           ` Peter Zijlstra
  2020-12-11 20:43             ` Mel Gorman
  2020-12-14  9:18             ` Vincent Guittot
  0 siblings, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2020-12-11 17:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Li, Aubrey, mingo, juri.lelli, vincent.guittot,
	valentin.schneider, qais.yousef, dietmar.eggemann, rostedt,
	bsegall, tim.c.chen, linux-kernel, Mel Gorman, Jiang Biao

On Thu, Dec 10, 2020 at 12:58:33PM +0000, Mel Gorman wrote:
> The prerequisite patch to make that approach work was rejected though
> as on its own, it's not very helpful and Vincent didn't like that the
> load_balance_mask was abused to make it effective.

So last time I poked at all this, I found that using more masks was a
performance issue as well due to added cache misses.

Anyway, I finally found some time to look at all this, and while reading
over the whole SiS crud to refresh the brain came up with the below...

It's still naf, but at least it gets rid of a bunch of duplicate
scanning and LoC decreases, so it should be awesome. Of course, as
always, benchmarks will totally ruin everything, we'll see, I started
some.

It goes on top of Mel's first two patches (which is about as far as I
got)...

---
 fair.c |  124 ++++++++++++++++++++++++-----------------------------------------
 1 file changed, 47 insertions(+), 77 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6021,11 +6021,13 @@ static inline void set_idle_cores(int cp
 
 static inline bool test_idle_cores(int cpu, bool def)
 {
-	struct sched_domain_shared *sds;
+	if (static_branch_likely(&sched_smt_present)) {
+		struct sched_domain_shared *sds;
 
-	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
-	if (sds)
-		return READ_ONCE(sds->has_idle_cores);
+		sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+		if (sds)
+			return READ_ONCE(sds->has_idle_cores);
+	}
 
 	return def;
 }
@@ -6059,77 +6061,39 @@ void __update_idle_core(struct rq *rq)
 	rcu_read_unlock();
 }
 
-/*
- * Scan the entire LLC domain for idle cores; this dynamically switches off if
- * there are no idle cores left in the system; tracked through
- * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
- */
-static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
+static int __select_idle_core(int core, struct cpumask *cpus, int *idle_cpu)
 {
-	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-	int core, cpu;
-
-	if (!static_branch_likely(&sched_smt_present))
-		return -1;
-
-	if (!test_idle_cores(target, false))
-		return -1;
-
-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
-
-	for_each_cpu_wrap(core, cpus, target) {
-		bool idle = true;
+	bool idle = true;
+	int cpu;
 
-		for_each_cpu(cpu, cpu_smt_mask(core)) {
-			if (!available_idle_cpu(cpu)) {
-				idle = false;
-				break;
-			}
+	for_each_cpu(cpu, cpu_smt_mask(core)) {
+		if (!available_idle_cpu(cpu)) {
+			idle = false;
+			continue;
 		}
-
-		if (idle)
-			return core;
-
-		cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
+		if (idle_cpu)
+			*idle_cpu = cpu;
 	}
 
-	/*
-	 * Failed to find an idle core; stop looking for one.
-	 */
-	set_idle_cores(target, 0);
+	if (idle)
+		return core;
 
+	cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
 	return -1;
 }
 
-/*
- * Scan the local SMT mask for idle CPUs.
- */
-static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
-{
-	int cpu;
-
-	if (!static_branch_likely(&sched_smt_present))
-		return -1;
-
-	for_each_cpu(cpu, cpu_smt_mask(target)) {
-		if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
-		    !cpumask_test_cpu(cpu, sched_domain_span(sd)))
-			continue;
-		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
-			return cpu;
-	}
+#else /* CONFIG_SCHED_SMT */
 
-	return -1;
+static inline void set_idle_cores(int cpu, int val)
+{
 }
 
-#else /* CONFIG_SCHED_SMT */
-
-static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
+static inline bool test_idle_cores(int cpu, bool def)
 {
-	return -1;
+	return def;
 }
 
-static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
+static inline int __select_idle_core(int core, struct cpumask *cpus, int *idle_cpu)
 {
 	return -1;
 }
@@ -6144,10 +6108,11 @@ static inline int select_idle_smt(struct
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+	int this = smp_processor_id();
+	bool smt = test_idle_cores(this, false);
+	int i, cpu, idle_cpu = -1, nr = INT_MAX;
 	struct sched_domain *this_sd;
 	u64 time;
-	int this = smp_processor_id();
-	int cpu, nr = INT_MAX;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6155,7 +6120,7 @@ static int select_idle_cpu(struct task_s
 
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
-	if (sched_feat(SIS_PROP)) {
+	if (sched_feat(SIS_PROP) && !smt) {
 		u64 avg_cost, avg_idle, span_avg;
 
 		/*
@@ -6175,18 +6140,31 @@ static int select_idle_cpu(struct task_s
 	}
 
 	for_each_cpu_wrap(cpu, cpus, target) {
-		if (!--nr)
-			return -1;
-		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
-			break;
+		if (smt) {
+			i = __select_idle_core(cpu, cpus, &idle_cpu);
+			if ((unsigned)i < nr_cpumask_bits)
+				return i;
+
+		} else {
+			if (!--nr)
+				return -1;
+
+			if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
+				idle_cpu = cpu;
+				break;
+			}
+		}
 	}
 
-	if (sched_feat(SIS_PROP)) {
+	if (smt)
+		set_idle_cores(this, false);
+
+	if (sched_feat(SIS_PROP) && !smt) {
 		time = cpu_clock(this) - time;
 		update_avg(&this_sd->avg_scan_cost, time);
 	}
 
-	return cpu;
+	return idle_cpu;
 }
 
 /*
@@ -6315,18 +6293,10 @@ static int select_idle_sibling(struct ta
 	if (!sd)
 		return target;
 
-	i = select_idle_core(p, sd, target);
-	if ((unsigned)i < nr_cpumask_bits)
-		return i;
-
 	i = select_idle_cpu(p, sd, target);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
-	i = select_idle_smt(p, sd, target);
-	if ((unsigned)i < nr_cpumask_bits)
-		return i;
-
 	return target;
 }
 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
  2020-12-11 17:44           ` Peter Zijlstra
@ 2020-12-11 20:43             ` Mel Gorman
  2020-12-11 22:19               ` Peter Zijlstra
  2020-12-14  9:18             ` Vincent Guittot
  1 sibling, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2020-12-11 20:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Li, Aubrey, mingo, juri.lelli, vincent.guittot,
	valentin.schneider, qais.yousef, dietmar.eggemann, rostedt,
	bsegall, tim.c.chen, linux-kernel, Mel Gorman, Jiang Biao

On Fri, Dec 11, 2020 at 06:44:42PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 10, 2020 at 12:58:33PM +0000, Mel Gorman wrote:
> > The prerequisite patch to make that approach work was rejected though
> > as on its own, it's not very helpful and Vincent didn't like that the
> > load_balance_mask was abused to make it effective.
> 
> So last time I poked at all this, I found that using more masks was a
> performance issue as well due to added cache misses.
> 
> Anyway, I finally found some time to look at all this, and while reading
> over the whole SiS crud to refresh the brain came up with the below...
> 
> It's still naf, but at least it gets rid of a bunch of duplicate
> scanning and LoC decreases, so it should be awesome. Of course, as
> always, benchmarks will totally ruin everything, we'll see, I started
> some.
> 
> It goes on top of Mel's first two patches (which is about as far as I
> got)...
> 

It's not free of bugs but it's interesting! The positive aspects are that
it does a single scan, inits select_idle_mask once and caches an idle
candidate when searching for an idle core but in a far cleaner fashion
than what I did.

One bug is in __select_idle_core() though. It's scanning the SMT mask,
not select_idle_mask so it can return an idle candidate that is not in
p->cpus_ptr.

I queued tests anyway to see what it looks like.

There are some other potential caveats.

This is a single pass so when test_idle_cores() is true, __select_idle_core
is used to check all the siblings even if the core is not idle. That
could have been cut short if __select_idle_core checked whether *idle_cpu
was already set and terminated the SMT scan once an idle candidate had been found.

Second downside is related. If test_idle_cores() was true but no idle
CPU is found then __select_idle_core has been called enough to scan
the entire domain. In this corner case, the new code does *more* work
because the old code would have failed select_idle_core() quickly and
then select_idle_cpu() would be throttled by SIS_PROP. I think this will
only be noticeable in the heavily overloaded case but if the corner case
hits enough then the new code will be slower than the old code for the
over-saturated case (i.e. hackbench with lots of groups).

The third potential downside is that the SMT sibling is not guaranteed to
be checked due to SIS_PROP throttling but in the old code, that would have
been checked by select_idle_smt(). That might result in premature stacking
of runnable tasks on the same CPU. Similarly, as __select_idle_core may
find multiple idle candidates, it will not pick the target's SMT sibling
if it is idle like select_idle_smt would have.

That said, I am skeptical that select_idle_smt() matters all that often.
If select_idle_cpu() has been throttled heavily then the machine is
either over-saturated or was over-saturated in the very recent past. In
that situation, the chances of the SMT siblings being free are slim.

Each of the downsides could potentially addressed with this untested
mess

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 637f610c7059..260592d143d8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6061,7 +6061,7 @@ void __update_idle_core(struct rq *rq)
 	rcu_read_unlock();
 }
 
-static int __select_idle_core(int core, struct cpumask *cpus, int *idle_cpu)
+static int __select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
 {
 	bool idle = true;
 	int cpu;
@@ -6070,9 +6070,11 @@ static int __select_idle_core(int core, struct cpumask *cpus, int *idle_cpu)
 		schedstat_inc(this_rq()->sis_scanned);
 		if (!available_idle_cpu(cpu)) {
 			idle = false;
-			continue;
+			if (*idle_cpu == -1)
+				continue;
+			break;
 		}
-		if (idle_cpu)
+		if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
 			*idle_cpu = cpu;
 	}
 
@@ -6094,7 +6096,7 @@ static inline bool test_idle_cores(int cpu, bool def)
 	return def;
 }
 
-static inline int __select_idle_core(int core, struct cpumask *cpus, int *idle_cpu)
+static inline int __select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
 {
 	return -1;
 }
@@ -6142,7 +6144,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
 	for_each_cpu_wrap(cpu, cpus, target) {
 		if (smt) {
-			i = __select_idle_core(cpu, cpus, &idle_cpu);
+			i = __select_idle_core(p, cpu, cpus, &idle_cpu);
 			if ((unsigned)i < nr_cpumask_bits)
 				return i;
 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
  2020-12-11 20:43             ` Mel Gorman
@ 2020-12-11 22:19               ` Peter Zijlstra
  2020-12-11 22:50                 ` Mel Gorman
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2020-12-11 22:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Li, Aubrey, mingo, juri.lelli, vincent.guittot,
	valentin.schneider, qais.yousef, dietmar.eggemann, rostedt,
	bsegall, tim.c.chen, linux-kernel, Mel Gorman, Jiang Biao

On Fri, Dec 11, 2020 at 08:43:37PM +0000, Mel Gorman wrote:
> One bug is in __select_idle_core() though. It's scanning the SMT mask,
> not select_idle_mask so it can return an idle candidate that is not in
> p->cpus_ptr.

D'0h.. luckily the benchmarks don't hit that :-)

> There are some other potential caveats.
> 
> This is a single pass so when test_idle_cores() is true, __select_idle_core
> > is used to check all the siblings even if the core is not idle. That
> could have been cut short if __select_idle_core checked *idle_cpu ==
> 1 and terminated the SMT scan if an idle candidate had already been found.

So I did that on purpose, so as to track the last/most-recent idle cpu,
with the thinking that that cpu has the higher chance of still being
idle vs one we checked earlier/longer-ago.

I suppose we benchmark both and see which is liked best.

> Second downside is related. If test_idle_cpus() was true but no idle
> CPU is found then __select_idle_core has been called enough to scan
> the entire domain. In this corner case, the new code does *more* work
> because the old code would have failed select_idle_core() quickly and
> then select_idle_cpu() would be throttled by SIS_PROP. I think this will
> only be noticable in the heavily overloaded case but if the corner case
> hits enough then the new code will be slower than the old code for the
> over-saturated case (i.e. hackbench with lots of groups).

Right, due to scanning siblings, even if the first inspected thread is
not idle, we scan more.

> The third potential downside is that the SMT sibling is not guaranteed to
> be checked due to SIS_PROP throttling but in the old code, that would have
> been checked by select_idle_smt(). That might result in premature stacking
> of runnable tasks on the same CPU. Similarly, as __select_idle_core may
> find multiple idle candidates, it will not pick the targets SMT sibling
> if it is idle like select_idle_smt would have.
> 
> That said, I am skeptical that select_idle_smt() matters all that often.

This, I didn't really believe in it either.


The benchmarks I started are mostly noise, with a few wins for
TCP_STREAM and UDP_RR around the 50% mark. Although I should run
longer variants to make sure.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
  2020-12-11 22:19               ` Peter Zijlstra
@ 2020-12-11 22:50                 ` Mel Gorman
  2020-12-14  8:11                   ` Vincent Guittot
  2020-12-14  9:32                   ` Peter Zijlstra
  0 siblings, 2 replies; 22+ messages in thread
From: Mel Gorman @ 2020-12-11 22:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Li, Aubrey, mingo, juri.lelli, vincent.guittot,
	valentin.schneider, qais.yousef, dietmar.eggemann, rostedt,
	bsegall, tim.c.chen, linux-kernel, Mel Gorman, Jiang Biao

On Fri, Dec 11, 2020 at 11:19:05PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 11, 2020 at 08:43:37PM +0000, Mel Gorman wrote:
> > One bug is in __select_idle_core() though. It's scanning the SMT mask,
> > not select_idle_mask so it can return an idle candidate that is not in
> > p->cpus_ptr.
> 
> D'0h.. luckily the benchmarks don't hit that :-)
> 

Yep, neither do mine for the most part which is why I ran it as-is but
eventually someone would complain that sched_setscheduler was being
ignored.

Trick is whether a failed check should continue searching for an idle core
or terminate early and just pick an allowed idle CPU. I tend to favour
the latter but it's hard to predict what sort of reasonable workloads
would be affected.

> > There are some other potential caveats.
> > 
> > This is a single pass so when test_idle_cores() is true, __select_idle_core
> > > is used to check all the siblings even if the core is not idle. That
> > could have been cut short if __select_idle_core checked *idle_cpu ==
> > 1 and terminated the SMT scan if an idle candidate had already been found.
> 
> So I did that on purpose, so as to track the last/most-recent idle cpu,
> with the thinking that that cpu has the higher chance of still being
> idle vs one we checked earlier/longer-ago.
> 
> I suppose we benchmark both and see which is liked best.
> 

I originally did something like that on purpose too but Vincent called
it out so it is worth mentioning now to avoid surprises. That said, I'm
at the point where anything SIS-related simply needs excessive exposure
because no matter what it does, someone is unhappy with it.

> > Second downside is related. If test_idle_cpus() was true but no idle
> > CPU is found then __select_idle_core has been called enough to scan
> > the entire domain. In this corner case, the new code does *more* work
> > because the old code would have failed select_idle_core() quickly and
> > then select_idle_cpu() would be throttled by SIS_PROP. I think this will
> > only be noticable in the heavily overloaded case but if the corner case
> > hits enough then the new code will be slower than the old code for the
> > over-saturated case (i.e. hackbench with lots of groups).
> 
> Right, due to scanning siblings, even if the first inspected thread is
> not idle, we scan more.
> 

Yep.

> > The third potential downside is that the SMT sibling is not guaranteed to
> > be checked due to SIS_PROP throttling but in the old code, that would have
> > been checked by select_idle_smt(). That might result in premature stacking
> > of runnable tasks on the same CPU. Similarly, as __select_idle_core may
> > find multiple idle candidates, it will not pick the targets SMT sibling
> > if it is idle like select_idle_smt would have.
> > 
> > That said, I am skeptical that select_idle_smt() matters all that often.
> 
> This, I didn't really believe in it either.
> 

Good because I think any benefit from select_idle_smt is so marginal
that it should be ignored if the full scan is simpler overall.

> The benchmarks I started are mostly noise, with a few wins for
> TCP_STREAM and UDP_RR around the 50% mark. Although I should run
> longer variants to make sure.

So far I have one benchmark from one machine. It's tbench again because
it's a reasonable communicating workload that is trivial to vary the
level of utilisation.

80-cpu CascadeLake, 2 sockets, HT enabled
tbench4
                          5.10.0-rc6             5.10.0-rc6             5.10.0-rc6
                       baseline-v1r1        singlescan-v1r1        singlescan-v1r3
Hmean     1        503.90 (   0.00%)      505.19 *   0.26%*      499.20 *  -0.93%*
Hmean     2        980.80 (   0.00%)      975.15 *  -0.58%*      983.79 *   0.31%*
Hmean     4       1912.37 (   0.00%)     1883.25 *  -1.52%*     1923.76 *   0.60%*
Hmean     8       3741.47 (   0.00%)     3568.66 *  -4.62%*     3690.60 *  -1.36%*
Hmean     16      6552.90 (   0.00%)     6549.97 *  -0.04%*     6478.37 *  -1.14%*
Hmean     32     10217.34 (   0.00%)    10266.66 *   0.48%*    10291.60 *   0.73%*
Hmean     64     13604.93 (   0.00%)    11240.88 * -17.38%*    12045.87 * -11.46%*
Hmean     128    21194.11 (   0.00%)    21316.08 *   0.58%*    21868.39 *   3.18%*
Hmean     256    21163.19 (   0.00%)    20989.14 *  -0.82%*    20831.20 *  -1.57%*
Hmean     320    20906.29 (   0.00%)    21024.11 *   0.56%*    20756.88 *  -0.71%*
Stddev    1          3.14 (   0.00%)        1.17 (  62.93%)        1.05 (  66.61%)
Stddev    2          4.44 (   0.00%)        3.72 (  16.35%)        2.20 (  50.56%)
Stddev    4          9.09 (   0.00%)       18.67 (-105.32%)        3.66 (  59.71%)
Stddev    8         12.87 (   0.00%)       18.96 ( -47.31%)       11.90 (   7.58%)
Stddev    16         8.34 (   0.00%)        8.77 (  -5.22%)       36.25 (-334.84%)
Stddev    32        27.05 (   0.00%)       20.90 (  22.74%)       28.57 (  -5.61%)
Stddev    64       720.66 (   0.00%)       20.12 (  97.21%)       32.10 (  95.55%)
Stddev    128       17.49 (   0.00%)       52.33 (-199.22%)      137.68 (-687.25%)
Stddev    256       57.17 (   0.00%)       18.87 (  67.00%)       38.98 (  31.81%)
Stddev    320       29.87 (   0.00%)       46.49 ( -55.64%)       31.48 (  -5.39%)

                  5.10.0-rc6  5.10.0-rc6  5.10.0-rc6
                baseline-v1r1singlescan-v1r1singlescan-v1r3
Duration User        9771.18     9435.64     9353.29
Duration System     34224.28    33829.01    33802.34
Duration Elapsed     2218.87     2218.87     2218.69

baseline is roughly what's in tip for 5.11-rc1 with patches 1-2 from my
series like you had.

singlescan-v1r1 is your patch

singlescan-v1r3 is your patch with my "untested" patch on top that
enforces p->cpus_ptr and short-cuts corner cases.

Few points of interest

1. 64 clients regresses. With 64 clients, this is roughly the point
   where test_idle_cores() returns false positives and we hit the worst
   corner cases. Your patch regresses 17%, mine is only a marginal
   improvement and still a regression.

2. Variations are all over the place. Your patch is great at low
   utilisation and stabilises overall. After that, your mileage varies a lot

3. The system CPU usage summarised over the entire workload drops quite
   a bit. Whether it's your patch or minor changes on top, there is less
   work being done within the kernel overall

A wider battery of tests is still running and a second set is queued
that adds the debugging schedstats but they take ages to run.

I'm currently on "holidays" so response time will be variable :P

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
  2020-12-10 11:34     ` Mel Gorman
  2020-12-10 12:21       ` Li, Aubrey
@ 2020-12-14  7:53       ` Li, Aubrey
  1 sibling, 0 replies; 22+ messages in thread
From: Li, Aubrey @ 2020-12-14  7:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: mingo, peterz, juri.lelli, vincent.guittot, valentin.schneider,
	qais.yousef, dietmar.eggemann, rostedt, bsegall, tim.c.chen,
	linux-kernel, Mel Gorman, Jiang Biao

On 2020/12/10 19:34, Mel Gorman wrote:
> On Thu, Dec 10, 2020 at 04:23:47PM +0800, Li, Aubrey wrote:
>>> I ran this patch with tbench on top of the schedstat patches that
>>> track SIS efficiency. The tracking adds overhead so it's not a perfect
>>> performance comparison but the expectation would be that the patch reduces
>>> the number of runqueues that are scanned
>>
>> Thanks for the measurement! I don't play with tbench so may need a while
>> to digest the data.
>>
> 
> The key point is that it appears the idle mask was mostly equivalent to
> the full domain mask, at least for this test.
> 
>>>
>>> tbench4
>>>                           5.10.0-rc6             5.10.0-rc6
>>>                       schedstat-v1r1          idlemask-v7r1
>>> Hmean     1        504.76 (   0.00%)      500.14 *  -0.91%*
>>> Hmean     2       1001.22 (   0.00%)      970.37 *  -3.08%*
>>> Hmean     4       1930.56 (   0.00%)     1880.96 *  -2.57%*
>>> Hmean     8       3688.05 (   0.00%)     3537.72 *  -4.08%*
>>> Hmean     16      6352.71 (   0.00%)     6439.53 *   1.37%*
>>> Hmean     32     10066.37 (   0.00%)    10124.65 *   0.58%*


>>> Hmean     64     12846.32 (   0.00%)    11627.27 *  -9.49%*

I focused on this case and ran it 5 times, and here is the data on my side:
5 times x 600s tbench, thread number is 153 (80% of 192 h/w threads).

Hmean 153		v5.9.12			v5.9.12
			schedstat-v1		idlemask-v8(with schedstat)
Round 1			15717.3			15608.1
Round 2			14856.9			15642.5
Round 3			14856.7			15782.1
Round 4			15408.9			15912.9
Round 5			15436.6			15927.7

From the tbench throughput data, bigger is better, and it looks like idlemask wins.

And here is SIS_scanned data:

Hmean 153		v5.9.12			v5.9.12
			schedstat-v1		idlemask-v8(with schedstat)
Round 1			22562490432		21894932302
Round 2			21288529957		21693722629
Round 3			20657521771		21268308377
Round 4			21868486414		22289128955
Round 5			21859614988		22214740417

From the SIS_scanned data, less is better, so the default one looks better.

But combined with the throughput data this can be explained: higher throughput
performs more SIS_scanned.

So at least there is no regression in this case.
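
For example, normalising Round 1 by throughput (a rough way to read the
two tables together, not a proper statistic):

	schedstat-v1:  22562490432 / 15717.3 ~= 1.44e6 scans per unit of throughput
	idlemask-v8:   21894932302 / 15608.1 ~= 1.40e6 scans per unit of throughput

so per unit of throughput the idlemask kernel actually scans slightly less.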

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
  2020-12-11 22:50                 ` Mel Gorman
@ 2020-12-14  8:11                   ` Vincent Guittot
  2020-12-14  9:31                     ` Peter Zijlstra
  2020-12-14  9:32                   ` Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Vincent Guittot @ 2020-12-14  8:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Li, Aubrey, Ingo Molnar, Juri Lelli,
	Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, linux-kernel, Mel Gorman,
	Jiang Biao

On Fri, 11 Dec 2020 at 23:50, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Fri, Dec 11, 2020 at 11:19:05PM +0100, Peter Zijlstra wrote:
> > On Fri, Dec 11, 2020 at 08:43:37PM +0000, Mel Gorman wrote:
> > > One bug is in __select_idle_core() though. It's scanning the SMT mask,
> > > not select_idle_mask so it can return an idle candidate that is not in
> > > p->cpus_ptr.
> >
> > D'0h.. luckily the benchmarks don't hit that :-)
> >
>
> Yep, neither do mine for the most part which is why I ran it as-is but
> eventually someone would complain that sched_setscheduler was being
> ignored.
>
> Trick is whether a failed check should continue searching for an idle core
> or terminate early and just pick an allowed idle CPU. I tend to favour
> the latter but it's hard to predict what sort of reasonable workloads
> would be affected.
>
> > > There are some other potential caveats.
> > >
> > > This is a single pass so when test_idle_cores() is true, __select_idle_core
> > > is used to check all the siblings even if the core is not idle. That
> > > could have been cut short if __select_idle_core checked *idle_cpu ==
> > > 1 and terminated the SMT scan if an idle candidate had already been found.
> >
> > So I did that on purpose, so as to track the last/most-recent idle cpu,
> > with the thinking that that cpu has the higher chance of still being
> > idle vs one we checked earlier/longer-ago.
> >
> > I suppose we benchmark both and see which is liked best.
> >
>
> I originally did something like that on purpose too but Vincent called
> it out so it is worth mentioning now to avoid surprises. That said, I'm
> at the point where anything SIS-related simply needs excessive exposure
> because no matter what it does, someone is unhappy with it.

Yeah, I don't like extending the idle core search loop for something
that is not related to the idle core search. This adds more work outside
the control of SIS_PROP. So I'm still against hiding some idle cpu
search in the idle core search.

>
> > > Second downside is related. If test_idle_cpus() was true but no idle
> > > CPU is found then __select_idle_core has been called enough to scan
> > > the entire domain. In this corner case, the new code does *more* work
> > > because the old code would have failed select_idle_core() quickly and
> > > then select_idle_cpu() would be throttled by SIS_PROP. I think this will
> > > only be noticable in the heavily overloaded case but if the corner case
> > > hits enough then the new code will be slower than the old code for the
> > > over-saturated case (i.e. hackbench with lots of groups).
> >
> > Right, due to scanning siblings, even if the first inspected thread is
> > not idle, we scan more.
> >
>
> Yep.
>
> > > The third potential downside is that the SMT sibling is not guaranteed to
> > > be checked due to SIS_PROP throttling but in the old code, that would have
> > > been checked by select_idle_smt(). That might result in premature stacking
> > > of runnable tasks on the same CPU. Similarly, as __select_idle_core may
> > > find multiple idle candidates, it will not pick the targets SMT sibling
> > > if it is idle like select_idle_smt would have.
> > >
> > > That said, I am skeptical that select_idle_smt() matters all that often.
> >
> > This, I didn't really believe in it either.
> >
>
> Good because I think any benefit from select_idle_smt is so marginal
> that it should be ignored if the full scan is simpler overall.
>
> > The benchmarks I started are mostly noise, with a few wins for
> > TCP_STREAM and UDP_RR around the 50% mark. Although I should run
> > longer variants to make sure.
>
> So far I have one benchmark from one machine. It's tbench again because
> it's a reasonable communicating workload that is trivial to vary the
> level of utilisation.
>
> 80-cpu CascadeLake, 2 sockets, HT enabled
> tbench4
>                           5.10.0-rc6             5.10.0-rc6             5.10.0-rc6
>                        baseline-v1r1        singlescan-v1r1        singlescan-v1r3
> Hmean     1        503.90 (   0.00%)      505.19 *   0.26%*      499.20 *  -0.93%*
> Hmean     2        980.80 (   0.00%)      975.15 *  -0.58%*      983.79 *   0.31%*
> Hmean     4       1912.37 (   0.00%)     1883.25 *  -1.52%*     1923.76 *   0.60%*
> Hmean     8       3741.47 (   0.00%)     3568.66 *  -4.62%*     3690.60 *  -1.36%*
> Hmean     16      6552.90 (   0.00%)     6549.97 *  -0.04%*     6478.37 *  -1.14%*
> Hmean     32     10217.34 (   0.00%)    10266.66 *   0.48%*    10291.60 *   0.73%*
> Hmean     64     13604.93 (   0.00%)    11240.88 * -17.38%*    12045.87 * -11.46%*
> Hmean     128    21194.11 (   0.00%)    21316.08 *   0.58%*    21868.39 *   3.18%*
> Hmean     256    21163.19 (   0.00%)    20989.14 *  -0.82%*    20831.20 *  -1.57%*
> Hmean     320    20906.29 (   0.00%)    21024.11 *   0.56%*    20756.88 *  -0.71%*
> Stddev    1          3.14 (   0.00%)        1.17 (  62.93%)        1.05 (  66.61%)
> Stddev    2          4.44 (   0.00%)        3.72 (  16.35%)        2.20 (  50.56%)
> Stddev    4          9.09 (   0.00%)       18.67 (-105.32%)        3.66 (  59.71%)
> Stddev    8         12.87 (   0.00%)       18.96 ( -47.31%)       11.90 (   7.58%)
> Stddev    16         8.34 (   0.00%)        8.77 (  -5.22%)       36.25 (-334.84%)
> Stddev    32        27.05 (   0.00%)       20.90 (  22.74%)       28.57 (  -5.61%)
> Stddev    64       720.66 (   0.00%)       20.12 (  97.21%)       32.10 (  95.55%)
> Stddev    128       17.49 (   0.00%)       52.33 (-199.22%)      137.68 (-687.25%)
> Stddev    256       57.17 (   0.00%)       18.87 (  67.00%)       38.98 (  31.81%)
> Stddev    320       29.87 (   0.00%)       46.49 ( -55.64%)       31.48 (  -5.39%)
>
>                   5.10.0-rc6  5.10.0-rc6  5.10.0-rc6
>                 baseline-v1r1singlescan-v1r1singlescan-v1r3
> Duration User        9771.18     9435.64     9353.29
> Duration System     34224.28    33829.01    33802.34
> Duration Elapsed     2218.87     2218.87     2218.69
>
> baseline is roughly what's in tip for 5.11-rc1 with patches 1-2 from my
> series like you had.
>
> singlescan-v1r1 is your patch
>
> singlescan-v1r3 is your patch with my "untested" patch on top that
> enforces p->cpus_ptr and short-cuts corner cases.
>
> Few points of interest
>
> 1. 64 clients regresses. With 64 clients, this is roughly the point
>    where test_idle_cores() returns false positives and we hit the worst
>    corner cases. Your patch regresses 17%, mine is only a marginal
>    improvement and still a regression slower.
>
> 2. Variations are all over the place. Your patch is great at low
>    utilisation and stabilises overall. After that, your milage varies a lot
>
> 3. The system CPu usage summarised over the entire workload drops quite
>    a bit. Whether it's your patch or minor changes on top, there is less
>    work being done within the kernel overall
>
> A wider battery of tests is still running and a second set is queued
> that adds the debugging schedstats but they take ages to run.
>
> I'm currently on "holidays" so response time will be variable :P
>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
  2020-12-11 17:44           ` Peter Zijlstra
  2020-12-11 20:43             ` Mel Gorman
@ 2020-12-14  9:18             ` Vincent Guittot
  2020-12-14 12:42               ` Mel Gorman
  1 sibling, 1 reply; 22+ messages in thread
From: Vincent Guittot @ 2020-12-14  9:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Li, Aubrey, Ingo Molnar, Juri Lelli,
	Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, linux-kernel, Mel Gorman,
	Jiang Biao

On Fri, 11 Dec 2020 at 18:45, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Dec 10, 2020 at 12:58:33PM +0000, Mel Gorman wrote:
> > The prerequisite patch to make that approach work was rejected though
> > as on its own, it's not very helpful and Vincent didn't like that the
> > load_balance_mask was abused to make it effective.
>
> So last time I poked at all this, I found that using more masks was a
> performance issue as well due to added cache misses.
>
> Anyway, I finally found some time to look at all this, and while reading
> over the whole SiS crud to refresh the brain came up with the below...
>
> It's still naf, but at least it gets rid of a bunch of duplicate
> scanning and LoC decreases, so it should be awesome. Of course, as
> always, benchmarks will totally ruin everything, we'll see, I started
> some.
>
> It goes on top of Mel's first two patches (which is about as far as I
> got)...

We have several different things that Aubrey's and Mel's patches are
trying to achieve:

Aubrey wants to avoid scanning busy cpus
- the patch works well on a busy system: I see a significant benefit with
hackbench with a lot of groups on my 2 nodes * 28 cores * 4 smt machine
    hackbench -l 2000 -g 128
tip 3.334
w/ patch 2.978 (+12%)

- Aubrey also tried to not scan the cpus which are idle only for a short
duration (when the tick is not stopped) but there are problems because not
stopping the tick doesn't mean a short idle. In fact, something similar
to find_idlest_group_cpu() should be done in this case but then it's
no longer a fast path search

Mel wants to minimize looping over the cpus
- patch 4 is an obvious win on a lightly loaded system because it avoids
looping twice over the cpus mask when some cpus are idle but no core is idle
- but patch 3 generates a perf regression
    hackbench -l 2000 -g 1
tip 12.293
/w all Mel's patches 15.163 -14%
/w Aubrey + Mel patches minus patch 3 : 12.788 +3.8% But I think that
Aubrey's patch doesn't help here. Tests without Aubrey's patch are
running.

- he also tried to use load_balance_mask to do something similar to the below


>
> ---
>  fair.c |  124 ++++++++++++++++++++++++-----------------------------------------
>  1 file changed, 47 insertions(+), 77 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6021,11 +6021,13 @@ static inline void set_idle_cores(int cp
>
>  static inline bool test_idle_cores(int cpu, bool def)
>  {
> -       struct sched_domain_shared *sds;
> +       if (static_branch_likely(&sched_smt_present)) {
> +               struct sched_domain_shared *sds;
>
> -       sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> -       if (sds)
> -               return READ_ONCE(sds->has_idle_cores);
> +               sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> +               if (sds)
> +                       return READ_ONCE(sds->has_idle_cores);
> +       }
>
>         return def;
>  }
> @@ -6059,77 +6061,39 @@ void __update_idle_core(struct rq *rq)
>         rcu_read_unlock();
>  }
>
> -/*
> - * Scan the entire LLC domain for idle cores; this dynamically switches off if
> - * there are no idle cores left in the system; tracked through
> - * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
> - */
> -static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
> +static int __select_idle_core(int core, struct cpumask *cpus, int *idle_cpu)
>  {
> -       struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> -       int core, cpu;
> -
> -       if (!static_branch_likely(&sched_smt_present))
> -               return -1;
> -
> -       if (!test_idle_cores(target, false))
> -               return -1;
> -
> -       cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> -
> -       for_each_cpu_wrap(core, cpus, target) {
> -               bool idle = true;
> +       bool idle = true;
> +       int cpu;
>
> -               for_each_cpu(cpu, cpu_smt_mask(core)) {
> -                       if (!available_idle_cpu(cpu)) {
> -                               idle = false;
> -                               break;
> -                       }
> +       for_each_cpu(cpu, cpu_smt_mask(core)) {
> +               if (!available_idle_cpu(cpu)) {
> +                       idle = false;
> +                       continue;

By not breaking on the first non-idle cpu of the core, you will
largely increase the number of loops. On my system, it increases 4
times, from 28 up to 112 (one cpu per core vs all 4 SMT siblings of each core).

>                 }
> -
> -               if (idle)
> -                       return core;
> -
> -               cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
> +               if (idle_cpu)
> +                       *idle_cpu = cpu;
>         }
>
> -       /*
> -        * Failed to find an idle core; stop looking for one.
> -        */
> -       set_idle_cores(target, 0);
> +       if (idle)
> +               return core;
>
> +       cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
>         return -1;
>  }
>
> -/*
> - * Scan the local SMT mask for idle CPUs.
> - */
> -static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
> -{
> -       int cpu;
> -
> -       if (!static_branch_likely(&sched_smt_present))
> -               return -1;
> -
> -       for_each_cpu(cpu, cpu_smt_mask(target)) {
> -               if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
> -                   !cpumask_test_cpu(cpu, sched_domain_span(sd)))
> -                       continue;
> -               if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> -                       return cpu;
> -       }
> +#else /* CONFIG_SCHED_SMT */
>
> -       return -1;
> +static inline void set_idle_cores(int cpu, int val)
> +{
>  }
>
> -#else /* CONFIG_SCHED_SMT */
> -
> -static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
> +static inline bool test_idle_cores(int cpu, bool def)
>  {
> -       return -1;
> +       return def;
>  }
>
> -static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
> +static inline int __select_idle_core(int core, struct cpumask *cpus, int *idle_cpu)
>  {
>         return -1;
>  }
> @@ -6144,10 +6108,11 @@ static inline int select_idle_smt(struct
>  static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
>  {
>         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> +       int this = smp_processor_id();
> +       bool smt = test_idle_cores(this, false);
> +       int i, cpu, idle_cpu = -1, nr = INT_MAX;
>         struct sched_domain *this_sd;
>         u64 time;
> -       int this = smp_processor_id();
> -       int cpu, nr = INT_MAX;
>
>         this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>         if (!this_sd)
> @@ -6155,7 +6120,7 @@ static int select_idle_cpu(struct task_s
>
>         cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>
> -       if (sched_feat(SIS_PROP)) {
> +       if (sched_feat(SIS_PROP) && !smt) {
>                 u64 avg_cost, avg_idle, span_avg;
>
>                 /*
> @@ -6175,18 +6140,31 @@ static int select_idle_cpu(struct task_s
>         }
>
>         for_each_cpu_wrap(cpu, cpus, target) {
> -               if (!--nr)
> -                       return -1;
> -               if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> -                       break;
> +               if (smt) {

sds->has_idle_core being set doesn't mean that there actually is an
idle core, so for the same real HW state (i.e. no idle cores):
if sds->has_idle_core is set, we will loop over all cpus here and
never get a chance to return a sched_idle_cpu(),
but if sds->has_idle_core is clear, we will loop over a limited number
of cpus and test sched_idle_cpu().

> +                       i = __select_idle_core(cpu, cpus, &idle_cpu);
> +                       if ((unsigned)i < nr_cpumask_bits)
> +                               return i;
> +
> +               } else {
> +                       if (!--nr)
> +                               return -1;
> +
> +                       if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
> +                               idle_cpu = cpu;
> +                               break;
> +                       }
> +               }
>         }
>
> -       if (sched_feat(SIS_PROP)) {
> +       if (smt)
> +               set_idle_cores(this, false);
> +
> +       if (sched_feat(SIS_PROP) && !smt) {
>                 time = cpu_clock(this) - time;
>                 update_avg(&this_sd->avg_scan_cost, time);
>         }
>
> -       return cpu;
> +       return idle_cpu;
>  }
>
>  /*
> @@ -6315,18 +6293,10 @@ static int select_idle_sibling(struct ta
>         if (!sd)
>                 return target;
>
> -       i = select_idle_core(p, sd, target);
> -       if ((unsigned)i < nr_cpumask_bits)
> -               return i;
> -
>         i = select_idle_cpu(p, sd, target);
>         if ((unsigned)i < nr_cpumask_bits)
>                 return i;
>
> -       i = select_idle_smt(p, sd, target);
> -       if ((unsigned)i < nr_cpumask_bits)
> -               return i;
> -
>         return target;
>  }
>

* Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
  2020-12-14  8:11                   ` Vincent Guittot
@ 2020-12-14  9:31                     ` Peter Zijlstra
  2020-12-14 12:36                       ` Mel Gorman
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2020-12-14  9:31 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Mel Gorman, Li, Aubrey, Ingo Molnar, Juri Lelli,
	Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, linux-kernel, Mel Gorman,
	Jiang Biao

On Mon, Dec 14, 2020 at 09:11:29AM +0100, Vincent Guittot wrote:
> On Fri, 11 Dec 2020 at 23:50, Mel Gorman <mgorman@techsingularity.net> wrote:

> > I originally did something like that on purpose too but Vincent called
> > it out so it is worth mentioning now to avoid surprises. That said, I'm
> > at the point where anything SIS-related simply needs excessive exposure
> > because no matter what it does, someone is unhappy with it.
> 
> Yeah, I don't like extending the idle core search loop for something
> that is not related to the idle core search. This adds more work outside
> the control of SIS_PROP. So I'm still against hiding an idle cpu search
> in the idle core search.

The idea, of course, is to do less. The current code is pretty crap in
that it will do a whole bunch of things multiple times.

Also, a possible follow-up would be something like the below (and
remove all the sds->has_idle_cores crud), which brings core scanning
under SIS_PROP.

But it all needs lots of benchmarking :/

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6095,6 +6095,9 @@ static inline bool test_idle_cores(int c
 
 static inline int __select_idle_core(int core, struct cpumask *cpus, int *idle_cpu)
 {
+	if (idle_cpu && (available_idle_cpu(core) || sched_idle_cpu(core)))
+		*idle_cpu = core;
+
 	return -1;
 }
 
@@ -6109,7 +6112,6 @@ static int select_idle_cpu(struct task_s
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	int this = smp_processor_id();
-	bool smt = test_idle_cores(this, false);
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
 	struct sched_domain *this_sd;
 	u64 time;
@@ -6120,7 +6122,7 @@ static int select_idle_cpu(struct task_s
 
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
-	if (sched_feat(SIS_PROP) && !smt) {
+	if (sched_feat(SIS_PROP)) {
 		u64 avg_cost, avg_idle, span_avg;
 
 		/*
@@ -6140,26 +6142,17 @@ static int select_idle_cpu(struct task_s
 	}
 
 	for_each_cpu_wrap(cpu, cpus, target) {
-		if (smt) {
-			i = __select_idle_core(cpu, cpus, &idle_cpu);
-			if ((unsigned)i < nr_cpumask_bits)
-				return i;
-
-		} else {
-			if (!--nr)
-				return -1;
-
-			if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
-				idle_cpu = cpu;
-				break;
-			}
+		if (!--nr)
+			break;
+
+		i = __select_idle_core(cpu, cpus, &idle_cpu);
+		if ((unsigned)i < nr_cpumask_bits) {
+			idle_cpu = i;
+			break;
 		}
 	}
 
-	if (smt)
-		set_idle_cores(this, false);
-
-	if (sched_feat(SIS_PROP) && !smt) {
+	if (sched_feat(SIS_PROP)) {
 		time = cpu_clock(this) - time;
 		update_avg(&this_sd->avg_scan_cost, time);
 	}

* Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
  2020-12-11 22:50                 ` Mel Gorman
  2020-12-14  8:11                   ` Vincent Guittot
@ 2020-12-14  9:32                   ` Peter Zijlstra
  1 sibling, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2020-12-14  9:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Li, Aubrey, mingo, juri.lelli, vincent.guittot,
	valentin.schneider, qais.yousef, dietmar.eggemann, rostedt,
	bsegall, tim.c.chen, linux-kernel, Mel Gorman, Jiang Biao

On Fri, Dec 11, 2020 at 10:50:02PM +0000, Mel Gorman wrote:

> > > The third potential downside is that the SMT sibling is not guaranteed to
> > > be checked due to SIS_PROP throttling but in the old code, that would have
> > > been checked by select_idle_smt(). That might result in premature stacking
> > > of runnable tasks on the same CPU. Similarly, as __select_idle_core may
> > > find multiple idle candidates, it will not pick the target's SMT sibling
> > > if it is idle like select_idle_smt would have.
> > > 
> > > That said, I am skeptical that select_idle_smt() matters all that often.
> > 
> > This, I didn't really believe in it either.
> > 
> 
> Good because I think any benefit from select_idle_smt is so marginal
> that it should be ignored if the full scan is simpler overall.

Perhaps we should start out with a simple patch removing that pass.
That should show what, if anything, the effect of it is.
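
As a minimal, untested sketch of that (the select_idle_smt() definition
and its !CONFIG_SCHED_SMT stub would go away along with the call site):

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ select_idle_sibling() @@
 	i = select_idle_cpu(p, sd, target);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
-	i = select_idle_smt(p, sd, target);
-	if ((unsigned)i < nr_cpumask_bits)
-		return i;
-
 	return target;
 }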

* Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
  2020-12-14  9:31                     ` Peter Zijlstra
@ 2020-12-14 12:36                       ` Mel Gorman
  2020-12-14 15:01                         ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2020-12-14 12:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Li, Aubrey, Ingo Molnar, Juri Lelli,
	Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, linux-kernel, Mel Gorman,
	Jiang Biao

On Mon, Dec 14, 2020 at 10:31:22AM +0100, Peter Zijlstra wrote:
> On Mon, Dec 14, 2020 at 09:11:29AM +0100, Vincent Guittot wrote:
> > On Fri, 11 Dec 2020 at 23:50, Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > > I originally did something like that on purpose too but Vincent called
> > > it out so it is worth mentioning now to avoid surprises. That said, I'm
> > > at the point where anything SIS-related simply needs excessive exposure
> > > because no matter what it does, someone is unhappy with it.
> > 
> > Yeah, I don't like extending the idle core search loop for something
> > that is not related to the idle core search. This adds more work outside
> > the control of SIS_PROP. So I'm still against hiding an idle cpu search
> > in the idle core search.
> 
> The idea, of course, is to do less. The current code is pretty crap in
> that it will do a whole bunch of things multiple times.
> 

^^^ this

While there is some overhead when searching for an idle core to track
an idle sibling, it's better than double scanning when test_idle_cores()
returns a false positive (or it races with a parallel search that takes
the last idle core).

> Also, a possible follow-up would be something like the below (and
> remove all the sds->has_idle_cores crud), which brings core scanning
> under SIS_PROP.
> 

I'm less confident about this but admit I have no data. test_idle_cores()
becomes critical for hackbench once it saturates the machine as it'll
generally return a true positive.

Where test_idle_cores() causes problems is when domains are over 50%
busy: it returns false positives due to races and the work to find an
idle core is wasted. The flip side is that updating the has_idle_cores
information is also expensive in this case, as it causes lots of dirty
cache line bouncing, so maybe in practice it might be ok. It definitely
should be a separate patch that is tested on top of your first prototype
with the p->cpus_ptr check when picking an idle candidate.
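
One way to limit the bouncing, purely as an untested sketch and not
necessarily the right answer, would be to make set_idle_cores() skip
redundant writes so the shared line is only dirtied on a state change:

static inline void set_idle_cores(int cpu, int val)
{
	struct sched_domain_shared *sds;

	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
	/* Only dirty the shared cache line when the value changes. */
	if (sds && READ_ONCE(sds->has_idle_cores) != val)
		WRITE_ONCE(sds->has_idle_cores, val);
}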

The other side-effect is that with this patch the scan cost is
*always* accounted for. This makes intuitive sense, as it was never
clear to me why it was only accounted on scan failures, but when I
tested something like this before, it was a mix of wins and losses.
At minimum, a patch that always accounts for scan cost and one that
removes test_idle_cores() should be separate patches for bisection
purposes.

This is the current set of results I have for your prototype plus the
fixes I suggested on top

http://www.skynet.ie/~mel/postings/peterz-20201214/dashboard.html

It's not a universal win but appears to win more than it loses

The biggest loss is dbench on EPYC 2

http://www.skynet.ie/~mel/postings/peterz-20201214/io-dbench4-async-xfs/romulus/index.html#dbench4

It's not clear why it was so badly affected but, in general, EPYC can
be problematic as it has multiple small LLCs. The same machine showed
large gains for specjbb.

http://www.skynet.ie/~mel/postings/peterz-20201214/jvm-specjbb2005-multi/romulus/index.html#specjbb

There are a lot of results to trawl through but mostly it shows that
it's a mix of wins and losses that is both workload and machine
dependent, which is generally true for anything select_idle_sibling
related.

As the merge window is open, it's inevitable that this will need to be
evaluated against 5.11-rc1 when all the current batch of scheduler code
has been merged. Do you mind splitting your prototype into three patches
and slapping some sort of changelog on them? I'll run them through the
grid with the p->recent_used_cpu changes on top to use recent_used_cpu
as a search hint instead of an idle candidate, so that it scans for a
core. They'll take a while to run but it's automated and some people
are going to be dropping off for holidays relatively soon anyway. I can
test on arm too but as it does not have SMT enabled, it'll be less
useful.

-- 
Mel Gorman
SUSE Labs

* Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
  2020-12-14  9:18             ` Vincent Guittot
@ 2020-12-14 12:42               ` Mel Gorman
  0 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2020-12-14 12:42 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Li, Aubrey, Ingo Molnar, Juri Lelli,
	Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, linux-kernel, Mel Gorman,
	Jiang Biao

On Mon, Dec 14, 2020 at 10:18:16AM +0100, Vincent Guittot wrote:
> On Fri, 11 Dec 2020 at 18:45, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, Dec 10, 2020 at 12:58:33PM +0000, Mel Gorman wrote:
> > > The prerequisite patch to make that approach work was rejected though
> > > as on its own, it's not very helpful and Vincent didn't like that the
> > > load_balance_mask was abused to make it effective.
> >
> > So last time I poked at all this, I found that using more masks was a
> > performance issue as well due to added cache misses.
> >
> > Anyway, I finally found some time to look at all this, and while reading
> > over the whole SiS crud to refresh the brain I came up with the below...
> >
> > It's still naff, but at least it gets rid of a bunch of duplicate
> > scanning and the LoC count decreases, so it should be awesome. Of course,
> > as always, benchmarks will totally ruin everything; we'll see, I started
> > some.
> >
> > It goes on top of Mel's first two patches (which is about as far as I
> > got)...
> 
> We have several different things that Aubrey's and Mel's patches are
> trying to achieve:
> 
> Aubrey wants to avoid scanning busy cpus
> - the patch works well on a busy system: I see a significant benefit with
> hackbench with a lot of groups on my 2 nodes * 28 cores * 4 SMT
>     hackbench -l 2000 -g 128
> tip 3.334
> w/ patch 2.978 (+12%)
> 

It's still the case that Aubrey's work does not conflict with what Peter
is doing. All that changes is what mask is applied.

Similarly, the p->recent_used_cpu changes I made initially (but got
rejected) reduce scanning. I intend to revisit that to use recent_used_cpu
as a search target, because one side-effect of the patch was that
siblings could be used prematurely.
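
Roughly, the idea is something like the below (illustrative only, using
the locals already in select_idle_sibling()), where recent_used_cpu only
seeds the scan instead of being returned directly:

	/* Sketch: treat the recently used CPU as a scan hint only. */
	recent_used_cpu = p->recent_used_cpu;
	if (recent_used_cpu != prev && recent_used_cpu != target &&
	    cpus_share_cache(recent_used_cpu, target) &&
	    cpumask_test_cpu(recent_used_cpu, p->cpus_ptr))
		target = recent_used_cpu;

	i = select_idle_cpu(p, sd, target);
	if ((unsigned)i < nr_cpumask_bits)
		return i;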

> - Aubrey also tried not to scan the cpus which are idle for a short
> duration (when the tick is not stopped) but there are problems because
> not stopping the tick doesn't mean a short idle. In fact, something
> similar to find_idlest_group_cpu() should be done in this case but then
> it's no longer a fast path search
> 

The crowd most likely to complain about this is Facebook as they have
workloads that prefer deep searches to find idle CPUs.

> Mel wants to minimize looping over the cpus
> - patch 4 is an obvious win on a lightly loaded system because it avoids
> looping over the cpus mask twice when some cpus are idle but no core is

Peter's patch replaces that entirely, which I'm fine with.

> - But patch 3 generates a perf regression:
>     hackbench -l 2000 -g 1
> tip 12.293
> w/ all Mel's patches 15.163 (-14%)
> w/ Aubrey + Mel patches minus patch 3: 12.788 (+3.8%). But I think that
> Aubrey's patch doesn't help here. Tests without Aubrey's patch are
> running
> 
> - He also tried to use load_balance_mask to do something similar to the below
> 

Peter's approach removes the load_balance_mask abuse, so I think it's a
better approach overall for scanning LLC domains in a single pass. It
needs refinement and a lot of testing but I think it's promising.

> > -static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
> > +static int __select_idle_core(int core, struct cpumask *cpus, int *idle_cpu)
> >  {
> > -       struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> > -       int core, cpu;
> > -
> > -       if (!static_branch_likely(&sched_smt_present))
> > -               return -1;
> > -
> > -       if (!test_idle_cores(target, false))
> > -               return -1;
> > -
> > -       cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > -
> > -       for_each_cpu_wrap(core, cpus, target) {
> > -               bool idle = true;
> > +       bool idle = true;
> > +       int cpu;
> >
> > -               for_each_cpu(cpu, cpu_smt_mask(core)) {
> > -                       if (!available_idle_cpu(cpu)) {
> > -                               idle = false;
> > -                               break;
> > -                       }
> > +       for_each_cpu(cpu, cpu_smt_mask(core)) {
> > +               if (!available_idle_cpu(cpu)) {
> > +                       idle = false;
> > +                       continue;
> 
> By not breaking on the first non-idle cpu of the core, you will
> largely increase the number of loops. On my system, it increases 4
> times, from 28 up to 112.
> 

But breaking early will mean that SMT siblings are used prematurely.
However, if SIS_PROP was partially enforced for the idle core scan, it
could limit the search for an idle core once an idle candidate was
discovered.
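
As a rough, untested sketch of that idea in the merged loop, reusing the
SIS_PROP nr budget once any idle candidate is known:

	for_each_cpu_wrap(cpu, cpus, target) {
		i = __select_idle_core(cpu, cpus, &idle_cpu);
		if ((unsigned)i < nr_cpumask_bits)
			return i;

		/*
		 * Once an idle CPU has been found, let the SIS_PROP
		 * budget cap how many more cores are scanned for a
		 * fully idle one instead of walking the whole LLC.
		 */
		if (idle_cpu != -1 && !--nr)
			break;
	}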

-- 
Mel Gorman
SUSE Labs

* Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
  2020-12-14 12:36                       ` Mel Gorman
@ 2020-12-14 15:01                         ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2020-12-14 15:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Li, Aubrey, Ingo Molnar, Juri Lelli,
	Valentin Schneider, Qais Yousef, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Tim Chen, linux-kernel, Mel Gorman,
	Jiang Biao

On Mon, Dec 14, 2020 at 12:36:32PM +0000, Mel Gorman wrote:
> As the merge window is open, it's inevitable that this will need to be
> evaluated against 5.11-rc1 when all the current batch of scheduler code
> has been merged. Do you mind splitting your prototype into three patches
> and slapping some sort of changelog on them?

Sure, I'm afraid it's a little more than 3, but I'll get on it.

end of thread

Thread overview: 22+ messages
2020-12-09  6:24 [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup Aubrey Li
2020-12-09  8:15 ` Vincent Guittot
2020-12-09 10:58   ` Li, Aubrey
2020-12-09 13:09     ` Vincent Guittot
2020-12-09 14:53       ` Li, Aubrey
2020-12-09 14:36 ` Mel Gorman
2020-12-10  8:23   ` Li, Aubrey
2020-12-10 11:34     ` Mel Gorman
2020-12-10 12:21       ` Li, Aubrey
2020-12-10 12:58         ` Mel Gorman
2020-12-11 17:44           ` Peter Zijlstra
2020-12-11 20:43             ` Mel Gorman
2020-12-11 22:19               ` Peter Zijlstra
2020-12-11 22:50                 ` Mel Gorman
2020-12-14  8:11                   ` Vincent Guittot
2020-12-14  9:31                     ` Peter Zijlstra
2020-12-14 12:36                       ` Mel Gorman
2020-12-14 15:01                         ` Peter Zijlstra
2020-12-14  9:32                   ` Peter Zijlstra
2020-12-14  9:18             ` Vincent Guittot
2020-12-14 12:42               ` Mel Gorman
2020-12-14  7:53       ` Li, Aubrey
