[PATCH v2 0/2] sched/fair: Wake task within the cluster when possible

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/2] sched/fair: Wake task within the cluster when possible
@ 2022-01-26  8:09 Yicong Yang
  2022-01-26  8:09 ` [PATCH v2 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API Yicong Yang
  2022-01-26  8:09 ` [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
  0 siblings, 2 replies; 27+ messages in thread
From: Yicong Yang @ 2022-01-26  8:09 UTC (permalink / raw)
  To: peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen,
	gautham.shenoy, linux-kernel, linux-arm-kernel
  Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
	yangyicong, jonathan.cameron, ego, srikar, linuxarm, 21cnbao,
	song.bao.hua, guodong.xu

This is the follow-up work to support cluster scheduler. Previously
we have added cluster level in the scheduler for both ARM64[1] and
X86[2] to support load balance between clusters to bring more memory
bandwidth and decrease cache contention. This patchset, on the other
hand, takes care of wake-up path by giving CPUs within the same cluster
a try before scanning the whole LLC to benefit those tasks communicating
with each other.

[1] 778c558f49a2 ("sched: Add cluster scheduler level in core and related Kconfig for ARM64")
[2] 66558b730f25 ("sched: Add cluster scheduler level for x86")

Change since v1:
- regain the performance data based on v5.17-rc1
- rename cpus_share_cluster to cpus_share_resources per Vincent and Gautham, thanks!

Barry Song (2):
  sched: Add per_cpu cluster domain info and cpus_share_resources API
  sched/fair: Scan cluster before scanning LLC in wake-up path

 include/linux/sched/sd_flags.h |  7 ++++++
 include/linux/sched/topology.h |  8 +++++-
 kernel/sched/core.c            | 12 +++++++++
 kernel/sched/fair.c            | 46 +++++++++++++++++++++++++++++++---
 kernel/sched/sched.h           |  2 ++
 kernel/sched/topology.c        | 15 +++++++++++
 6 files changed, 85 insertions(+), 5 deletions(-)

-- 
2.24.0


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v2 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API
  2022-01-26  8:09 [PATCH v2 0/2] sched/fair: Wake task within the cluster when possible Yicong Yang
@ 2022-01-26  8:09 ` Yicong Yang
  2022-01-27 15:26   ` Gautham R. Shenoy
  2022-01-26  8:09 ` [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
  1 sibling, 1 reply; 27+ messages in thread
From: Yicong Yang @ 2022-01-26  8:09 UTC (permalink / raw)
  To: peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen,
	gautham.shenoy, linux-kernel, linux-arm-kernel
  Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
	yangyicong, jonathan.cameron, ego, srikar, linuxarm, 21cnbao,
	song.bao.hua, guodong.xu

From: Barry Song <song.bao.hua@hisilicon.com>

Add per-cpu cluster domain info and cpus_share_resources() API.
This is the preparation for the optimization of select_idle_cpu()
on platforms with cluster scheduler level.

Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
---
 include/linux/sched/sd_flags.h |  7 +++++++
 include/linux/sched/topology.h |  8 +++++++-
 kernel/sched/core.c            | 12 ++++++++++++
 kernel/sched/sched.h           |  2 ++
 kernel/sched/topology.c        | 15 +++++++++++++++
 5 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
index 57bde66d95f7..42ed454e8b18 100644
--- a/include/linux/sched/sd_flags.h
+++ b/include/linux/sched/sd_flags.h
@@ -109,6 +109,13 @@ SD_FLAG(SD_ASYM_CPUCAPACITY_FULL, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
  */
 SD_FLAG(SD_SHARE_CPUCAPACITY, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
 
+/*
+ * Domain members share CPU cluster (LLC tags or L2 cache)
+ *
+ * NEEDS_GROUPS: Clusters are shared between groups.
+ */
+SD_FLAG(SD_CLUSTER, SDF_NEEDS_GROUPS)
+
 /*
  * Domain members share CPU package resources (i.e. caches)
  *
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8054641c0a7b..2f84fdf00481 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -45,7 +45,7 @@ static inline int cpu_smt_flags(void)
 #ifdef CONFIG_SCHED_CLUSTER
 static inline int cpu_cluster_flags(void)
 {
-	return SD_SHARE_PKG_RESOURCES;
+	return SD_CLUSTER | SD_SHARE_PKG_RESOURCES;
 }
 #endif
 
@@ -177,6 +177,7 @@ cpumask_var_t *alloc_sched_domains(unsigned int ndoms);
 void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
 
 bool cpus_share_cache(int this_cpu, int that_cpu);
+bool cpus_share_resources(int this_cpu, int that_cpu);
 
 typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
 typedef int (*sched_domain_flags_f)(void);
@@ -230,6 +231,11 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu)
 	return true;
 }
 
+static inline bool cpus_share_resources(int this_cpu, int that_cpu)
+{
+	return true;
+}
+
 #endif	/* !CONFIG_SMP */
 
 #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 848eaa0efe0e..7b203a6d96b2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3746,6 +3746,18 @@ bool cpus_share_cache(int this_cpu, int that_cpu)
 	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
 }
 
+/*
+ * Whether CPUs are share cache resources, which means LLC on non-cluster
+ * machines and LLC tag or L2 on machines with clusters.
+ */
+bool cpus_share_resources(int this_cpu, int that_cpu)
+{
+	if (this_cpu == that_cpu)
+		return true;
+
+	return per_cpu(sd_share_id, this_cpu) == per_cpu(sd_share_id, that_cpu);
+}
+
 static inline bool ttwu_queue_cond(int cpu, int wake_flags)
 {
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index de53be905739..d04b342cc28d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1765,7 +1765,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
+DECLARE_PER_CPU(int, sd_share_id);
 DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d201a7052a29..408fede6e732 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -644,6 +644,8 @@ static void destroy_sched_domains(struct sched_domain *sd)
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
+DEFINE_PER_CPU(int, sd_share_id);
+DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
 DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
@@ -669,6 +671,18 @@ static void update_top_cache_domain(int cpu)
 	per_cpu(sd_llc_id, cpu) = id;
 	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
 
+	sd = lowest_flag_domain(cpu, SD_CLUSTER);
+	if (sd)
+		id = cpumask_first(sched_domain_span(sd));
+	rcu_assign_pointer(per_cpu(sd_cluster, cpu), sd);
+
+	/*
+	 * This assignment should be placed after the sd_llc_id as
+	 * we want this id equals to cluster id on cluster machines
+	 * but equals to LLC id on non-Cluster machines.
+	 */
+	per_cpu(sd_share_id, cpu) = id;
+
 	sd = lowest_flag_domain(cpu, SD_NUMA);
 	rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
 
@@ -1514,6 +1528,7 @@ static unsigned long __read_mostly *sched_numa_onlined_nodes;
  */
 #define TOPOLOGY_SD_FLAGS		\
 	(SD_SHARE_CPUCAPACITY	|	\
+	 SD_CLUSTER		|	\
 	 SD_SHARE_PKG_RESOURCES |	\
 	 SD_NUMA		|	\
 	 SD_ASYM_PACKING)
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-01-26  8:09 [PATCH v2 0/2] sched/fair: Wake task within the cluster when possible Yicong Yang
  2022-01-26  8:09 ` [PATCH v2 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API Yicong Yang
@ 2022-01-26  8:09 ` Yicong Yang
  2022-01-27  1:14   ` Tim Chen
  2022-01-27 15:41   ` Gautham R. Shenoy
  1 sibling, 2 replies; 27+ messages in thread
From: Yicong Yang @ 2022-01-26  8:09 UTC (permalink / raw)
  To: peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen,
	gautham.shenoy, linux-kernel, linux-arm-kernel
  Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
	yangyicong, jonathan.cameron, ego, srikar, linuxarm, 21cnbao,
	song.bao.hua, guodong.xu

From: Barry Song <song.bao.hua@hisilicon.com>

For platforms having clusters like Kunpeng920, CPUs within the same
cluster have lower latency when synchronizing and accessing shared
resources like cache. Thus, this patch tries to find an idle cpu
within the cluster of the target CPU before scanning the whole LLC
to gain lower latency.

Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this
patch doesn't consider SMT for this moment.

Testing has been done on Kunpeng920 by pinning tasks to one numa
and two numa. On Kunpeng920, Each numa has 8 clusters and each
cluster has 4 CPUs.

With this patch, We noticed enhancement on tbench within one
numa or cross two numa.

On numa 0:
                            5.17-rc1                patched
Hmean     1        324.73 (   0.00%)      378.01 *  16.41%*
Hmean     2        645.36 (   0.00%)      754.63 *  16.93%*
Hmean     4       1302.09 (   0.00%)     1507.54 *  15.78%*
Hmean     8       2612.03 (   0.00%)     2982.57 *  14.19%*
Hmean     16      5307.12 (   0.00%)     5886.66 *  10.92%*
Hmean     32      9354.22 (   0.00%)     9908.13 *   5.92%*
Hmean     64      7240.35 (   0.00%)     7278.78 *   0.53%*
Hmean     128     6186.40 (   0.00%)     6187.85 (   0.02%)

On numa 0-1:
                            5.17-rc1                patched
Hmean     1        320.01 (   0.00%)      378.44 *  18.26%*
Hmean     2        643.85 (   0.00%)      752.52 *  16.88%*
Hmean     4       1287.36 (   0.00%)     1505.62 *  16.95%*
Hmean     8       2564.60 (   0.00%)     2955.29 *  15.23%*
Hmean     16      5195.69 (   0.00%)     5814.74 *  11.91%*
Hmean     32      9769.16 (   0.00%)    10872.63 *  11.30%*
Hmean     64     15952.50 (   0.00%)    17281.98 *   8.33%*
Hmean     128    13113.77 (   0.00%)    13895.20 *   5.96%*
Hmean     256    10997.59 (   0.00%)    11244.69 *   2.25%*
Hmean     512    14623.60 (   0.00%)    15526.25 *   6.17%*

This will also help to improve the MySQL. With MySQL server
running on numa 0 and client running on numa 1, both QPS and
latency is imporved on read-write case:
                        5.17-rc1        patched
QPS-16threads        143333.2633    145077.4033(+1.22%)
QPS-24threads        195085.9367    202719.6133(+3.91%)
QPS-32threads        241165.6867      249020.74(+3.26%)
QPS-64threads        244586.8433    253387.7567(+3.60%)
avg-lat-16threads           2.23           2.19(+1.19%)
avg-lat-24threads           2.46           2.36(+3.79%)
avg-lat-36threads           2.66           2.57(+3.26%)
avg-lat-64threads           5.23           5.05(+3.44%)

Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
---
 kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5146163bfabb..2f84a933aedd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6262,12 +6262,46 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 
 #endif /* CONFIG_SCHED_SMT */
 
+#ifdef CONFIG_SCHED_CLUSTER
+/*
+ * Scan the cluster domain for idle CPUs and clear cluster cpumask after scanning
+ */
+static inline int scan_cluster(struct task_struct *p, int prev_cpu, int target)
+{
+	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+	struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target));
+	int cpu, idle_cpu;
+
+	/* TODO: Support SMT case while a machine with both cluster and SMT born */
+	if (!sched_smt_active() && sd) {
+		for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) {
+			idle_cpu = __select_idle_cpu(cpu, p);
+			if ((unsigned int)idle_cpu < nr_cpumask_bits)
+				return idle_cpu;
+		}
+
+		/* Don't ping-pong tasks in and out cluster frequently */
+		if (cpus_share_resources(target, prev_cpu))
+			return target;
+
+		cpumask_andnot(cpus, cpus, sched_domain_span(sd));
+	}
+
+	return -1;
+}
+#else
+static inline int scan_cluster(struct task_struct *p, int prev_cpu, int target)
+{
+	return -1;
+}
+#endif
+
 /*
  * Scan the LLC domain for idle CPUs; this is dynamically regulated by
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
  * average idle time for this rq (as found in rq->avg_idle).
  */
-static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
+static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int prev_cpu, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
@@ -6282,6 +6316,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
+	idle_cpu = scan_cluster(p, prev_cpu, target);
+	if ((unsigned int)idle_cpu < nr_cpumask_bits)
+		return idle_cpu;
+
 	if (sched_feat(SIS_PROP) && !has_idle_core) {
 		u64 avg_cost, avg_idle, span_avg;
 		unsigned long now = jiffies;
@@ -6416,7 +6454,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	/*
 	 * If the previous CPU is cache affine and idle, don't be stupid:
 	 */
-	if (prev != target && cpus_share_cache(prev, target) &&
+	if (prev != target && cpus_share_resources(prev, target) &&
 	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
 	    asym_fits_capacity(task_util, prev))
 		return prev;
@@ -6442,7 +6480,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	p->recent_used_cpu = prev;
 	if (recent_used_cpu != prev &&
 	    recent_used_cpu != target &&
-	    cpus_share_cache(recent_used_cpu, target) &&
+	    cpus_share_resources(recent_used_cpu, target) &&
 	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
 	    cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
 	    asym_fits_capacity(task_util, recent_used_cpu)) {
@@ -6483,7 +6521,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 		}
 	}
 
-	i = select_idle_cpu(p, sd, has_idle_core, target);
+	i = select_idle_cpu(p, sd, has_idle_core, prev, target);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-01-26  8:09 ` [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
@ 2022-01-27  1:14   ` Tim Chen
  2022-01-27  2:02     ` Yicong Yang
  2022-01-27 15:41   ` Gautham R. Shenoy
  1 sibling, 1 reply; 27+ messages in thread
From: Tim Chen @ 2022-01-27  1:14 UTC (permalink / raw)
  To: Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot,
	gautham.shenoy, linux-kernel, linux-arm-kernel
  Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
	jonathan.cameron, ego, srikar, linuxarm, 21cnbao, song.bao.hua,
	guodong.xu

On Wed, 2022-01-26 at 16:09 +0800, Yicong Yang wrote:
> From: Barry Song <song.bao.hua@hisilicon.com>
> 
> For platforms having clusters like Kunpeng920, CPUs within the same
> cluster have lower latency when synchronizing and accessing shared
> resources like cache. Thus, this patch tries to find an idle cpu
> within the cluster of the target CPU before scanning the whole LLC
> to gain lower latency.
> 
> Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this
> patch doesn't consider SMT for this moment.
> 
> Testing has been done on Kunpeng920 by pinning tasks to one numa
> and two numa. On Kunpeng920, Each numa has 8 clusters and each
> cluster has 4 CPUs.
> 
> With this patch, We noticed enhancement on tbench within one
> numa or cross two numa.
> 
> On numa 0:
>                             5.17-rc1                patched
> Hmean     1        324.73 (   0.00%)      378.01 *  16.41%*
> Hmean     2        645.36 (   0.00%)      754.63 *  16.93%*
> Hmean     4       1302.09 (   0.00%)     1507.54 *  15.78%*
> Hmean     8       2612.03 (   0.00%)     2982.57 *  14.19%*
> Hmean     16      5307.12 (   0.00%)     5886.66 *  10.92%*
> Hmean     32      9354.22 (   0.00%)     9908.13 *   5.92%*
> Hmean     64      7240.35 (   0.00%)     7278.78 *   0.53%*
> Hmean     128     6186.40 (   0.00%)     6187.85 (   0.02%)
> 
> On numa 0-1:
>                             5.17-rc1                patched
> Hmean     1        320.01 (   0.00%)      378.44 *  18.26%*
> Hmean     2        643.85 (   0.00%)      752.52 *  16.88%*
> Hmean     4       1287.36 (   0.00%)     1505.62 *  16.95%*
> Hmean     8       2564.60 (   0.00%)     2955.29 *  15.23%*
> Hmean     16      5195.69 (   0.00%)     5814.74 *  11.91%*
> Hmean     32      9769.16 (   0.00%)    10872.63 *  11.30%*
> Hmean     64     15952.50 (   0.00%)    17281.98 *   8.33%*
> Hmean     128    13113.77 (   0.00%)    13895.20 *   5.96%*
> Hmean     256    10997.59 (   0.00%)    11244.69 *   2.25%*
> Hmean     512    14623.60 (   0.00%)    15526.25 *   6.17%*
> 
> This will also help to improve the MySQL. With MySQL server
> running on numa 0 and client running on numa 1, both QPS and
> latency is imporved on read-write case:
>                         5.17-rc1        patched
> QPS-16threads        143333.2633    145077.4033(+1.22%)
> QPS-24threads        195085.9367    202719.6133(+3.91%)
> QPS-32threads        241165.6867      249020.74(+3.26%)
> QPS-64threads        244586.8433    253387.7567(+3.60%)
> avg-lat-16threads           2.23           2.19(+1.19%)
> avg-lat-24threads           2.46           2.36(+3.79%)
> avg-lat-36threads           2.66           2.57(+3.26%)
> avg-lat-64threads           5.23           5.05(+3.44%)
> 
> Tested-by: Yicong Yang <yangyicong@hisilicon.com>
> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> ---
>  kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++++++++
> ----
>  1 file changed, 42 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5146163bfabb..2f84a933aedd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6262,12 +6262,46 @@ static inline int select_idle_smt(struct
> task_struct *p, struct sched_domain *sd
>  
>  #endif /* CONFIG_SCHED_SMT */
>  
> +#ifdef CONFIG_SCHED_CLUSTER
> +/*
> + * Scan the cluster domain for idle CPUs and clear cluster cpumask
> after scanning
> + */
> +static inline int scan_cluster(struct task_struct *p, int prev_cpu,
> int target)
> +{
> +	struct cpumask *cpus =
> this_cpu_cpumask_var_ptr(select_idle_mask);
> +	struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster,
> target));
> +	int cpu, idle_cpu;
> +
> +	/* TODO: Support SMT case while a machine with both cluster and
> SMT born */

This is probably a clearer comment

	/* TODO: Support SMT system with cluster topology */

> +	if (!sched_smt_active() && sd) {
> +		for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) {
> +			idle_cpu = __select_idle_cpu(cpu, p);
>   */
> -static int select_idle_cpu(struct task_struct *p, struct
> sched_domain *sd, bool has_idle_core, int target)
> +static int select_idle_cpu(struct task_struct *p, struct
> sched_domain *sd, bool has_idle_core, int prev_cpu, int target)
>  {
>  	struct cpumask *cpus =
> this_cpu_cpumask_var_ptr(select_idle_mask);
>  	int i, cpu, idle_cpu = -1, nr = INT_MAX;
> @@ -6282,6 +6316,10 @@ static int select_idle_cpu(struct task_struct
> *p, struct sched_domain *sd, bool
>  
>  	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>  
> +	idle_cpu = scan_cluster(p, prev_cpu, target);

Shouldn't "cpus" from 

cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

be passed to scan_cluster, to make sure that the cpu returned is 
in the affinity mask of the task? I don't see p->cpus_ptr
being checked in scan_cluster to make sure the cpu found is in the
affinity mask.

Tim


> +	if ((unsigned int)idle_cpu < nr_cpumask_bits)
> +		return idle_cpu;
> +
>  	if (sched_feat(SIS_PROP) && !has_idle_core) {
>  		u64 avg_cost, avg_idle, span_avg;
>  		unsigned long now = jiffies;
> @@ -6416,7 +6454,7 @@ static int select_idle_sibling(struct
> task_struct *p, int prev, int target)
>  	/*
>  	 * If the previous CPU is cache affine and idle, don't be
> stupid:
>  	 */
> -	if (prev != target && cpus_share_cache(prev, target) &&
> +	if (prev != target && cpus_share_resources(prev, target) &&
>  	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
>  	    asym_fits_capacity(task_util, prev))
>  		return prev;
> @@ -6442,7 +6480,7 @@ static int select_idle_sibling(struct
> task_struct *p, int prev, int target)
>  	p->recent_used_cpu = prev;
>  	if (recent_used_cpu != prev &&
>  	    recent_used_cpu != target &&
> -	    cpus_share_cache(recent_used_cpu, target) &&
> +	    cpus_share_resources(recent_used_cpu, target) &&
>  	    (available_idle_cpu(recent_used_cpu) ||
> sched_idle_cpu(recent_used_cpu)) &&
>  	    cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
>  	    asym_fits_capacity(task_util, recent_used_cpu)) {
> @@ -6483,7 +6521,7 @@ static int select_idle_sibling(struct
> task_struct *p, int prev, int target)
>  		}
>  	}
>  
> -	i = select_idle_cpu(p, sd, has_idle_core, target);
> +	i = select_idle_cpu(p, sd, has_idle_core, prev, target);
>  	if ((unsigned)i < nr_cpumask_bits)
>  		return i;
>  


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-01-27  1:14   ` Tim Chen
@ 2022-01-27  2:02     ` Yicong Yang
  2022-01-27  2:30       ` Tim Chen
  0 siblings, 1 reply; 27+ messages in thread
From: Yicong Yang @ 2022-01-27  2:02 UTC (permalink / raw)
  To: Tim Chen, Yicong Yang, peterz, mingo, juri.lelli,
	vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel
  Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
	jonathan.cameron, ego, srikar, linuxarm, 21cnbao, song.bao.hua,
	guodong.xu

On 2022/1/27 9:14, Tim Chen wrote:
> On Wed, 2022-01-26 at 16:09 +0800, Yicong Yang wrote:
>> From: Barry Song <song.bao.hua@hisilicon.com>
>>
>> For platforms having clusters like Kunpeng920, CPUs within the same
>> cluster have lower latency when synchronizing and accessing shared
>> resources like cache. Thus, this patch tries to find an idle cpu
>> within the cluster of the target CPU before scanning the whole LLC
>> to gain lower latency.
>>
>> Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this
>> patch doesn't consider SMT for this moment.
>>
>> Testing has been done on Kunpeng920 by pinning tasks to one numa
>> and two numa. On Kunpeng920, Each numa has 8 clusters and each
>> cluster has 4 CPUs.
>>
>> With this patch, We noticed enhancement on tbench within one
>> numa or cross two numa.
>>
>> On numa 0:
>>                             5.17-rc1                patched
>> Hmean     1        324.73 (   0.00%)      378.01 *  16.41%*
>> Hmean     2        645.36 (   0.00%)      754.63 *  16.93%*
>> Hmean     4       1302.09 (   0.00%)     1507.54 *  15.78%*
>> Hmean     8       2612.03 (   0.00%)     2982.57 *  14.19%*
>> Hmean     16      5307.12 (   0.00%)     5886.66 *  10.92%*
>> Hmean     32      9354.22 (   0.00%)     9908.13 *   5.92%*
>> Hmean     64      7240.35 (   0.00%)     7278.78 *   0.53%*
>> Hmean     128     6186.40 (   0.00%)     6187.85 (   0.02%)
>>
>> On numa 0-1:
>>                             5.17-rc1                patched
>> Hmean     1        320.01 (   0.00%)      378.44 *  18.26%*
>> Hmean     2        643.85 (   0.00%)      752.52 *  16.88%*
>> Hmean     4       1287.36 (   0.00%)     1505.62 *  16.95%*
>> Hmean     8       2564.60 (   0.00%)     2955.29 *  15.23%*
>> Hmean     16      5195.69 (   0.00%)     5814.74 *  11.91%*
>> Hmean     32      9769.16 (   0.00%)    10872.63 *  11.30%*
>> Hmean     64     15952.50 (   0.00%)    17281.98 *   8.33%*
>> Hmean     128    13113.77 (   0.00%)    13895.20 *   5.96%*
>> Hmean     256    10997.59 (   0.00%)    11244.69 *   2.25%*
>> Hmean     512    14623.60 (   0.00%)    15526.25 *   6.17%*
>>
>> This will also help to improve the MySQL. With MySQL server
>> running on numa 0 and client running on numa 1, both QPS and
>> latency is imporved on read-write case:
>>                         5.17-rc1        patched
>> QPS-16threads        143333.2633    145077.4033(+1.22%)
>> QPS-24threads        195085.9367    202719.6133(+3.91%)
>> QPS-32threads        241165.6867      249020.74(+3.26%)
>> QPS-64threads        244586.8433    253387.7567(+3.60%)
>> avg-lat-16threads           2.23           2.19(+1.19%)
>> avg-lat-24threads           2.46           2.36(+3.79%)
>> avg-lat-36threads           2.66           2.57(+3.26%)
>> avg-lat-64threads           5.23           5.05(+3.44%)
>>
>> Tested-by: Yicong Yang <yangyicong@hisilicon.com>
>> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
>> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
>> ---
>>  kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++++++++
>> ----
>>  1 file changed, 42 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 5146163bfabb..2f84a933aedd 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6262,12 +6262,46 @@ static inline int select_idle_smt(struct
>> task_struct *p, struct sched_domain *sd
>>  
>>  #endif /* CONFIG_SCHED_SMT */
>>  
>> +#ifdef CONFIG_SCHED_CLUSTER
>> +/*
>> + * Scan the cluster domain for idle CPUs and clear cluster cpumask
>> after scanning
>> + */
>> +static inline int scan_cluster(struct task_struct *p, int prev_cpu,
>> int target)
>> +{
>> +	struct cpumask *cpus =
>> this_cpu_cpumask_var_ptr(select_idle_mask);
>> +	struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster,
>> target));
>> +	int cpu, idle_cpu;
>> +
>> +	/* TODO: Support SMT case while a machine with both cluster and
>> SMT born */
> 
> This is probably a clearer comment
> 
> 	/* TODO: Support SMT system with cluster topology */
> 
>> +	if (!sched_smt_active() && sd) {
>> +		for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) {
>> +			idle_cpu = __select_idle_cpu(cpu, p);
>>   */
>> -static int select_idle_cpu(struct task_struct *p, struct
>> sched_domain *sd, bool has_idle_core, int target)
>> +static int select_idle_cpu(struct task_struct *p, struct
>> sched_domain *sd, bool has_idle_core, int prev_cpu, int target)
>>  {
>>  	struct cpumask *cpus =
>> this_cpu_cpumask_var_ptr(select_idle_mask);
>>  	int i, cpu, idle_cpu = -1, nr = INT_MAX;
>> @@ -6282,6 +6316,10 @@ static int select_idle_cpu(struct task_struct
>> *p, struct sched_domain *sd, bool
>>  
>>  	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>  
>> +	idle_cpu = scan_cluster(p, prev_cpu, target);
> 
> Shouldn't "cpus" from 
> 
> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> 
> be passed to scan_cluster, to make sure that the cpu returned is 
> in the affinity mask of the task? I don't see p->cpus_ptr
> being checked in scan_cluster to make sure the cpu found is in the
> affinity mask.
> 

The cpus scanned in scan_cluster() is the intersection of
select_idle_mask and sched_domain_span(cluster_sd), and
we limited the select_idle_mask in the tasks' affinity mask
before we enter scan_cluster() here.

Thanks.

> Tim
> 
> 
>> +	if ((unsigned int)idle_cpu < nr_cpumask_bits)
>> +		return idle_cpu;
>> +
>>  	if (sched_feat(SIS_PROP) && !has_idle_core) {
>>  		u64 avg_cost, avg_idle, span_avg;
>>  		unsigned long now = jiffies;
>> @@ -6416,7 +6454,7 @@ static int select_idle_sibling(struct
>> task_struct *p, int prev, int target)
>>  	/*
>>  	 * If the previous CPU is cache affine and idle, don't be
>> stupid:
>>  	 */
>> -	if (prev != target && cpus_share_cache(prev, target) &&
>> +	if (prev != target && cpus_share_resources(prev, target) &&
>>  	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
>>  	    asym_fits_capacity(task_util, prev))
>>  		return prev;
>> @@ -6442,7 +6480,7 @@ static int select_idle_sibling(struct
>> task_struct *p, int prev, int target)
>>  	p->recent_used_cpu = prev;
>>  	if (recent_used_cpu != prev &&
>>  	    recent_used_cpu != target &&
>> -	    cpus_share_cache(recent_used_cpu, target) &&
>> +	    cpus_share_resources(recent_used_cpu, target) &&
>>  	    (available_idle_cpu(recent_used_cpu) ||
>> sched_idle_cpu(recent_used_cpu)) &&
>>  	    cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
>>  	    asym_fits_capacity(task_util, recent_used_cpu)) {
>> @@ -6483,7 +6521,7 @@ static int select_idle_sibling(struct
>> task_struct *p, int prev, int target)
>>  		}
>>  	}
>>  
>> -	i = select_idle_cpu(p, sd, has_idle_core, target);
>> +	i = select_idle_cpu(p, sd, has_idle_core, prev, target);
>>  	if ((unsigned)i < nr_cpumask_bits)
>>  		return i;
>>  
> 
> .
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-01-27  2:02     ` Yicong Yang
@ 2022-01-27  2:30       ` Tim Chen
  2022-01-27  2:36         ` Tim Chen
  0 siblings, 1 reply; 27+ messages in thread
From: Tim Chen @ 2022-01-27  2:30 UTC (permalink / raw)
  To: Yicong Yang, Yicong Yang, peterz, mingo, juri.lelli,
	vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel
  Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
	jonathan.cameron, ego, srikar, linuxarm, 21cnbao, song.bao.hua,
	guodong.xu

On Thu, 2022-01-27 at 10:02 +0800, Yicong Yang wrote:
> On 2022/1/27 9:14, Tim Chen wrote:
> > On Wed, 2022-01-26 at 16:09 +0800, Yicong Yang wrote:
> > > From: Barry Song <song.bao.hua@hisilicon.com>
> > > 
> > > For platforms having clusters like Kunpeng920, CPUs within the
> > > same
> > > cluster have lower latency when synchronizing and accessing
> > > shared
> > > resources like cache. Thus, this patch tries to find an idle cpu
> > > within the cluster of the target CPU before scanning the whole
> > > LLC
> > > to gain lower latency.
> > > 
> > > Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this
> > > patch doesn't consider SMT for this moment.
> > > 
> > > Testing has been done on Kunpeng920 by pinning tasks to one numa
> > > and two numa. On Kunpeng920, Each numa has 8 clusters and each
> > > cluster has 4 CPUs.
> > > 
> > > With this patch, We noticed enhancement on tbench within one
> > > numa or cross two numa.
> > > 
> > > On numa 0:
> > >                             5.17-rc1                patched
> > > Hmean     1        324.73 (   0.00%)      378.01 *  16.41%*
> > > Hmean     2        645.36 (   0.00%)      754.63 *  16.93%*
> > > Hmean     4       1302.09 (   0.00%)     1507.54 *  15.78%*
> > > Hmean     8       2612.03 (   0.00%)     2982.57 *  14.19%*
> > > Hmean     16      5307.12 (   0.00%)     5886.66 *  10.92%*
> > > Hmean     32      9354.22 (   0.00%)     9908.13 *   5.92%*
> > > Hmean     64      7240.35 (   0.00%)     7278.78 *   0.53%*
> > > Hmean     128     6186.40 (   0.00%)     6187.85 (   0.02%)
> > > 
> > > On numa 0-1:
> > >                             5.17-rc1                patched
> > > Hmean     1        320.01 (   0.00%)      378.44 *  18.26%*
> > > Hmean     2        643.85 (   0.00%)      752.52 *  16.88%*
> > > Hmean     4       1287.36 (   0.00%)     1505.62 *  16.95%*
> > > Hmean     8       2564.60 (   0.00%)     2955.29 *  15.23%*
> > > Hmean     16      5195.69 (   0.00%)     5814.74 *  11.91%*
> > > Hmean     32      9769.16 (   0.00%)    10872.63 *  11.30%*
> > > Hmean     64     15952.50 (   0.00%)    17281.98 *   8.33%*
> > > Hmean     128    13113.77 (   0.00%)    13895.20 *   5.96%*
> > > Hmean     256    10997.59 (   0.00%)    11244.69 *   2.25%*
> > > Hmean     512    14623.60 (   0.00%)    15526.25 *   6.17%*
> > > 
> > > This will also help to improve the MySQL. With MySQL server
> > > running on numa 0 and client running on numa 1, both QPS and
> > > latency is imporved on read-write case:
> > >                         5.17-rc1        patched
> > > QPS-16threads        143333.2633    145077.4033(+1.22%)
> > > QPS-24threads        195085.9367    202719.6133(+3.91%)
> > > QPS-32threads        241165.6867      249020.74(+3.26%)
> > > QPS-64threads        244586.8433    253387.7567(+3.60%)
> > > avg-lat-16threads           2.23           2.19(+1.19%)
> > > avg-lat-24threads           2.46           2.36(+3.79%)
> > > avg-lat-36threads           2.66           2.57(+3.26%)
> > > avg-lat-64threads           5.23           5.05(+3.44%)
> > > 
> > > Tested-by: Yicong Yang <yangyicong@hisilicon.com>
> > > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> > > Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> > > ---
> > >  kernel/sched/fair.c | 46
> > > +++++++++++++++++++++++++++++++++++++++++
> > > ----
> > >  1 file changed, 42 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 5146163bfabb..2f84a933aedd 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -6262,12 +6262,46 @@ static inline int select_idle_smt(struct
> > > task_struct *p, struct sched_domain *sd
> > >  
> > >  #endif /* CONFIG_SCHED_SMT */
> > >  
> > > +#ifdef CONFIG_SCHED_CLUSTER
> > > +/*
> > > + * Scan the cluster domain for idle CPUs and clear cluster
> > > cpumask
> > > after scanning
> > > + */
> > > +static inline int scan_cluster(struct task_struct *p, int
> > > prev_cpu,
> > > int target)
> > > +{
> > > +	struct cpumask *cpus =
> > > this_cpu_cpumask_var_ptr(select_idle_mask);
> > > +	struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster,
> > > target));
> > > +	int cpu, idle_cpu;
> > > +
> > > +	/* TODO: Support SMT case while a machine with both cluster and
> > > SMT born */
> > 
> > This is probably a clearer comment
> > 
> > 	/* TODO: Support SMT system with cluster topology */
> > 
> > > +	if (!sched_smt_active() && sd) {
> > > +		for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) {
> > > +			idle_cpu = __select_idle_cpu(cpu, p);
> > >   */
> > > -static int select_idle_cpu(struct task_struct *p, struct
> > > sched_domain *sd, bool has_idle_core, int target)
> > > +static int select_idle_cpu(struct task_struct *p, struct
> > > sched_domain *sd, bool has_idle_core, int prev_cpu, int target)
> > >  {
> > >  	struct cpumask *cpus =
> > > this_cpu_cpumask_var_ptr(select_idle_mask);
> > >  	int i, cpu, idle_cpu = -1, nr = INT_MAX;
> > > @@ -6282,6 +6316,10 @@ static int select_idle_cpu(struct
> > > task_struct
> > > *p, struct sched_domain *sd, bool
> > >  
> > >  	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > >  
> > > +	idle_cpu = scan_cluster(p, prev_cpu, target);
> > 
> > Shouldn't "cpus" from 
> > 
> > cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > 
> > be passed to scan_cluster, to make sure that the cpu returned is 
> > in the affinity mask of the task? I don't see p->cpus_ptr
> > being checked in scan_cluster to make sure the cpu found is in the
> > affinity mask.
> > 
> 
> The cpus scanned in scan_cluster() is the intersection of
> select_idle_mask and sched_domain_span(cluster_sd), and
> we limited the select_idle_mask in the tasks' affinity mask
> before we enter scan_cluster() here.

Ah, I missed the fact that cpus point to the select_idle_mask.

Thanks.

Tim



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-01-27  2:30       ` Tim Chen
@ 2022-01-27  2:36         ` Tim Chen
  2022-01-27  3:05           ` Yicong Yang
  0 siblings, 1 reply; 27+ messages in thread
From: Tim Chen @ 2022-01-27  2:36 UTC (permalink / raw)
  To: Yicong Yang, Yicong Yang, peterz, mingo, juri.lelli,
	vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel
  Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
	jonathan.cameron, ego, srikar, linuxarm, 21cnbao, song.bao.hua,
	guodong.xu

On Wed, 2022-01-26 at 18:30 -0800, Tim Chen wrote:
> On Thu, 2022-01-27 at 10:02 +0800, Yicong Yang wrote:
> > On 2022/1/27 9:14, Tim Chen wrote:
> > > On Wed, 2022-01-26 at 16:09 +0800, Yicong Yang wrote:
> > > > From: Barry Song <song.bao.hua@hisilicon.com>
> > > > 
> > > > For platforms having clusters like Kunpeng920, CPUs within the
> > > > same
> > > > cluster have lower latency when synchronizing and accessing
> > > > shared
> > > > resources like cache. Thus, this patch tries to find an idle
> > > > cpu
> > > > within the cluster of the target CPU before scanning the whole
> > > > LLC
> > > > to gain lower latency.
> > > > 
> > > > Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so
> > > > this
> > > > patch doesn't consider SMT for this moment.
> > > > 
> > > > Testing has been done on Kunpeng920 by pinning tasks to one
> > > > numa
> > > > and two numa. On Kunpeng920, Each numa has 8 clusters and each
> > > > cluster has 4 CPUs.
> > > > 
> > > > With this patch, We noticed enhancement on tbench within one
> > > > numa or cross two numa.
> > > > 
> > > > On numa 0:
> > > >                             5.17-rc1                patched
> > > > Hmean     1        324.73 (   0.00%)      378.01 *  16.41%*
> > > > Hmean     2        645.36 (   0.00%)      754.63 *  16.93%*
> > > > Hmean     4       1302.09 (   0.00%)     1507.54 *  15.78%*
> > > > Hmean     8       2612.03 (   0.00%)     2982.57 *  14.19%*
> > > > Hmean     16      5307.12 (   0.00%)     5886.66 *  10.92%*
> > > > Hmean     32      9354.22 (   0.00%)     9908.13 *   5.92%*
> > > > Hmean     64      7240.35 (   0.00%)     7278.78 *   0.53%*
> > > > Hmean     128     6186.40 (   0.00%)     6187.85 (   0.02%)
> > > > 
> > > > On numa 0-1:
> > > >                             5.17-rc1                patched
> > > > Hmean     1        320.01 (   0.00%)      378.44 *  18.26%*
> > > > Hmean     2        643.85 (   0.00%)      752.52 *  16.88%*
> > > > Hmean     4       1287.36 (   0.00%)     1505.62 *  16.95%*
> > > > Hmean     8       2564.60 (   0.00%)     2955.29 *  15.23%*
> > > > Hmean     16      5195.69 (   0.00%)     5814.74 *  11.91%*
> > > > Hmean     32      9769.16 (   0.00%)    10872.63 *  11.30%*
> > > > Hmean     64     15952.50 (   0.00%)    17281.98 *   8.33%*
> > > > Hmean     128    13113.77 (   0.00%)    13895.20 *   5.96%*
> > > > Hmean     256    10997.59 (   0.00%)    11244.69 *   2.25%*
> > > > Hmean     512    14623.60 (   0.00%)    15526.25 *   6.17%*
> > > > 
> > > > This will also help to improve the MySQL. With MySQL server
> > > > running on numa 0 and client running on numa 1, both QPS and
> > > > latency is imporved on read-write case:
> > > >                         5.17-rc1        patched
> > > > QPS-16threads        143333.2633    145077.4033(+1.22%)
> > > > QPS-24threads        195085.9367    202719.6133(+3.91%)
> > > > QPS-32threads        241165.6867      249020.74(+3.26%)
> > > > QPS-64threads        244586.8433    253387.7567(+3.60%)
> > > > avg-lat-16threads           2.23           2.19(+1.19%)
> > > > avg-lat-24threads           2.46           2.36(+3.79%)
> > > > avg-lat-36threads           2.66           2.57(+3.26%)
> > > > avg-lat-64threads           5.23           5.05(+3.44%)
> > > > 
> > > > Tested-by: Yicong Yang <yangyicong@hisilicon.com>
> > > > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> > > > Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> > > > ---
> > > >  kernel/sched/fair.c | 46
> > > > +++++++++++++++++++++++++++++++++++++++++
> > > > ----
> > > >  1 file changed, 42 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 5146163bfabb..2f84a933aedd 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -6262,12 +6262,46 @@ static inline int
> > > > select_idle_smt(struct
> > > > task_struct *p, struct sched_domain *sd
> > > >  
> > > >  #endif /* CONFIG_SCHED_SMT */
> > > >  
> > > > +#ifdef CONFIG_SCHED_CLUSTER
> > > > +/*
> > > > + * Scan the cluster domain for idle CPUs and clear cluster
> > > > cpumask
> > > > after scanning
> > > > + */
> > > > +static inline int scan_cluster(struct task_struct *p, int
> > > > prev_cpu,
> > > > int target)
> > > > +{
> > > > +	struct cpumask *cpus =
> > > > this_cpu_cpumask_var_ptr(select_idle_mask);
> > > > +	struct sched_domain *sd =
> > > > rcu_dereference(per_cpu(sd_cluster,
> > > > target));
> > > > +	int cpu, idle_cpu;
> > > > +
> > > > +	/* TODO: Support SMT case while a machine with both
> > > > cluster and
> > > > SMT born */
> > > 
> > > This is probably a clearer comment
> > > 
> > > 	/* TODO: Support SMT system with cluster topology */
> > > 
> > > > +	if (!sched_smt_active() && sd) {
> > > > +		for_each_cpu_and(cpu, cpus,
> > > > sched_domain_span(sd)) {
> > > > +			idle_cpu = __select_idle_cpu(cpu, p);
> > > >   */
> > > > -static int select_idle_cpu(struct task_struct *p, struct
> > > > sched_domain *sd, bool has_idle_core, int target)
> > > > +static int select_idle_cpu(struct task_struct *p, struct
> > > > sched_domain *sd, bool has_idle_core, int prev_cpu, int target)
> > > >  {
> > > >  	struct cpumask *cpus =
> > > > this_cpu_cpumask_var_ptr(select_idle_mask);
> > > >  	int i, cpu, idle_cpu = -1, nr = INT_MAX;
> > > > @@ -6282,6 +6316,10 @@ static int select_idle_cpu(struct
> > > > task_struct
> > > > *p, struct sched_domain *sd, bool
> > > >  
> > > >  	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > > >  
> > > > +	idle_cpu = scan_cluster(p, prev_cpu, target);
> > > 
> > > Shouldn't "cpus" from 
> > > 
> > > cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > > 
> > > be passed to scan_cluster, to make sure that the cpu returned is 
> > > in the affinity mask of the task? I don't see p->cpus_ptr
> > > being checked in scan_cluster to make sure the cpu found is in
> > > the
> > > affinity mask.
> > > 
> > 
> > The cpus scanned in scan_cluster() is the intersection of
> > select_idle_mask and sched_domain_span(cluster_sd), and
> > we limited the select_idle_mask in the tasks' affinity mask
> > before we enter scan_cluster() here.
> 
> Ah, I missed the fact that cpus point to the select_idle_mask.
> 

I think it will be easier to read the code if you pass "cpus" directly
to scan cluster, rather than making this implicit, and having this
assignment 

*cpus = this_cpu_cpumask_var_ptr(select_idle_mask);

again in scan_cluster.

Tim



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-01-27  2:36         ` Tim Chen
@ 2022-01-27  3:05           ` Yicong Yang
  0 siblings, 0 replies; 27+ messages in thread
From: Yicong Yang @ 2022-01-27  3:05 UTC (permalink / raw)
  To: Tim Chen, Yicong Yang, peterz, mingo, juri.lelli,
	vincent.guittot, gautham.shenoy, linux-kernel, linux-arm-kernel
  Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
	jonathan.cameron, ego, srikar, linuxarm, 21cnbao, song.bao.hua,
	guodong.xu

On 2022/1/27 10:36, Tim Chen wrote:
> On Wed, 2022-01-26 at 18:30 -0800, Tim Chen wrote:
>> On Thu, 2022-01-27 at 10:02 +0800, Yicong Yang wrote:
>>> On 2022/1/27 9:14, Tim Chen wrote:
>>>> On Wed, 2022-01-26 at 16:09 +0800, Yicong Yang wrote:
>>>>> From: Barry Song <song.bao.hua@hisilicon.com>
>>>>>
>>>>> For platforms having clusters like Kunpeng920, CPUs within the
>>>>> same
>>>>> cluster have lower latency when synchronizing and accessing
>>>>> shared
>>>>> resources like cache. Thus, this patch tries to find an idle
>>>>> cpu
>>>>> within the cluster of the target CPU before scanning the whole
>>>>> LLC
>>>>> to gain lower latency.
>>>>>
>>>>> Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so
>>>>> this
>>>>> patch doesn't consider SMT for this moment.
>>>>>
>>>>> Testing has been done on Kunpeng920 by pinning tasks to one
>>>>> numa
>>>>> and two numa. On Kunpeng920, Each numa has 8 clusters and each
>>>>> cluster has 4 CPUs.
>>>>>
>>>>> With this patch, We noticed enhancement on tbench within one
>>>>> numa or cross two numa.
>>>>>
>>>>> On numa 0:
>>>>>                             5.17-rc1                patched
>>>>> Hmean     1        324.73 (   0.00%)      378.01 *  16.41%*
>>>>> Hmean     2        645.36 (   0.00%)      754.63 *  16.93%*
>>>>> Hmean     4       1302.09 (   0.00%)     1507.54 *  15.78%*
>>>>> Hmean     8       2612.03 (   0.00%)     2982.57 *  14.19%*
>>>>> Hmean     16      5307.12 (   0.00%)     5886.66 *  10.92%*
>>>>> Hmean     32      9354.22 (   0.00%)     9908.13 *   5.92%*
>>>>> Hmean     64      7240.35 (   0.00%)     7278.78 *   0.53%*
>>>>> Hmean     128     6186.40 (   0.00%)     6187.85 (   0.02%)
>>>>>
>>>>> On numa 0-1:
>>>>>                             5.17-rc1                patched
>>>>> Hmean     1        320.01 (   0.00%)      378.44 *  18.26%*
>>>>> Hmean     2        643.85 (   0.00%)      752.52 *  16.88%*
>>>>> Hmean     4       1287.36 (   0.00%)     1505.62 *  16.95%*
>>>>> Hmean     8       2564.60 (   0.00%)     2955.29 *  15.23%*
>>>>> Hmean     16      5195.69 (   0.00%)     5814.74 *  11.91%*
>>>>> Hmean     32      9769.16 (   0.00%)    10872.63 *  11.30%*
>>>>> Hmean     64     15952.50 (   0.00%)    17281.98 *   8.33%*
>>>>> Hmean     128    13113.77 (   0.00%)    13895.20 *   5.96%*
>>>>> Hmean     256    10997.59 (   0.00%)    11244.69 *   2.25%*
>>>>> Hmean     512    14623.60 (   0.00%)    15526.25 *   6.17%*
>>>>>
>>>>> This will also help to improve the MySQL. With MySQL server
>>>>> running on numa 0 and client running on numa 1, both QPS and
>>>>> latency is imporved on read-write case:
>>>>>                         5.17-rc1        patched
>>>>> QPS-16threads        143333.2633    145077.4033(+1.22%)
>>>>> QPS-24threads        195085.9367    202719.6133(+3.91%)
>>>>> QPS-32threads        241165.6867      249020.74(+3.26%)
>>>>> QPS-64threads        244586.8433    253387.7567(+3.60%)
>>>>> avg-lat-16threads           2.23           2.19(+1.19%)
>>>>> avg-lat-24threads           2.46           2.36(+3.79%)
>>>>> avg-lat-36threads           2.66           2.57(+3.26%)
>>>>> avg-lat-64threads           5.23           5.05(+3.44%)
>>>>>
>>>>> Tested-by: Yicong Yang <yangyicong@hisilicon.com>
>>>>> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
>>>>> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
>>>>> ---
>>>>>  kernel/sched/fair.c | 46
>>>>> +++++++++++++++++++++++++++++++++++++++++
>>>>> ----
>>>>>  1 file changed, 42 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index 5146163bfabb..2f84a933aedd 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -6262,12 +6262,46 @@ static inline int
>>>>> select_idle_smt(struct
>>>>> task_struct *p, struct sched_domain *sd
>>>>>  
>>>>>  #endif /* CONFIG_SCHED_SMT */
>>>>>  
>>>>> +#ifdef CONFIG_SCHED_CLUSTER
>>>>> +/*
>>>>> + * Scan the cluster domain for idle CPUs and clear cluster
>>>>> cpumask
>>>>> after scanning
>>>>> + */
>>>>> +static inline int scan_cluster(struct task_struct *p, int
>>>>> prev_cpu,
>>>>> int target)
>>>>> +{
>>>>> +	struct cpumask *cpus =
>>>>> this_cpu_cpumask_var_ptr(select_idle_mask);
>>>>> +	struct sched_domain *sd =
>>>>> rcu_dereference(per_cpu(sd_cluster,
>>>>> target));
>>>>> +	int cpu, idle_cpu;
>>>>> +
>>>>> +	/* TODO: Support SMT case while a machine with both
>>>>> cluster and
>>>>> SMT born */
>>>>
>>>> This is probably a clearer comment
>>>>
>>>> 	/* TODO: Support SMT system with cluster topology */
>>>>
>>>>> +	if (!sched_smt_active() && sd) {
>>>>> +		for_each_cpu_and(cpu, cpus,
>>>>> sched_domain_span(sd)) {
>>>>> +			idle_cpu = __select_idle_cpu(cpu, p);
>>>>>   */
>>>>> -static int select_idle_cpu(struct task_struct *p, struct
>>>>> sched_domain *sd, bool has_idle_core, int target)
>>>>> +static int select_idle_cpu(struct task_struct *p, struct
>>>>> sched_domain *sd, bool has_idle_core, int prev_cpu, int target)
>>>>>  {
>>>>>  	struct cpumask *cpus =
>>>>> this_cpu_cpumask_var_ptr(select_idle_mask);
>>>>>  	int i, cpu, idle_cpu = -1, nr = INT_MAX;
>>>>> @@ -6282,6 +6316,10 @@ static int select_idle_cpu(struct
>>>>> task_struct
>>>>> *p, struct sched_domain *sd, bool
>>>>>  
>>>>>  	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>>>>  
>>>>> +	idle_cpu = scan_cluster(p, prev_cpu, target);
>>>>
>>>> Shouldn't "cpus" from 
>>>>
>>>> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>>>
>>>> be passed to scan_cluster, to make sure that the cpu returned is 
>>>> in the affinity mask of the task? I don't see p->cpus_ptr
>>>> being checked in scan_cluster to make sure the cpu found is in
>>>> the
>>>> affinity mask.
>>>>
>>>
>>> The cpus scanned in scan_cluster() is the intersection of
>>> select_idle_mask and sched_domain_span(cluster_sd), and
>>> we limited the select_idle_mask in the tasks' affinity mask
>>> before we enter scan_cluster() here.
>>
>> Ah, I missed the fact that cpus point to the select_idle_mask.
>>
> 
> I think it will be easier to read the code if you pass "cpus" directly
> to scan cluster, rather than making this implicit, and having this
> assignment 
> 
> *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> 
> again in scan_cluster.

sure. It does look more readable and I think we can change to that. :)

Thanks.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API
  2022-01-26  8:09 ` [PATCH v2 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API Yicong Yang
@ 2022-01-27 15:26   ` Gautham R. Shenoy
  0 siblings, 0 replies; 27+ messages in thread
From: Gautham R. Shenoy @ 2022-01-27 15:26 UTC (permalink / raw)
  To: Yicong Yang
  Cc: peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen,
	linux-kernel, linux-arm-kernel, dietmar.eggemann, rostedt,
	bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar,
	linuxarm, 21cnbao, song.bao.hua, guodong.xu

Hello Yicong, Barry,
On Wed, Jan 26, 2022 at 04:09:46PM +0800, Yicong Yang wrote:
> From: Barry Song <song.bao.hua@hisilicon.com>
> 
> Add per-cpu cluster domain info and cpus_share_resources() API.
> This is the preparation for the optimization of select_idle_cpu()
> on platforms with cluster scheduler level.
> 
> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>

This patch looks good to me.

Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>

--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-01-26  8:09 ` [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
  2022-01-27  1:14   ` Tim Chen
@ 2022-01-27 15:41   ` Gautham R. Shenoy
  2022-01-27 20:21     ` Barry Song
  1 sibling, 1 reply; 27+ messages in thread
From: Gautham R. Shenoy @ 2022-01-27 15:41 UTC (permalink / raw)
  To: Yicong Yang
  Cc: peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen,
	linux-kernel, linux-arm-kernel, dietmar.eggemann, rostedt,
	bsegall, bristot, prime.zeng, jonathan.cameron, ego, srikar,
	linuxarm, 21cnbao, song.bao.hua, guodong.xu

On Wed, Jan 26, 2022 at 04:09:47PM +0800, Yicong Yang wrote:
> From: Barry Song <song.bao.hua@hisilicon.com>
> 
> For platforms having clusters like Kunpeng920, CPUs within the same
> cluster have lower latency when synchronizing and accessing shared
> resources like cache. Thus, this patch tries to find an idle cpu
> within the cluster of the target CPU before scanning the whole LLC
> to gain lower latency.
> 
> Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this
> patch doesn't consider SMT for this moment.
> 
> Testing has been done on Kunpeng920 by pinning tasks to one numa
> and two numa. On Kunpeng920, Each numa has 8 clusters and each
> cluster has 4 CPUs.
> 
> With this patch, We noticed enhancement on tbench within one
> numa or cross two numa.
> 
> On numa 0:
>                             5.17-rc1                patched
> Hmean     1        324.73 (   0.00%)      378.01 *  16.41%*
> Hmean     2        645.36 (   0.00%)      754.63 *  16.93%*
> Hmean     4       1302.09 (   0.00%)     1507.54 *  15.78%*
> Hmean     8       2612.03 (   0.00%)     2982.57 *  14.19%*
> Hmean     16      5307.12 (   0.00%)     5886.66 *  10.92%*
> Hmean     32      9354.22 (   0.00%)     9908.13 *   5.92%*
> Hmean     64      7240.35 (   0.00%)     7278.78 *   0.53%*
> Hmean     128     6186.40 (   0.00%)     6187.85 (   0.02%)
> 
> On numa 0-1:
>                             5.17-rc1                patched
> Hmean     1        320.01 (   0.00%)      378.44 *  18.26%*
> Hmean     2        643.85 (   0.00%)      752.52 *  16.88%*
> Hmean     4       1287.36 (   0.00%)     1505.62 *  16.95%*
> Hmean     8       2564.60 (   0.00%)     2955.29 *  15.23%*
> Hmean     16      5195.69 (   0.00%)     5814.74 *  11.91%*
> Hmean     32      9769.16 (   0.00%)    10872.63 *  11.30%*
> Hmean     64     15952.50 (   0.00%)    17281.98 *   8.33%*
> Hmean     128    13113.77 (   0.00%)    13895.20 *   5.96%*
> Hmean     256    10997.59 (   0.00%)    11244.69 *   2.25%*
> Hmean     512    14623.60 (   0.00%)    15526.25 *   6.17%*
> 
> This will also help to improve the MySQL. With MySQL server
> running on numa 0 and client running on numa 1, both QPS and
> latency is imporved on read-write case:
>                         5.17-rc1        patched
> QPS-16threads        143333.2633    145077.4033(+1.22%)
> QPS-24threads        195085.9367    202719.6133(+3.91%)
> QPS-32threads        241165.6867      249020.74(+3.26%)
> QPS-64threads        244586.8433    253387.7567(+3.60%)
> avg-lat-16threads           2.23           2.19(+1.19%)
> avg-lat-24threads           2.46           2.36(+3.79%)
> avg-lat-36threads           2.66           2.57(+3.26%)
> avg-lat-64threads           5.23           5.05(+3.44%)
> 
> Tested-by: Yicong Yang <yangyicong@hisilicon.com>
> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> ---
>  kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 42 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5146163bfabb..2f84a933aedd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6262,12 +6262,46 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
>  
>  #endif /* CONFIG_SCHED_SMT */
>  
> +#ifdef CONFIG_SCHED_CLUSTER
> +/*
> + * Scan the cluster domain for idle CPUs and clear cluster cpumask after scanning
> + */
> +static inline int scan_cluster(struct task_struct *p, int prev_cpu, int target)
> +{
> +	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> +	struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target));
> +	int cpu, idle_cpu;
> +
> +	/* TODO: Support SMT case while a machine with both cluster and SMT born */
> +	if (!sched_smt_active() && sd) {
> +		for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) {
> +			idle_cpu = __select_idle_cpu(cpu, p);
> +			if ((unsigned int)idle_cpu < nr_cpumask_bits)
> +				return idle_cpu;
> +		}
> +
> +		/* Don't ping-pong tasks in and out cluster frequently */
> +		if (cpus_share_resources(target, prev_cpu))
> +			return target;

We reach here when there aren't any idle CPUs within the
cluster. However there might be idle CPUs in the MC domain. Is a busy
@target preferable to a potentially idle CPU within the larger domain
?


--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-01-28  7:13       ` Srikar Dronamraju
@ 2022-01-27 18:40         ` Barry Song
  2022-02-01  9:38           ` Srikar Dronamraju
  0 siblings, 1 reply; 27+ messages in thread
From: Barry Song @ 2022-01-27 18:40 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Gautham R. Shenoy, Yicong Yang, Peter Zijlstra, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Tim Chen, LKML, LAK,
	Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, prime.zeng, Jonathan Cameron, ego,
	Linuxarm, Barry Song, Guodong Xu

On Fri, Jan 28, 2022 at 8:13 PM Srikar Dronamraju
<srikar@linux.vnet.ibm.com> wrote:
>
> * Barry Song <21cnbao@gmail.com> [2022-01-28 09:21:08]:
>
> > On Fri, Jan 28, 2022 at 4:41 AM Gautham R. Shenoy
> > <gautham.shenoy@amd.com> wrote:
> > >
> > > On Wed, Jan 26, 2022 at 04:09:47PM +0800, Yicong Yang wrote:
> > > > From: Barry Song <song.bao.hua@hisilicon.com>
> > > >
> > > > For platforms having clusters like Kunpeng920, CPUs within the same
> > > > cluster have lower latency when synchronizing and accessing shared
> > > > resources like cache. Thus, this patch tries to find an idle cpu
> > > > within the cluster of the target CPU before scanning the whole LLC
> > > > to gain lower latency.
> > > >
> > > > Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this
> > > > patch doesn't consider SMT for this moment.
> > > >
> > > > Testing has been done on Kunpeng920 by pinning tasks to one numa
> > > > and two numa. On Kunpeng920, Each numa has 8 clusters and each
> > > > cluster has 4 CPUs.
> > > >
> > > > With this patch, We noticed enhancement on tbench within one
> > > > numa or cross two numa.
> > > >
> > > > On numa 0:
> > > >                             5.17-rc1                patched
> > > > Hmean     1        324.73 (   0.00%)      378.01 *  16.41%*
> > > > Hmean     2        645.36 (   0.00%)      754.63 *  16.93%*
> > > > Hmean     4       1302.09 (   0.00%)     1507.54 *  15.78%*
> > > > Hmean     8       2612.03 (   0.00%)     2982.57 *  14.19%*
> > > > Hmean     16      5307.12 (   0.00%)     5886.66 *  10.92%*
> > > > Hmean     32      9354.22 (   0.00%)     9908.13 *   5.92%*
> > > > Hmean     64      7240.35 (   0.00%)     7278.78 *   0.53%*
> > > > Hmean     128     6186.40 (   0.00%)     6187.85 (   0.02%)
> > > >
> > > > On numa 0-1:
> > > >                             5.17-rc1                patched
> > > > Hmean     1        320.01 (   0.00%)      378.44 *  18.26%*
> > > > Hmean     2        643.85 (   0.00%)      752.52 *  16.88%*
> > > > Hmean     4       1287.36 (   0.00%)     1505.62 *  16.95%*
> > > > Hmean     8       2564.60 (   0.00%)     2955.29 *  15.23%*
> > > > Hmean     16      5195.69 (   0.00%)     5814.74 *  11.91%*
> > > > Hmean     32      9769.16 (   0.00%)    10872.63 *  11.30%*
> > > > Hmean     64     15952.50 (   0.00%)    17281.98 *   8.33%*
> > > > Hmean     128    13113.77 (   0.00%)    13895.20 *   5.96%*
> > > > Hmean     256    10997.59 (   0.00%)    11244.69 *   2.25%*
> > > > Hmean     512    14623.60 (   0.00%)    15526.25 *   6.17%*
> > > >
> > > > This will also help to improve the MySQL. With MySQL server
> > > > running on numa 0 and client running on numa 1, both QPS and
> > > > latency is imporved on read-write case:
> > > >                         5.17-rc1        patched
> > > > QPS-16threads        143333.2633    145077.4033(+1.22%)
> > > > QPS-24threads        195085.9367    202719.6133(+3.91%)
> > > > QPS-32threads        241165.6867      249020.74(+3.26%)
> > > > QPS-64threads        244586.8433    253387.7567(+3.60%)
> > > > avg-lat-16threads           2.23           2.19(+1.19%)
> > > > avg-lat-24threads           2.46           2.36(+3.79%)
> > > > avg-lat-36threads           2.66           2.57(+3.26%)
> > > > avg-lat-64threads           5.23           5.05(+3.44%)
> > > >
> > > > Tested-by: Yicong Yang <yangyicong@hisilicon.com>
> > > > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> > > > Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> > > > ---
> > > >  kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++++++++----
> > > >  1 file changed, 42 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 5146163bfabb..2f84a933aedd 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -6262,12 +6262,46 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
> > > >
> > > >  #endif /* CONFIG_SCHED_SMT */
> > > >
> > > > +#ifdef CONFIG_SCHED_CLUSTER
> > > > +/*
> > > > + * Scan the cluster domain for idle CPUs and clear cluster cpumask after scanning
> > > > + */
> > > > +static inline int scan_cluster(struct task_struct *p, int prev_cpu, int target)
> > > > +{
> > > > +     struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> > > > +     struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target));
> > > > +     int cpu, idle_cpu;
> > > > +
> > > > +     /* TODO: Support SMT case while a machine with both cluster and SMT born */
> > > > +     if (!sched_smt_active() && sd) {
> > > > +             for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) {
> > > > +                     idle_cpu = __select_idle_cpu(cpu, p);
> > > > +                     if ((unsigned int)idle_cpu < nr_cpumask_bits)
> > > > +                             return idle_cpu;
> > > > +             }
> > > > +
> > > > +             /* Don't ping-pong tasks in and out cluster frequently */
> > > > +             if (cpus_share_resources(target, prev_cpu))
> > > > +                     return target;
> > >
> > > We reach here when there aren't any idle CPUs within the
> > > cluster. However there might be idle CPUs in the MC domain. Is a busy
> > > @target preferable to a potentially idle CPU within the larger domain
> > > ?
> >
> > Hi Gautham,
> >
>
> Hi Barry,
>
>
> > My benchmark showed some performance regression while load was medium or above
> > if we grabbed idle cpu in and out the cluster. it turned out the
> > regression disappeared if
> > we blocked the ping-pong. so the logic here is that if we have scanned
> > and found an
> > idle cpu within the cluster before, we don't let the task jumping back
> > and forth frequently
> > as cache synchronization is higher cost. but the code still allows
> > scanning out of the cluster
> > if we haven't packed waker and wakee together yet.
> >
>
> Like what Gautham said, should we choose the same cluster if we find that
> there are no idle-cpus in the LLC? This way we avoid ping-pong if there are
> no idle-cpus but we still pick an idle-cpu to a busy cpu?

Hi Srikar,
I am sorry I didn't get your question. Currently the code works as below:
if task A wakes up task B, and task A is in LLC0 and task B is in LLC1.
we will scan the cluster of A before scanning the whole LLC0, in this case,
cluster of A is the closest sibling, so it is the better choice than other CPUs
which are in LLC0 but not in the cluster of A. But we do scan all cpus of LLC0
afterwards if we fail to find an idle CPU in the cluster.

After a while, if the cluster of A gets an idle CPU and pulls B into the
cluster, we prefer not pushing B out of the cluster of A again though
there might be an idle CPU outside. as benchmark shows getting an
idle CPU out of the cluster of A doesn't bring performance improvement
but performance decreases as B might be getting in and getting out
the cluster of A very frequently, then cache coherence ping-pong.

Note we are only returning target while if
(cpus_share_resources(target, prev_cpu))
is true. So we are not losing chance to pull B to the LLC of A while LLC0 has
an idle one.

>
> > it might not be a universal win in all kinds of workload. we saw
> > tbench, mysql benefit from
> > the whole change. but pgbench seems not always. so we are still on the
> > way to make possible
> > further tuning here.
> >
>
> > >
> > >
> > > --
> > > Thanks and Regards
> > > gautham.
> >
> > Thanks
> > Barry
>
> --
> Thanks and Regards
> Srikar Dronamraju

Thanks
Barry

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-01-27 15:41   ` Gautham R. Shenoy
@ 2022-01-27 20:21     ` Barry Song
  2022-01-28  7:13       ` Srikar Dronamraju
  0 siblings, 1 reply; 27+ messages in thread
From: Barry Song @ 2022-01-27 20:21 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Yicong Yang, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Tim Chen, LKML, LAK, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira,
	prime.zeng, Jonathan Cameron, ego, Srikar Dronamraju, Linuxarm,
	Barry Song, Guodong Xu

On Fri, Jan 28, 2022 at 4:41 AM Gautham R. Shenoy
<gautham.shenoy@amd.com> wrote:
>
> On Wed, Jan 26, 2022 at 04:09:47PM +0800, Yicong Yang wrote:
> > From: Barry Song <song.bao.hua@hisilicon.com>
> >
> > For platforms having clusters like Kunpeng920, CPUs within the same
> > cluster have lower latency when synchronizing and accessing shared
> > resources like cache. Thus, this patch tries to find an idle cpu
> > within the cluster of the target CPU before scanning the whole LLC
> > to gain lower latency.
> >
> > Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this
> > patch doesn't consider SMT for this moment.
> >
> > Testing has been done on Kunpeng920 by pinning tasks to one numa
> > and two numa. On Kunpeng920, Each numa has 8 clusters and each
> > cluster has 4 CPUs.
> >
> > With this patch, We noticed enhancement on tbench within one
> > numa or cross two numa.
> >
> > On numa 0:
> >                             5.17-rc1                patched
> > Hmean     1        324.73 (   0.00%)      378.01 *  16.41%*
> > Hmean     2        645.36 (   0.00%)      754.63 *  16.93%*
> > Hmean     4       1302.09 (   0.00%)     1507.54 *  15.78%*
> > Hmean     8       2612.03 (   0.00%)     2982.57 *  14.19%*
> > Hmean     16      5307.12 (   0.00%)     5886.66 *  10.92%*
> > Hmean     32      9354.22 (   0.00%)     9908.13 *   5.92%*
> > Hmean     64      7240.35 (   0.00%)     7278.78 *   0.53%*
> > Hmean     128     6186.40 (   0.00%)     6187.85 (   0.02%)
> >
> > On numa 0-1:
> >                             5.17-rc1                patched
> > Hmean     1        320.01 (   0.00%)      378.44 *  18.26%*
> > Hmean     2        643.85 (   0.00%)      752.52 *  16.88%*
> > Hmean     4       1287.36 (   0.00%)     1505.62 *  16.95%*
> > Hmean     8       2564.60 (   0.00%)     2955.29 *  15.23%*
> > Hmean     16      5195.69 (   0.00%)     5814.74 *  11.91%*
> > Hmean     32      9769.16 (   0.00%)    10872.63 *  11.30%*
> > Hmean     64     15952.50 (   0.00%)    17281.98 *   8.33%*
> > Hmean     128    13113.77 (   0.00%)    13895.20 *   5.96%*
> > Hmean     256    10997.59 (   0.00%)    11244.69 *   2.25%*
> > Hmean     512    14623.60 (   0.00%)    15526.25 *   6.17%*
> >
> > This will also help to improve the MySQL. With MySQL server
> > running on numa 0 and client running on numa 1, both QPS and
> > latency is imporved on read-write case:
> >                         5.17-rc1        patched
> > QPS-16threads        143333.2633    145077.4033(+1.22%)
> > QPS-24threads        195085.9367    202719.6133(+3.91%)
> > QPS-32threads        241165.6867      249020.74(+3.26%)
> > QPS-64threads        244586.8433    253387.7567(+3.60%)
> > avg-lat-16threads           2.23           2.19(+1.19%)
> > avg-lat-24threads           2.46           2.36(+3.79%)
> > avg-lat-36threads           2.66           2.57(+3.26%)
> > avg-lat-64threads           5.23           5.05(+3.44%)
> >
> > Tested-by: Yicong Yang <yangyicong@hisilicon.com>
> > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> > Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> > ---
> >  kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 42 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 5146163bfabb..2f84a933aedd 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6262,12 +6262,46 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
> >
> >  #endif /* CONFIG_SCHED_SMT */
> >
> > +#ifdef CONFIG_SCHED_CLUSTER
> > +/*
> > + * Scan the cluster domain for idle CPUs and clear cluster cpumask after scanning
> > + */
> > +static inline int scan_cluster(struct task_struct *p, int prev_cpu, int target)
> > +{
> > +     struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> > +     struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target));
> > +     int cpu, idle_cpu;
> > +
> > +     /* TODO: Support SMT case while a machine with both cluster and SMT born */
> > +     if (!sched_smt_active() && sd) {
> > +             for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) {
> > +                     idle_cpu = __select_idle_cpu(cpu, p);
> > +                     if ((unsigned int)idle_cpu < nr_cpumask_bits)
> > +                             return idle_cpu;
> > +             }
> > +
> > +             /* Don't ping-pong tasks in and out cluster frequently */
> > +             if (cpus_share_resources(target, prev_cpu))
> > +                     return target;
>
> We reach here when there aren't any idle CPUs within the
> cluster. However there might be idle CPUs in the MC domain. Is a busy
> @target preferable to a potentially idle CPU within the larger domain
> ?

Hi Gautham,

My benchmark showed some performance regression while load was medium or above
if we grabbed idle cpu in and out the cluster. it turned out the
regression disappeared if
we blocked the ping-pong. so the logic here is that if we have scanned
and found an
idle cpu within the cluster before, we don't let the task jumping back
and forth frequently
as cache synchronization is higher cost. but the code still allows
scanning out of the cluster
if we haven't packed waker and wakee together yet.

it might not be a universal win in all kinds of workload. we saw
tbench, mysql benefit from
the whole change. but pgbench seems not always. so we are still on the
way to make possible
further tuning here.

>
>
> --
> Thanks and Regards
> gautham.

Thanks
Barry

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-01-27 20:21     ` Barry Song
@ 2022-01-28  7:13       ` Srikar Dronamraju
  2022-01-27 18:40         ` Barry Song
  0 siblings, 1 reply; 27+ messages in thread
From: Srikar Dronamraju @ 2022-01-28  7:13 UTC (permalink / raw)
  To: Barry Song
  Cc: Gautham R. Shenoy, Yicong Yang, Peter Zijlstra, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Tim Chen, LKML, LAK,
	Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, prime.zeng, Jonathan Cameron, ego,
	Linuxarm, Barry Song, Guodong Xu

* Barry Song <21cnbao@gmail.com> [2022-01-28 09:21:08]:

> On Fri, Jan 28, 2022 at 4:41 AM Gautham R. Shenoy
> <gautham.shenoy@amd.com> wrote:
> >
> > On Wed, Jan 26, 2022 at 04:09:47PM +0800, Yicong Yang wrote:
> > > From: Barry Song <song.bao.hua@hisilicon.com>
> > >
> > > For platforms having clusters like Kunpeng920, CPUs within the same
> > > cluster have lower latency when synchronizing and accessing shared
> > > resources like cache. Thus, this patch tries to find an idle cpu
> > > within the cluster of the target CPU before scanning the whole LLC
> > > to gain lower latency.
> > >
> > > Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this
> > > patch doesn't consider SMT for this moment.
> > >
> > > Testing has been done on Kunpeng920 by pinning tasks to one numa
> > > and two numa. On Kunpeng920, Each numa has 8 clusters and each
> > > cluster has 4 CPUs.
> > >
> > > With this patch, We noticed enhancement on tbench within one
> > > numa or cross two numa.
> > >
> > > On numa 0:
> > >                             5.17-rc1                patched
> > > Hmean     1        324.73 (   0.00%)      378.01 *  16.41%*
> > > Hmean     2        645.36 (   0.00%)      754.63 *  16.93%*
> > > Hmean     4       1302.09 (   0.00%)     1507.54 *  15.78%*
> > > Hmean     8       2612.03 (   0.00%)     2982.57 *  14.19%*
> > > Hmean     16      5307.12 (   0.00%)     5886.66 *  10.92%*
> > > Hmean     32      9354.22 (   0.00%)     9908.13 *   5.92%*
> > > Hmean     64      7240.35 (   0.00%)     7278.78 *   0.53%*
> > > Hmean     128     6186.40 (   0.00%)     6187.85 (   0.02%)
> > >
> > > On numa 0-1:
> > >                             5.17-rc1                patched
> > > Hmean     1        320.01 (   0.00%)      378.44 *  18.26%*
> > > Hmean     2        643.85 (   0.00%)      752.52 *  16.88%*
> > > Hmean     4       1287.36 (   0.00%)     1505.62 *  16.95%*
> > > Hmean     8       2564.60 (   0.00%)     2955.29 *  15.23%*
> > > Hmean     16      5195.69 (   0.00%)     5814.74 *  11.91%*
> > > Hmean     32      9769.16 (   0.00%)    10872.63 *  11.30%*
> > > Hmean     64     15952.50 (   0.00%)    17281.98 *   8.33%*
> > > Hmean     128    13113.77 (   0.00%)    13895.20 *   5.96%*
> > > Hmean     256    10997.59 (   0.00%)    11244.69 *   2.25%*
> > > Hmean     512    14623.60 (   0.00%)    15526.25 *   6.17%*
> > >
> > > This will also help to improve the MySQL. With MySQL server
> > > running on numa 0 and client running on numa 1, both QPS and
> > > latency is imporved on read-write case:
> > >                         5.17-rc1        patched
> > > QPS-16threads        143333.2633    145077.4033(+1.22%)
> > > QPS-24threads        195085.9367    202719.6133(+3.91%)
> > > QPS-32threads        241165.6867      249020.74(+3.26%)
> > > QPS-64threads        244586.8433    253387.7567(+3.60%)
> > > avg-lat-16threads           2.23           2.19(+1.19%)
> > > avg-lat-24threads           2.46           2.36(+3.79%)
> > > avg-lat-36threads           2.66           2.57(+3.26%)
> > > avg-lat-64threads           5.23           5.05(+3.44%)
> > >
> > > Tested-by: Yicong Yang <yangyicong@hisilicon.com>
> > > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> > > Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> > > ---
> > >  kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++++++++----
> > >  1 file changed, 42 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 5146163bfabb..2f84a933aedd 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -6262,12 +6262,46 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
> > >
> > >  #endif /* CONFIG_SCHED_SMT */
> > >
> > > +#ifdef CONFIG_SCHED_CLUSTER
> > > +/*
> > > + * Scan the cluster domain for idle CPUs and clear cluster cpumask after scanning
> > > + */
> > > +static inline int scan_cluster(struct task_struct *p, int prev_cpu, int target)
> > > +{
> > > +     struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> > > +     struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target));
> > > +     int cpu, idle_cpu;
> > > +
> > > +     /* TODO: Support SMT case while a machine with both cluster and SMT born */
> > > +     if (!sched_smt_active() && sd) {
> > > +             for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) {
> > > +                     idle_cpu = __select_idle_cpu(cpu, p);
> > > +                     if ((unsigned int)idle_cpu < nr_cpumask_bits)
> > > +                             return idle_cpu;
> > > +             }
> > > +
> > > +             /* Don't ping-pong tasks in and out cluster frequently */
> > > +             if (cpus_share_resources(target, prev_cpu))
> > > +                     return target;
> >
> > We reach here when there aren't any idle CPUs within the
> > cluster. However there might be idle CPUs in the MC domain. Is a busy
> > @target preferable to a potentially idle CPU within the larger domain
> > ?
> 
> Hi Gautham,
> 

Hi Barry,


> My benchmark showed some performance regression while load was medium or above
> if we grabbed idle cpu in and out the cluster. it turned out the
> regression disappeared if
> we blocked the ping-pong. so the logic here is that if we have scanned
> and found an
> idle cpu within the cluster before, we don't let the task jumping back
> and forth frequently
> as cache synchronization is higher cost. but the code still allows
> scanning out of the cluster
> if we haven't packed waker and wakee together yet.
> 

Like what Gautham said, should we choose the same cluster if we find that
there are no idle-cpus in the LLC? This way we avoid ping-pong if there are
no idle-cpus but we still pick an idle-cpu to a busy cpu?

> it might not be a universal win in all kinds of workload. we saw
> tbench, mysql benefit from
> the whole change. but pgbench seems not always. so we are still on the
> way to make possible
> further tuning here.
> 

> >
> >
> > --
> > Thanks and Regards
> > gautham.
> 
> Thanks
> Barry

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-01-27 18:40         ` Barry Song
@ 2022-02-01  9:38           ` Srikar Dronamraju
  2022-02-01 20:20             ` Barry Song
  0 siblings, 1 reply; 27+ messages in thread
From: Srikar Dronamraju @ 2022-02-01  9:38 UTC (permalink / raw)
  To: Barry Song
  Cc: Gautham R. Shenoy, Yicong Yang, Peter Zijlstra, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Tim Chen, LKML, LAK,
	Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, prime.zeng, Jonathan Cameron, ego,
	Linuxarm, Barry Song, Guodong Xu

* Barry Song <21cnbao@gmail.com> [2022-01-28 07:40:15]:

> On Fri, Jan 28, 2022 at 8:13 PM Srikar Dronamraju
> <srikar@linux.vnet.ibm.com> wrote:
> >
> > * Barry Song <21cnbao@gmail.com> [2022-01-28 09:21:08]:
> >
> > > On Fri, Jan 28, 2022 at 4:41 AM Gautham R. Shenoy
> > > <gautham.shenoy@amd.com> wrote:
> > > >
> > > > On Wed, Jan 26, 2022 at 04:09:47PM +0800, Yicong Yang wrote:
> > > > > From: Barry Song <song.bao.hua@hisilicon.com>
> > > > >
> > > > > For platforms having clusters like Kunpeng920, CPUs within the same
> > > > > cluster have lower latency when synchronizing and accessing shared
> > > > > resources like cache. Thus, this patch tries to find an idle cpu
> > > > > within the cluster of the target CPU before scanning the whole LLC
> > > > > to gain lower latency.
> > > > >
> > > > > Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this
> > > > > patch doesn't consider SMT for this moment.
> > > > >
> > > > > Testing has been done on Kunpeng920 by pinning tasks to one numa
> > > > > and two numa. On Kunpeng920, Each numa has 8 clusters and each
> > > > > cluster has 4 CPUs.
> > > > >
> > > > > With this patch, We noticed enhancement on tbench within one
> > > > > numa or cross two numa.
> > > > >
> > > > > On numa 0:
> > > > >                             5.17-rc1                patched
> > > > > Hmean     1        324.73 (   0.00%)      378.01 *  16.41%*
> > > > > Hmean     2        645.36 (   0.00%)      754.63 *  16.93%*
> > > > > Hmean     4       1302.09 (   0.00%)     1507.54 *  15.78%*
> > > > > Hmean     8       2612.03 (   0.00%)     2982.57 *  14.19%*
> > > > > Hmean     16      5307.12 (   0.00%)     5886.66 *  10.92%*
> > > > > Hmean     32      9354.22 (   0.00%)     9908.13 *   5.92%*
> > > > > Hmean     64      7240.35 (   0.00%)     7278.78 *   0.53%*
> > > > > Hmean     128     6186.40 (   0.00%)     6187.85 (   0.02%)
> > > > >
> > > > > On numa 0-1:
> > > > >                             5.17-rc1                patched
> > > > > Hmean     1        320.01 (   0.00%)      378.44 *  18.26%*
> > > > > Hmean     2        643.85 (   0.00%)      752.52 *  16.88%*
> > > > > Hmean     4       1287.36 (   0.00%)     1505.62 *  16.95%*
> > > > > Hmean     8       2564.60 (   0.00%)     2955.29 *  15.23%*
> > > > > Hmean     16      5195.69 (   0.00%)     5814.74 *  11.91%*
> > > > > Hmean     32      9769.16 (   0.00%)    10872.63 *  11.30%*
> > > > > Hmean     64     15952.50 (   0.00%)    17281.98 *   8.33%*
> > > > > Hmean     128    13113.77 (   0.00%)    13895.20 *   5.96%*
> > > > > Hmean     256    10997.59 (   0.00%)    11244.69 *   2.25%*
> > > > > Hmean     512    14623.60 (   0.00%)    15526.25 *   6.17%*
> > > > >
> > > > > This will also help to improve the MySQL. With MySQL server
> > > > > running on numa 0 and client running on numa 1, both QPS and
> > > > > latency is imporved on read-write case:
> > > > >                         5.17-rc1        patched
> > > > > QPS-16threads        143333.2633    145077.4033(+1.22%)
> > > > > QPS-24threads        195085.9367    202719.6133(+3.91%)
> > > > > QPS-32threads        241165.6867      249020.74(+3.26%)
> > > > > QPS-64threads        244586.8433    253387.7567(+3.60%)
> > > > > avg-lat-16threads           2.23           2.19(+1.19%)
> > > > > avg-lat-24threads           2.46           2.36(+3.79%)
> > > > > avg-lat-36threads           2.66           2.57(+3.26%)
> > > > > avg-lat-64threads           5.23           5.05(+3.44%)
> > > > >
> > > > > Tested-by: Yicong Yang <yangyicong@hisilicon.com>
> > > > > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> > > > > Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> > > > > ---
> > > > >  kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++++++++----
> > > > >  1 file changed, 42 insertions(+), 4 deletions(-)
> > > > >
> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > index 5146163bfabb..2f84a933aedd 100644
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -6262,12 +6262,46 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
> > > > >
> > > > >  #endif /* CONFIG_SCHED_SMT */
> > > > >
> > > > > +#ifdef CONFIG_SCHED_CLUSTER
> > > > > +/*
> > > > > + * Scan the cluster domain for idle CPUs and clear cluster cpumask after scanning
> > > > > + */
> > > > > +static inline int scan_cluster(struct task_struct *p, int prev_cpu, int target)
> > > > > +{
> > > > > +     struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> > > > > +     struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target));
> > > > > +     int cpu, idle_cpu;
> > > > > +
> > > > > +     /* TODO: Support SMT case while a machine with both cluster and SMT born */
> > > > > +     if (!sched_smt_active() && sd) {
> > > > > +             for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) {
> > > > > +                     idle_cpu = __select_idle_cpu(cpu, p);
> > > > > +                     if ((unsigned int)idle_cpu < nr_cpumask_bits)
> > > > > +                             return idle_cpu;
> > > > > +             }
> > > > > +
> > > > > +             /* Don't ping-pong tasks in and out cluster frequently */
> > > > > +             if (cpus_share_resources(target, prev_cpu))
> > > > > +                     return target;
> > > >
> > > > We reach here when there aren't any idle CPUs within the
> > > > cluster. However there might be idle CPUs in the MC domain. Is a busy
> > > > @target preferable to a potentially idle CPU within the larger domain
> > > > ?
> > >
> > > Hi Gautham,
> > >
> >
> > Hi Barry,
> >
> >
> > > My benchmark showed some performance regression while load was medium or above
> > > if we grabbed idle cpu in and out the cluster. it turned out the
> > > regression disappeared if
> > > we blocked the ping-pong. so the logic here is that if we have scanned
> > > and found an
> > > idle cpu within the cluster before, we don't let the task jumping back
> > > and forth frequently
> > > as cache synchronization is higher cost. but the code still allows
> > > scanning out of the cluster
> > > if we haven't packed waker and wakee together yet.
> > >
> >
> > Like what Gautham said, should we choose the same cluster if we find that
> > there are no idle-cpus in the LLC? This way we avoid ping-pong if there are
> > no idle-cpus but we still pick an idle-cpu to a busy cpu?
> 
> Hi Srikar,
> I am sorry I didn't get your question. Currently the code works as below:
> if task A wakes up task B, and task A is in LLC0 and task B is in LLC1.
> we will scan the cluster of A before scanning the whole LLC0, in this case,
> cluster of A is the closest sibling, so it is the better choice than other CPUs
> which are in LLC0 but not in the cluster of A. 

Yes, this is right.

> But we do scan all cpus of LLC0
> afterwards if we fail to find an idle CPU in the cluster.

However my reading of the patch, before we can scan other clusters within
the LLC (aka LLC0), we have a check in scan cluster which says 

	/* Don't ping-pong tasks in and out cluster frequently */
	if (cpus_share_resources(target, prev_cpu))
	   return target;

My reading of this is, ignore other clusters (at this point, we know there
are no idle CPUs in this cluster. We don't know if there are idle cpus in
them or not) if the previous CPU and target CPU happen to be from the same
cluster. This effectively means we are given preference to cache over idle
CPU.

Or Am I still missing something?

> 
> After a while, if the cluster of A gets an idle CPU and pulls B into the
> cluster, we prefer not pushing B out of the cluster of A again though
> there might be an idle CPU outside. as benchmark shows getting an
> idle CPU out of the cluster of A doesn't bring performance improvement
> but performance decreases as B might be getting in and getting out
> the cluster of A very frequently, then cache coherence ping-pong.
> 

The counter argument can be that Task A and Task B are related and were
running on the same cluster. But Load balancer moved Task B to a different
cluster. Now this check may cause them to continue to run on two different
clusters, even though the underlying load balance issues may have changed.

No?


-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-02-01  9:38           ` Srikar Dronamraju
@ 2022-02-01 20:20             ` Barry Song
  2022-02-04  7:33               ` Srikar Dronamraju
  0 siblings, 1 reply; 27+ messages in thread
From: Barry Song @ 2022-02-01 20:20 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Gautham R. Shenoy, Yicong Yang, Peter Zijlstra, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Tim Chen, LKML, LAK,
	Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, prime.zeng, Jonathan Cameron, ego,
	Linuxarm, Barry Song, Guodong Xu

On Tue, Feb 1, 2022 at 10:39 PM Srikar Dronamraju
<srikar@linux.vnet.ibm.com> wrote:
>
> * Barry Song <21cnbao@gmail.com> [2022-01-28 07:40:15]:
>
> > On Fri, Jan 28, 2022 at 8:13 PM Srikar Dronamraju
> > <srikar@linux.vnet.ibm.com> wrote:
> > >
> > > * Barry Song <21cnbao@gmail.com> [2022-01-28 09:21:08]:
> > >
> > > > On Fri, Jan 28, 2022 at 4:41 AM Gautham R. Shenoy
> > > > <gautham.shenoy@amd.com> wrote:
> > > > >
> > > > > On Wed, Jan 26, 2022 at 04:09:47PM +0800, Yicong Yang wrote:
> > > > > > From: Barry Song <song.bao.hua@hisilicon.com>
> > > > > >
> > > > > > For platforms having clusters like Kunpeng920, CPUs within the same
> > > > > > cluster have lower latency when synchronizing and accessing shared
> > > > > > resources like cache. Thus, this patch tries to find an idle cpu
> > > > > > within the cluster of the target CPU before scanning the whole LLC
> > > > > > to gain lower latency.
> > > > > >
> > > > > > Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so this
> > > > > > patch doesn't consider SMT for this moment.
> > > > > >
> > > > > > Testing has been done on Kunpeng920 by pinning tasks to one numa
> > > > > > and two numa. On Kunpeng920, Each numa has 8 clusters and each
> > > > > > cluster has 4 CPUs.
> > > > > >
> > > > > > With this patch, We noticed enhancement on tbench within one
> > > > > > numa or cross two numa.
> > > > > >
> > > > > > On numa 0:
> > > > > >                             5.17-rc1                patched
> > > > > > Hmean     1        324.73 (   0.00%)      378.01 *  16.41%*
> > > > > > Hmean     2        645.36 (   0.00%)      754.63 *  16.93%*
> > > > > > Hmean     4       1302.09 (   0.00%)     1507.54 *  15.78%*
> > > > > > Hmean     8       2612.03 (   0.00%)     2982.57 *  14.19%*
> > > > > > Hmean     16      5307.12 (   0.00%)     5886.66 *  10.92%*
> > > > > > Hmean     32      9354.22 (   0.00%)     9908.13 *   5.92%*
> > > > > > Hmean     64      7240.35 (   0.00%)     7278.78 *   0.53%*
> > > > > > Hmean     128     6186.40 (   0.00%)     6187.85 (   0.02%)
> > > > > >
> > > > > > On numa 0-1:
> > > > > >                             5.17-rc1                patched
> > > > > > Hmean     1        320.01 (   0.00%)      378.44 *  18.26%*
> > > > > > Hmean     2        643.85 (   0.00%)      752.52 *  16.88%*
> > > > > > Hmean     4       1287.36 (   0.00%)     1505.62 *  16.95%*
> > > > > > Hmean     8       2564.60 (   0.00%)     2955.29 *  15.23%*
> > > > > > Hmean     16      5195.69 (   0.00%)     5814.74 *  11.91%*
> > > > > > Hmean     32      9769.16 (   0.00%)    10872.63 *  11.30%*
> > > > > > Hmean     64     15952.50 (   0.00%)    17281.98 *   8.33%*
> > > > > > Hmean     128    13113.77 (   0.00%)    13895.20 *   5.96%*
> > > > > > Hmean     256    10997.59 (   0.00%)    11244.69 *   2.25%*
> > > > > > Hmean     512    14623.60 (   0.00%)    15526.25 *   6.17%*
> > > > > >
> > > > > > This will also help to improve the MySQL. With MySQL server
> > > > > > running on numa 0 and client running on numa 1, both QPS and
> > > > > > latency is imporved on read-write case:
> > > > > >                         5.17-rc1        patched
> > > > > > QPS-16threads        143333.2633    145077.4033(+1.22%)
> > > > > > QPS-24threads        195085.9367    202719.6133(+3.91%)
> > > > > > QPS-32threads        241165.6867      249020.74(+3.26%)
> > > > > > QPS-64threads        244586.8433    253387.7567(+3.60%)
> > > > > > avg-lat-16threads           2.23           2.19(+1.19%)
> > > > > > avg-lat-24threads           2.46           2.36(+3.79%)
> > > > > > avg-lat-36threads           2.66           2.57(+3.26%)
> > > > > > avg-lat-64threads           5.23           5.05(+3.44%)
> > > > > >
> > > > > > Tested-by: Yicong Yang <yangyicong@hisilicon.com>
> > > > > > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> > > > > > Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> > > > > > ---
> > > > > >  kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++++++++----
> > > > > >  1 file changed, 42 insertions(+), 4 deletions(-)
> > > > > >
> > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > > index 5146163bfabb..2f84a933aedd 100644
> > > > > > --- a/kernel/sched/fair.c
> > > > > > +++ b/kernel/sched/fair.c
> > > > > > @@ -6262,12 +6262,46 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
> > > > > >
> > > > > >  #endif /* CONFIG_SCHED_SMT */
> > > > > >
> > > > > > +#ifdef CONFIG_SCHED_CLUSTER
> > > > > > +/*
> > > > > > + * Scan the cluster domain for idle CPUs and clear cluster cpumask after scanning
> > > > > > + */
> > > > > > +static inline int scan_cluster(struct task_struct *p, int prev_cpu, int target)
> > > > > > +{
> > > > > > +     struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> > > > > > +     struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target));
> > > > > > +     int cpu, idle_cpu;
> > > > > > +
> > > > > > +     /* TODO: Support SMT case while a machine with both cluster and SMT born */
> > > > > > +     if (!sched_smt_active() && sd) {
> > > > > > +             for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) {
> > > > > > +                     idle_cpu = __select_idle_cpu(cpu, p);
> > > > > > +                     if ((unsigned int)idle_cpu < nr_cpumask_bits)
> > > > > > +                             return idle_cpu;
> > > > > > +             }
> > > > > > +
> > > > > > +             /* Don't ping-pong tasks in and out cluster frequently */
> > > > > > +             if (cpus_share_resources(target, prev_cpu))
> > > > > > +                     return target;
> > > > >
> > > > > We reach here when there aren't any idle CPUs within the
> > > > > cluster. However there might be idle CPUs in the MC domain. Is a busy
> > > > > @target preferable to a potentially idle CPU within the larger domain
> > > > > ?
> > > >
> > > > Hi Gautham,
> > > >
> > >
> > > Hi Barry,
> > >
> > >
> > > > My benchmark showed some performance regression while load was medium or above
> > > > if we grabbed idle cpu in and out the cluster. it turned out the
> > > > regression disappeared if
> > > > we blocked the ping-pong. so the logic here is that if we have scanned
> > > > and found an
> > > > idle cpu within the cluster before, we don't let the task jumping back
> > > > and forth frequently
> > > > as cache synchronization is higher cost. but the code still allows
> > > > scanning out of the cluster
> > > > if we haven't packed waker and wakee together yet.
> > > >
> > >
> > > Like what Gautham said, should we choose the same cluster if we find that
> > > there are no idle-cpus in the LLC? This way we avoid ping-pong if there are
> > > no idle-cpus but we still pick an idle-cpu to a busy cpu?
> >
> > Hi Srikar,
> > I am sorry I didn't get your question. Currently the code works as below:
> > if task A wakes up task B, and task A is in LLC0 and task B is in LLC1.
> > we will scan the cluster of A before scanning the whole LLC0, in this case,
> > cluster of A is the closest sibling, so it is the better choice than other CPUs
> > which are in LLC0 but not in the cluster of A.
>
> Yes, this is right.
>
> > But we do scan all cpus of LLC0
> > afterwards if we fail to find an idle CPU in the cluster.
>
> However my reading of the patch, before we can scan other clusters within
> the LLC (aka LLC0), we have a check in scan cluster which says
>
>         /* Don't ping-pong tasks in and out cluster frequently */
>         if (cpus_share_resources(target, prev_cpu))
>            return target;
>
> My reading of this is, ignore other clusters (at this point, we know there
> are no idle CPUs in this cluster. We don't know if there are idle cpus in
> them or not) if the previous CPU and target CPU happen to be from the same
> cluster. This effectively means we are given preference to cache over idle
> CPU.

Note we only ignore other cluster while prev_cpu and target are in same
cluster. if the condition is false, we are not ignoring other cpus. typically,
if waker is the target, and wakee is the prev_cpu, that means if they are
already in one cluster, we don't stupidly spread them in select_idle_cpu() path
as benchmark shows we are losing. so, yes, we are giving preference to
cache over CPU.

>
> Or Am I still missing something?
>
> >
> > After a while, if the cluster of A gets an idle CPU and pulls B into the
> > cluster, we prefer not pushing B out of the cluster of A again though
> > there might be an idle CPU outside. as benchmark shows getting an
> > idle CPU out of the cluster of A doesn't bring performance improvement
> > but performance decreases as B might be getting in and getting out
> > the cluster of A very frequently, then cache coherence ping-pong.
> >
>
> The counter argument can be that Task A and Task B are related and were
> running on the same cluster. But Load balancer moved Task B to a different
> cluster. Now this check may cause them to continue to run on two different
> clusters, even though the underlying load balance issues may have changed.
>
> No?

LB is much slower than select_idle_cpu().  select_idle_cpu() can dynamically
work afterwards. so it is always a dynamic balance and task migration.

>
>
> --
> Thanks and Regards
> Srikar Dronamraju

Thanks
Barry

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-02-01 20:20             ` Barry Song
@ 2022-02-04  7:33               ` Srikar Dronamraju
  2022-02-04 10:28                 ` Barry Song
  0 siblings, 1 reply; 27+ messages in thread
From: Srikar Dronamraju @ 2022-02-04  7:33 UTC (permalink / raw)
  To: Barry Song
  Cc: Gautham R. Shenoy, Yicong Yang, Peter Zijlstra, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Tim Chen, LKML, LAK,
	Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, prime.zeng, Jonathan Cameron, ego,
	Linuxarm, Barry Song, Guodong Xu

* Barry Song <21cnbao@gmail.com> [2022-02-02 09:20:32]:

> On Tue, Feb 1, 2022 at 10:39 PM Srikar Dronamraju
> <srikar@linux.vnet.ibm.com> wrote:
> >
> > * Barry Song <21cnbao@gmail.com> [2022-01-28 07:40:15]:
> >
> > > On Fri, Jan 28, 2022 at 8:13 PM Srikar Dronamraju
> > > <srikar@linux.vnet.ibm.com> wrote:
> > > >
> > > > * Barry Song <21cnbao@gmail.com> [2022-01-28 09:21:08]:
> > > >
> > > > > On Fri, Jan 28, 2022 at 4:41 AM Gautham R. Shenoy
> > > > > <gautham.shenoy@amd.com> wrote:
> > > > > >
> > > > > > On Wed, Jan 26, 2022 at 04:09:47PM +0800, Yicong Yang wrote:
> > > > > > > From: Barry Song <song.bao.hua@hisilicon.com>
> > > > > > >
> > > I am sorry I didn't get your question. Currently the code works as below:
> > > if task A wakes up task B, and task A is in LLC0 and task B is in LLC1.
> > > we will scan the cluster of A before scanning the whole LLC0, in this case,
> > > cluster of A is the closest sibling, so it is the better choice than other CPUs
> > > which are in LLC0 but not in the cluster of A.
> >
> > Yes, this is right.
> >
> > > But we do scan all cpus of LLC0
> > > afterwards if we fail to find an idle CPU in the cluster.
> >
> > However my reading of the patch, before we can scan other clusters within
> > the LLC (aka LLC0), we have a check in scan cluster which says
> >
> >         /* Don't ping-pong tasks in and out cluster frequently */
> >         if (cpus_share_resources(target, prev_cpu))
> >            return target;
> >
> > My reading of this is, ignore other clusters (at this point, we know there
> > are no idle CPUs in this cluster. We don't know if there are idle cpus in
> > them or not) if the previous CPU and target CPU happen to be from the same
> > cluster. This effectively means we are given preference to cache over idle
> > CPU.
> 
> Note we only ignore other cluster while prev_cpu and target are in same
> cluster. if the condition is false, we are not ignoring other cpus. typically,
> if waker is the target, and wakee is the prev_cpu, that means if they are
> already in one cluster, we don't stupidly spread them in select_idle_cpu() path
> as benchmark shows we are losing. so, yes, we are giving preference to
> cache over CPU.

We already figured out that there are no idle CPUs in this cluster. So dont
we gain performance by picking a idle CPU/core in the neighbouring cluster.
If there are no idle CPU/core in the neighbouring cluster, then it does make
sense to fallback on the current cluster.

> 
> >
> > Or Am I still missing something?
> >
> > >
> > > After a while, if the cluster of A gets an idle CPU and pulls B into the
> > > cluster, we prefer not pushing B out of the cluster of A again though
> > > there might be an idle CPU outside. as benchmark shows getting an
> > > idle CPU out of the cluster of A doesn't bring performance improvement
> > > but performance decreases as B might be getting in and getting out
> > > the cluster of A very frequently, then cache coherence ping-pong.
> > >
> >
> > The counter argument can be that Task A and Task B are related and were
> > running on the same cluster. But Load balancer moved Task B to a different
> > cluster. Now this check may cause them to continue to run on two different
> > clusters, even though the underlying load balance issues may have changed.
> >
> > No?
> 
> LB is much slower than select_idle_cpu().  select_idle_cpu() can dynamically
> work afterwards. so it is always a dynamic balance and task migration.
> 
> >
> >
> > --
> > Thanks and Regards
> > Srikar Dronamraju
> 
> Thanks
> Barry

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-02-04  7:33               ` Srikar Dronamraju
@ 2022-02-04 10:28                 ` Barry Song
  2022-02-04 10:49                   ` Barry Song
  2022-02-07 15:14                   ` Gautham R. Shenoy
  0 siblings, 2 replies; 27+ messages in thread
From: Barry Song @ 2022-02-04 10:28 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Gautham R. Shenoy, Yicong Yang, Peter Zijlstra, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Tim Chen, LKML, LAK,
	Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, prime.zeng, Jonathan Cameron, ego,
	Linuxarm, Barry Song, Guodong Xu

On Fri, Feb 4, 2022 at 8:33 PM Srikar Dronamraju
<srikar@linux.vnet.ibm.com> wrote:
>
> * Barry Song <21cnbao@gmail.com> [2022-02-02 09:20:32]:
>
> > On Tue, Feb 1, 2022 at 10:39 PM Srikar Dronamraju
> > <srikar@linux.vnet.ibm.com> wrote:
> > >
> > > * Barry Song <21cnbao@gmail.com> [2022-01-28 07:40:15]:
> > >
> > > > On Fri, Jan 28, 2022 at 8:13 PM Srikar Dronamraju
> > > > <srikar@linux.vnet.ibm.com> wrote:
> > > > >
> > > > > * Barry Song <21cnbao@gmail.com> [2022-01-28 09:21:08]:
> > > > >
> > > > > > On Fri, Jan 28, 2022 at 4:41 AM Gautham R. Shenoy
> > > > > > <gautham.shenoy@amd.com> wrote:
> > > > > > >
> > > > > > > On Wed, Jan 26, 2022 at 04:09:47PM +0800, Yicong Yang wrote:
> > > > > > > > From: Barry Song <song.bao.hua@hisilicon.com>
> > > > > > > >
> > > > I am sorry I didn't get your question. Currently the code works as below:
> > > > if task A wakes up task B, and task A is in LLC0 and task B is in LLC1.
> > > > we will scan the cluster of A before scanning the whole LLC0, in this case,
> > > > cluster of A is the closest sibling, so it is the better choice than other CPUs
> > > > which are in LLC0 but not in the cluster of A.
> > >
> > > Yes, this is right.
> > >
> > > > But we do scan all cpus of LLC0
> > > > afterwards if we fail to find an idle CPU in the cluster.
> > >
> > > However my reading of the patch, before we can scan other clusters within
> > > the LLC (aka LLC0), we have a check in scan cluster which says
> > >
> > >         /* Don't ping-pong tasks in and out cluster frequently */
> > >         if (cpus_share_resources(target, prev_cpu))
> > >            return target;
> > >
> > > My reading of this is, ignore other clusters (at this point, we know there
> > > are no idle CPUs in this cluster. We don't know if there are idle cpus in
> > > them or not) if the previous CPU and target CPU happen to be from the same
> > > cluster. This effectively means we are given preference to cache over idle
> > > CPU.
> >
> > Note we only ignore other cluster while prev_cpu and target are in same
> > cluster. if the condition is false, we are not ignoring other cpus. typically,
> > if waker is the target, and wakee is the prev_cpu, that means if they are
> > already in one cluster, we don't stupidly spread them in select_idle_cpu() path
> > as benchmark shows we are losing. so, yes, we are giving preference to
> > cache over CPU.
>
> We already figured out that there are no idle CPUs in this cluster. So dont
> we gain performance by picking a idle CPU/core in the neighbouring cluster.
> If there are no idle CPU/core in the neighbouring cluster, then it does make
> sense to fallback on the current cluster.

What you suggested is exactly the approach we have tried at the first beginning
during debugging. but we didn't gain performance according to benchmark, we
were actually losing. that is why we added this line to stop ping-pong:
         /* Don't ping-pong tasks in and out cluster frequently */
         if (cpus_share_resources(target, prev_cpu))
            return target;

If we delete this, we are seeing a big loss of tbench while system
load is medium
and above.

>
> >
> > >
> > > Or Am I still missing something?
> > >
> > > >
> > > > After a while, if the cluster of A gets an idle CPU and pulls B into the
> > > > cluster, we prefer not pushing B out of the cluster of A again though
> > > > there might be an idle CPU outside. as benchmark shows getting an
> > > > idle CPU out of the cluster of A doesn't bring performance improvement
> > > > but performance decreases as B might be getting in and getting out
> > > > the cluster of A very frequently, then cache coherence ping-pong.
> > > >
> > >
> > > The counter argument can be that Task A and Task B are related and were
> > > running on the same cluster. But Load balancer moved Task B to a different
> > > cluster. Now this check may cause them to continue to run on two different
> > > clusters, even though the underlying load balance issues may have changed.
> > >
> > > No?
> >
> > LB is much slower than select_idle_cpu().  select_idle_cpu() can dynamically
> > work afterwards. so it is always a dynamic balance and task migration.
> >
> > >
> > >
> > > --
> > > Thanks and Regards
> > > Srikar Dronamraju
> >
> > Thanks
> > Barry
>
> --

Thanks
Barry

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-02-04 10:28                 ` Barry Song
@ 2022-02-04 10:49                   ` Barry Song
  2022-02-04 17:41                     ` Tim Chen
  2022-02-07 15:14                   ` Gautham R. Shenoy
  1 sibling, 1 reply; 27+ messages in thread
From: Barry Song @ 2022-02-04 10:49 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Gautham R. Shenoy, Yicong Yang, Peter Zijlstra, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Tim Chen, LKML, LAK,
	Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, prime.zeng, Jonathan Cameron, ego,
	Linuxarm, Barry Song, Guodong Xu

On Fri, Feb 4, 2022 at 11:28 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Fri, Feb 4, 2022 at 8:33 PM Srikar Dronamraju
> <srikar@linux.vnet.ibm.com> wrote:
> >
> > * Barry Song <21cnbao@gmail.com> [2022-02-02 09:20:32]:
> >
> > > On Tue, Feb 1, 2022 at 10:39 PM Srikar Dronamraju
> > > <srikar@linux.vnet.ibm.com> wrote:
> > > >
> > > > * Barry Song <21cnbao@gmail.com> [2022-01-28 07:40:15]:
> > > >
> > > > > On Fri, Jan 28, 2022 at 8:13 PM Srikar Dronamraju
> > > > > <srikar@linux.vnet.ibm.com> wrote:
> > > > > >
> > > > > > * Barry Song <21cnbao@gmail.com> [2022-01-28 09:21:08]:
> > > > > >
> > > > > > > On Fri, Jan 28, 2022 at 4:41 AM Gautham R. Shenoy
> > > > > > > <gautham.shenoy@amd.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, Jan 26, 2022 at 04:09:47PM +0800, Yicong Yang wrote:
> > > > > > > > > From: Barry Song <song.bao.hua@hisilicon.com>
> > > > > > > > >
> > > > > I am sorry I didn't get your question. Currently the code works as below:
> > > > > if task A wakes up task B, and task A is in LLC0 and task B is in LLC1.
> > > > > we will scan the cluster of A before scanning the whole LLC0, in this case,
> > > > > cluster of A is the closest sibling, so it is the better choice than other CPUs
> > > > > which are in LLC0 but not in the cluster of A.
> > > >
> > > > Yes, this is right.
> > > >
> > > > > But we do scan all cpus of LLC0
> > > > > afterwards if we fail to find an idle CPU in the cluster.
> > > >
> > > > However my reading of the patch, before we can scan other clusters within
> > > > the LLC (aka LLC0), we have a check in scan cluster which says
> > > >
> > > >         /* Don't ping-pong tasks in and out cluster frequently */
> > > >         if (cpus_share_resources(target, prev_cpu))
> > > >            return target;
> > > >
> > > > My reading of this is, ignore other clusters (at this point, we know there
> > > > are no idle CPUs in this cluster. We don't know if there are idle cpus in
> > > > them or not) if the previous CPU and target CPU happen to be from the same
> > > > cluster. This effectively means we are given preference to cache over idle
> > > > CPU.
> > >
> > > Note we only ignore other cluster while prev_cpu and target are in same
> > > cluster. if the condition is false, we are not ignoring other cpus. typically,
> > > if waker is the target, and wakee is the prev_cpu, that means if they are
> > > already in one cluster, we don't stupidly spread them in select_idle_cpu() path
> > > as benchmark shows we are losing. so, yes, we are giving preference to
> > > cache over CPU.
> >
> > We already figured out that there are no idle CPUs in this cluster. So dont
> > we gain performance by picking a idle CPU/core in the neighbouring cluster.
> > If there are no idle CPU/core in the neighbouring cluster, then it does make
> > sense to fallback on the current cluster.
>
> What you suggested is exactly the approach we have tried at the first beginning
> during debugging. but we didn't gain performance according to benchmark, we
> were actually losing. that is why we added this line to stop ping-pong:
>          /* Don't ping-pong tasks in and out cluster frequently */
>          if (cpus_share_resources(target, prev_cpu))
>             return target;
>
> If we delete this, we are seeing a big loss of tbench while system
> load is medium
> and above.

While one system has a high load,  if we scan the neighbour clusters
after we fail to
get an idle cpu in the target. we can really successfully find an idle
cpu in neighbours
sometimes. There is no doubt of this. Our experiments have shown this is 100%
true. But the problem is that actually all cpus are very busy, so each
cpu gets a
very small idle time. After we migrate tasks to neighbours,  shortly
the neighbour
clusters will be full of tasks, then the neighbours might kick their
tasks out afterwards.
so we will see tasks move all around many clusters like monkeys, then we don't
gain performance.

>
> >
> > >
> > > >
> > > > Or Am I still missing something?
> > > >
> > > > >
> > > > > After a while, if the cluster of A gets an idle CPU and pulls B into the
> > > > > cluster, we prefer not pushing B out of the cluster of A again though
> > > > > there might be an idle CPU outside. as benchmark shows getting an
> > > > > idle CPU out of the cluster of A doesn't bring performance improvement
> > > > > but performance decreases as B might be getting in and getting out
> > > > > the cluster of A very frequently, then cache coherence ping-pong.
> > > > >
> > > >
> > > > The counter argument can be that Task A and Task B are related and were
> > > > running on the same cluster. But Load balancer moved Task B to a different
> > > > cluster. Now this check may cause them to continue to run on two different
> > > > clusters, even though the underlying load balance issues may have changed.
> > > >
> > > > No?
> > >
> > > LB is much slower than select_idle_cpu().  select_idle_cpu() can dynamically
> > > work afterwards. so it is always a dynamic balance and task migration.
> > >
> > > >
> > > >
> > > > --
> > > > Thanks and Regards
> > > > Srikar Dronamraju
> > >
> > > Thanks
> > > Barry
> >
> > --
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-02-04 10:49                   ` Barry Song
@ 2022-02-04 17:41                     ` Tim Chen
  2022-02-05 17:16                       ` Chen Yu
  0 siblings, 1 reply; 27+ messages in thread
From: Tim Chen @ 2022-02-04 17:41 UTC (permalink / raw)
  To: Barry Song, Srikar Dronamraju
  Cc: Gautham R. Shenoy, Yicong Yang, Peter Zijlstra, Ingo Molnar,
	Juri Lelli, Vincent Guittot, LKML, LAK, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira,
	prime.zeng, Jonathan Cameron, ego, Linuxarm, Barry Song,
	Guodong Xu, yu.c.chen

On Fri, 2022-02-04 at 23:49 +1300, Barry Song wrote:
> On Fri, Feb 4, 2022 at 11:28 PM Barry Song <21cnbao@gmail.com> wrote:
> > On Fri, Feb 4, 2022 at 8:33 PM Srikar Dronamraju
> > <srikar@linux.vnet.ibm.com> wrote:
> > > * Barry Song <21cnbao@gmail.com> [2022-02-02 09:20:32]:
> > > 
> > > > On Tue, Feb 1, 2022 at 10:39 PM Srikar Dronamraju
> > > > <srikar@linux.vnet.ibm.com> wrote:
> > > > > * Barry Song <21cnbao@gmail.com> [2022-01-28 07:40:15]:
> > > > > 
> > > > > > On Fri, Jan 28, 2022 at 8:13 PM Srikar Dronamraju
> > > > > > <srikar@linux.vnet.ibm.com> wrote:
> > > > > > > * Barry Song <21cnbao@gmail.com> [2022-01-28 09:21:08]:
> > > > > > > 
> > > > > > > > On Fri, Jan 28, 2022 at 4:41 AM Gautham R. Shenoy
> > > > > > > > <gautham.shenoy@amd.com> wrote:
> > > > > > > > > On Wed, Jan 26, 2022 at 04:09:47PM +0800, Yicong Yang
> > > > > > > > > wrote:
> > > > > > > > > > From: Barry Song <song.bao.hua@hisilicon.com>
> > > > > > > > > > 
> > > > > > I am sorry I didn't get your question. Currently the code
> > > > > > works as below:
> > > > > > if task A wakes up task B, and task A is in LLC0 and task B
> > > > > > is in LLC1.
> > > > > > we will scan the cluster of A before scanning the whole
> > > > > > LLC0, in this case,
> > > > > > cluster of A is the closest sibling, so it is the better
> > > > > > choice than other CPUs
> > > > > > which are in LLC0 but not in the cluster of A.
> > > > > 
> > > > > Yes, this is right.
> > > > > 
> > > > > > But we do scan all cpus of LLC0
> > > > > > afterwards if we fail to find an idle CPU in the cluster.
> > > > > 
> > > > > However my reading of the patch, before we can scan other
> > > > > clusters within
> > > > > the LLC (aka LLC0), we have a check in scan cluster which
> > > > > says
> > > > > 
> > > > >         /* Don't ping-pong tasks in and out cluster
> > > > > frequently */
> > > > >         if (cpus_share_resources(target, prev_cpu))
> > > > >            return target;
> > > > > 
> > > > > My reading of this is, ignore other clusters (at this point,
> > > > > we know there
> > > > > are no idle CPUs in this cluster. We don't know if there are
> > > > > idle cpus in
> > > > > them or not) if the previous CPU and target CPU happen to be
> > > > > from the same
> > > > > cluster. This effectively means we are given preference to
> > > > > cache over idle
> > > > > CPU.
> > > > 
> > > > Note we only ignore other cluster while prev_cpu and target are
> > > > in same
> > > > cluster. if the condition is false, we are not ignoring other
> > > > cpus. typically,
> > > > if waker is the target, and wakee is the prev_cpu, that means
> > > > if they are
> > > > already in one cluster, we don't stupidly spread them in
> > > > select_idle_cpu() path
> > > > as benchmark shows we are losing. so, yes, we are giving
> > > > preference to
> > > > cache over CPU.
> > > 
> > > We already figured out that there are no idle CPUs in this
> > > cluster. So dont
> > > we gain performance by picking a idle CPU/core in the
> > > neighbouring cluster.
> > > If there are no idle CPU/core in the neighbouring cluster, then
> > > it does make
> > > sense to fallback on the current cluster.
> > 
> > 

We may need to take into consideration the utilization and
load average for the source and target cluster to make
better decision of whether it is worth placing the
task in the next cluster.  If the load of the target
cluster is too high, it is not worth pushing the task there.

Those stats can be gathered during load balancing without adding
overhead in the hot task wakeup path.

Chen Yu played around with cutting off the idle CPU search
in a LLC based on such stats and he saw some good
improvements over the default.

Tim


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-02-04 17:41                     ` Tim Chen
@ 2022-02-05 17:16                       ` Chen Yu
  2022-02-06  0:26                         ` Barry Song
  0 siblings, 1 reply; 27+ messages in thread
From: Chen Yu @ 2022-02-05 17:16 UTC (permalink / raw)
  To: Tim Chen
  Cc: Barry Song, Srikar Dronamraju, Gautham R. Shenoy, Yicong Yang,
	Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, LKML,
	LAK, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, prime.zeng, Jonathan Cameron, ego,
	Linuxarm, Barry Song, Guodong Xu

On Fri, Feb 04, 2022 at 09:41:21AM -0800, Tim Chen wrote:
> On Fri, 2022-02-04 at 23:49 +1300, Barry Song wrote:
> > On Fri, Feb 4, 2022 at 11:28 PM Barry Song <21cnbao@gmail.com> wrote:
> > > On Fri, Feb 4, 2022 at 8:33 PM Srikar Dronamraju
> > > <srikar@linux.vnet.ibm.com> wrote:
> > > > * Barry Song <21cnbao@gmail.com> [2022-02-02 09:20:32]:
> > > > 
> > > > > On Tue, Feb 1, 2022 at 10:39 PM Srikar Dronamraju
> > > > > <srikar@linux.vnet.ibm.com> wrote:
> > > > > > * Barry Song <21cnbao@gmail.com> [2022-01-28 07:40:15]:
> > > > > > 
> > > > > > > On Fri, Jan 28, 2022 at 8:13 PM Srikar Dronamraju
> > > > > > > <srikar@linux.vnet.ibm.com> wrote:
> > > > > > > > * Barry Song <21cnbao@gmail.com> [2022-01-28 09:21:08]:
> > > > > > > > 
> > > > > > > > > On Fri, Jan 28, 2022 at 4:41 AM Gautham R. Shenoy
> > > > > > > > > <gautham.shenoy@amd.com> wrote:
> > > > > > > > > > On Wed, Jan 26, 2022 at 04:09:47PM +0800, Yicong Yang
> > > > > > > > > > wrote:
> > > > > > > > > > > From: Barry Song <song.bao.hua@hisilicon.com>
> > > > > > > > > > > 
> > > > > > > I am sorry I didn't get your question. Currently the code
> > > > > > > works as below:
> > > > > > > if task A wakes up task B, and task A is in LLC0 and task B
> > > > > > > is in LLC1.
> > > > > > > we will scan the cluster of A before scanning the whole
> > > > > > > LLC0, in this case,
> > > > > > > cluster of A is the closest sibling, so it is the better
> > > > > > > choice than other CPUs
> > > > > > > which are in LLC0 but not in the cluster of A.
> > > > > > 
> > > > > > Yes, this is right.
> > > > > > 
> > > > > > > But we do scan all cpus of LLC0
> > > > > > > afterwards if we fail to find an idle CPU in the cluster.
> > > > > > 
> > > > > > However my reading of the patch, before we can scan other
> > > > > > clusters within
> > > > > > the LLC (aka LLC0), we have a check in scan cluster which
> > > > > > says
> > > > > > 
> > > > > >         /* Don't ping-pong tasks in and out cluster
> > > > > > frequently */
> > > > > >         if (cpus_share_resources(target, prev_cpu))
> > > > > >            return target;
> > > > > > 
> > > > > > My reading of this is, ignore other clusters (at this point,
> > > > > > we know there
> > > > > > are no idle CPUs in this cluster. We don't know if there are
> > > > > > idle cpus in
> > > > > > them or not) if the previous CPU and target CPU happen to be
> > > > > > from the same
> > > > > > cluster. This effectively means we are given preference to
> > > > > > cache over idle
> > > > > > CPU.
> > > > > 
> > > > > Note we only ignore other cluster while prev_cpu and target are
> > > > > in same
> > > > > cluster. if the condition is false, we are not ignoring other
> > > > > cpus. typically,
> > > > > if waker is the target, and wakee is the prev_cpu, that means
> > > > > if they are
> > > > > already in one cluster, we don't stupidly spread them in
> > > > > select_idle_cpu() path
> > > > > as benchmark shows we are losing. so, yes, we are giving
> > > > > preference to
> > > > > cache over CPU.
> > > > 
> > > > We already figured out that there are no idle CPUs in this
> > > > cluster. So dont
> > > > we gain performance by picking a idle CPU/core in the
> > > > neighbouring cluster.
> > > > If there are no idle CPU/core in the neighbouring cluster, then
> > > > it does make
> > > > sense to fallback on the current cluster.
> > > 
> > > 
> 
> We may need to take into consideration the utilization and
> load average for the source and target cluster to make
> better decision of whether it is worth placing the
> task in the next cluster.  If the load of the target
> cluster is too high, it is not worth pushing the task there.
> 
> Those stats can be gathered during load balancing without adding
> overhead in the hot task wakeup path.
> 
> Chen Yu played around with cutting off the idle CPU search
> in a LLC based on such stats and he saw some good
> improvements over the default.
>
Yes, we used the sum of percpu util_avg to estimate if the LLC domain
is overloaded. If it is too busy, skip searching for an idle cpu/core in
that LLC domain. The util_avg is a metric of accumulated historic
activity, which might be more accurate than instantaneous metrics(such as
rq->nr_running) on calculating the probability of find an idle cpu.
So far this change has shown some benefits in several microbenchmarks and
OLTP benchmark when the system is quite busy. That change has introduced a
per-LLC-domain flag to indicate whether the LLC domain is oveloaded,
it seems that this flag could also be extended for cluster domain.
Maybe I could post the draft patch to see if it would be helpful for this
cluster patch serie.

thanks,
Chenyu
> Tim
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-02-05 17:16                       ` Chen Yu
@ 2022-02-06  0:26                         ` Barry Song
  0 siblings, 0 replies; 27+ messages in thread
From: Barry Song @ 2022-02-06  0:26 UTC (permalink / raw)
  To: Chen Yu
  Cc: Tim Chen, Srikar Dronamraju, Gautham R. Shenoy, Yicong Yang,
	Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, LKML,
	LAK, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, prime.zeng, Jonathan Cameron, ego,
	Linuxarm, Barry Song, Guodong Xu

On Sun, Feb 6, 2022 at 6:16 AM Chen Yu <yu.c.chen@intel.com> wrote:
>
> On Fri, Feb 04, 2022 at 09:41:21AM -0800, Tim Chen wrote:
> > On Fri, 2022-02-04 at 23:49 +1300, Barry Song wrote:
> > > On Fri, Feb 4, 2022 at 11:28 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > On Fri, Feb 4, 2022 at 8:33 PM Srikar Dronamraju
> > > > <srikar@linux.vnet.ibm.com> wrote:
> > > > > * Barry Song <21cnbao@gmail.com> [2022-02-02 09:20:32]:
> > > > >
> > > > > > On Tue, Feb 1, 2022 at 10:39 PM Srikar Dronamraju
> > > > > > <srikar@linux.vnet.ibm.com> wrote:
> > > > > > > * Barry Song <21cnbao@gmail.com> [2022-01-28 07:40:15]:
> > > > > > >
> > > > > > > > On Fri, Jan 28, 2022 at 8:13 PM Srikar Dronamraju
> > > > > > > > <srikar@linux.vnet.ibm.com> wrote:
> > > > > > > > > * Barry Song <21cnbao@gmail.com> [2022-01-28 09:21:08]:
> > > > > > > > >
> > > > > > > > > > On Fri, Jan 28, 2022 at 4:41 AM Gautham R. Shenoy
> > > > > > > > > > <gautham.shenoy@amd.com> wrote:
> > > > > > > > > > > On Wed, Jan 26, 2022 at 04:09:47PM +0800, Yicong Yang
> > > > > > > > > > > wrote:
> > > > > > > > > > > > From: Barry Song <song.bao.hua@hisilicon.com>
> > > > > > > > > > > >
> > > > > > > > I am sorry I didn't get your question. Currently the code
> > > > > > > > works as below:
> > > > > > > > if task A wakes up task B, and task A is in LLC0 and task B
> > > > > > > > is in LLC1.
> > > > > > > > we will scan the cluster of A before scanning the whole
> > > > > > > > LLC0, in this case,
> > > > > > > > cluster of A is the closest sibling, so it is the better
> > > > > > > > choice than other CPUs
> > > > > > > > which are in LLC0 but not in the cluster of A.
> > > > > > >
> > > > > > > Yes, this is right.
> > > > > > >
> > > > > > > > But we do scan all cpus of LLC0
> > > > > > > > afterwards if we fail to find an idle CPU in the cluster.
> > > > > > >
> > > > > > > However my reading of the patch, before we can scan other
> > > > > > > clusters within
> > > > > > > the LLC (aka LLC0), we have a check in scan cluster which
> > > > > > > says
> > > > > > >
> > > > > > >         /* Don't ping-pong tasks in and out cluster
> > > > > > > frequently */
> > > > > > >         if (cpus_share_resources(target, prev_cpu))
> > > > > > >            return target;
> > > > > > >
> > > > > > > My reading of this is, ignore other clusters (at this point,
> > > > > > > we know there
> > > > > > > are no idle CPUs in this cluster. We don't know if there are
> > > > > > > idle cpus in
> > > > > > > them or not) if the previous CPU and target CPU happen to be
> > > > > > > from the same
> > > > > > > cluster. This effectively means we are given preference to
> > > > > > > cache over idle
> > > > > > > CPU.
> > > > > >
> > > > > > Note we only ignore other cluster while prev_cpu and target are
> > > > > > in same
> > > > > > cluster. if the condition is false, we are not ignoring other
> > > > > > cpus. typically,
> > > > > > if waker is the target, and wakee is the prev_cpu, that means
> > > > > > if they are
> > > > > > already in one cluster, we don't stupidly spread them in
> > > > > > select_idle_cpu() path
> > > > > > as benchmark shows we are losing. so, yes, we are giving
> > > > > > preference to
> > > > > > cache over CPU.
> > > > >
> > > > > We already figured out that there are no idle CPUs in this
> > > > > cluster. So dont
> > > > > we gain performance by picking a idle CPU/core in the
> > > > > neighbouring cluster.
> > > > > If there are no idle CPU/core in the neighbouring cluster, then
> > > > > it does make
> > > > > sense to fallback on the current cluster.
> > > >
> > > >
> >
> > We may need to take into consideration the utilization and
> > load average for the source and target cluster to make
> > better decision of whether it is worth placing the
> > task in the next cluster.  If the load of the target
> > cluster is too high, it is not worth pushing the task there.
> >
> > Those stats can be gathered during load balancing without adding
> > overhead in the hot task wakeup path.
> >
> > Chen Yu played around with cutting off the idle CPU search
> > in a LLC based on such stats and he saw some good
> > improvements over the default.
> >
> Yes, we used the sum of percpu util_avg to estimate if the LLC domain
> is overloaded. If it is too busy, skip searching for an idle cpu/core in
> that LLC domain. The util_avg is a metric of accumulated historic
> activity, which might be more accurate than instantaneous metrics(such as
> rq->nr_running) on calculating the probability of find an idle cpu.
> So far this change has shown some benefits in several microbenchmarks and
> OLTP benchmark when the system is quite busy. That change has introduced a
> per-LLC-domain flag to indicate whether the LLC domain is oveloaded,
> it seems that this flag could also be extended for cluster domain.
> Maybe I could post the draft patch to see if it would be helpful for this
> cluster patch serie.

yes. please send. my feeling is that select_idle_cpu() can select an "idle"cpu
which is actually very busy, but can be in "idle" state for a very
short period. it
is not always correct to get this kind of "idle" cpu. It could be
better to be still.
I am not quite sure your patch is directly related with clusters, but we will
try to figure out some connection, maybe we can integrate your patch into
this series afterwards.

>
> thanks,
> Chenyu
> > Tim
> >

Thanks
Barry

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-02-04 10:28                 ` Barry Song
  2022-02-04 10:49                   ` Barry Song
@ 2022-02-07 15:14                   ` Gautham R. Shenoy
  2022-02-08  5:42                     ` Barry Song
  1 sibling, 1 reply; 27+ messages in thread
From: Gautham R. Shenoy @ 2022-02-07 15:14 UTC (permalink / raw)
  To: Barry Song
  Cc: Srikar Dronamraju, Yicong Yang, Peter Zijlstra, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Tim Chen, LKML, LAK,
	Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, prime.zeng, Jonathan Cameron, ego,
	Linuxarm, Barry Song, Guodong Xu


On Fri, Feb 04, 2022 at 11:28:25PM +1300, Barry Song wrote:

> > We already figured out that there are no idle CPUs in this cluster. So dont
> > we gain performance by picking a idle CPU/core in the neighbouring cluster.
> > If there are no idle CPU/core in the neighbouring cluster, then it does make
> > sense to fallback on the current cluster.
> 
> What you suggested is exactly the approach we have tried at the first beginning
> during debugging. but we didn't gain performance according to benchmark, we
> were actually losing. that is why we added this line to stop ping-pong:
>          /* Don't ping-pong tasks in and out cluster frequently */
>          if (cpus_share_resources(target, prev_cpu))
>             return target;
> 
> If we delete this, we are seeing a big loss of tbench while system
> load is medium
> and above.

Thanks for clarifying this Barry. Indeed, if the workload is sensitive
to data ping-ponging across L2 clusters, this heuristic makes sense. I
was thinking of workloads that require lower tail latency, in which
case exploring the larger LLC would have made more sense, assuming
that the larger LLC has an idle core/CPU.

In the absence of any hints from the workload, like something that
Peter had previous suggested
(https://lore.kernel.org/lkml/YVwnsrZWrnWHaoqN@hirez.programming.kicks-ass.net/),
optimizing for cache-access seems to be the right thing to do.


> 
> Thanks
> Barry

--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-02-07 15:14                   ` Gautham R. Shenoy
@ 2022-02-08  5:42                     ` Barry Song
  2022-02-16  9:12                       ` Barry Song
  0 siblings, 1 reply; 27+ messages in thread
From: Barry Song @ 2022-02-08  5:42 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Srikar Dronamraju, Yicong Yang, Peter Zijlstra, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Tim Chen, LKML, LAK,
	Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, prime.zeng, Jonathan Cameron, ego,
	Linuxarm, Barry Song, Guodong Xu

On Tue, Feb 8, 2022 at 4:14 AM Gautham R. Shenoy <gautham.shenoy@amd.com> wrote:
>
>
> On Fri, Feb 04, 2022 at 11:28:25PM +1300, Barry Song wrote:
>
> > > We already figured out that there are no idle CPUs in this cluster. So dont
> > > we gain performance by picking a idle CPU/core in the neighbouring cluster.
> > > If there are no idle CPU/core in the neighbouring cluster, then it does make
> > > sense to fallback on the current cluster.
> >
> > What you suggested is exactly the approach we have tried at the first beginning
> > during debugging. but we didn't gain performance according to benchmark, we
> > were actually losing. that is why we added this line to stop ping-pong:
> >          /* Don't ping-pong tasks in and out cluster frequently */
> >          if (cpus_share_resources(target, prev_cpu))
> >             return target;
> >
> > If we delete this, we are seeing a big loss of tbench while system
> > load is medium
> > and above.
>
> Thanks for clarifying this Barry. Indeed, if the workload is sensitive
> to data ping-ponging across L2 clusters, this heuristic makes sense. I
> was thinking of workloads that require lower tail latency, in which
> case exploring the larger LLC would have made more sense, assuming
> that the larger LLC has an idle core/CPU.
>
> In the absence of any hints from the workload, like something that
> Peter had previous suggested
> (https://lore.kernel.org/lkml/YVwnsrZWrnWHaoqN@hirez.programming.kicks-ass.net/),
> optimizing for cache-access seems to be the right thing to do.

Thanks, gautham.

Yep. Peter mentioned some hints like SCHED_BATCH and SCHED_IDLE.
To me, the case we are discussing seems to be more complicated than
applying some scheduling policy on separate tasks by SCHED_BATCH
or IDLE.

For example, in case we have a process, and this process has 20 threads.
thread0-9 might care about cache-coherence latency and want to avoid
ping-ponging, and thread10-thread19 might want to have tail-latency
as small as possible. So we need some way to tell kernel, "hey, bro, please
try to keep thread0-9 still as ping-ponging will hurt them while trying your
best to find idle cpu in a wider range for thread10-19". But it seems
SCHED_XXX as a scheduler policy hint can't tell kernel how to organize tasks
into groups, and is also incapable of telling kernel different groups have
different needs.

So it seems we want some special cgroups to organize tasks and we can apply
some special hints on each different group. for example, putting thread0-9
in a cgroup and thread10-19 in another, then:
1. apply "COMMUNCATION-SENSITVE" on the 1st group
2. apply "TAIL-LATENCY-SENTIVE" on the 2nd one.
I am not quite sure how to do this and if this can find its way into
the mainline.

On the other hand, for this particular patch, the most controversial
part is those
two lines to avoid ping-ponging, and I am seeing dropping this can hurt workload
like tbench only when system load is high, so I wonder if the approach[1] from
Chen Yu and Tim can somehow resolve the problem alternatively, thus we can
avoid the controversial part.
since their patch can also shrink the scanning range while llc load is high.

[1] https://lore.kernel.org/lkml/20220207034013.599214-1-yu.c.chen@intel.com/

>
>
> >
> > Thanks
> > Barry
>
> --
> Thanks and Regards
> gautham.

Thanks
Barry

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-02-08  5:42                     ` Barry Song
@ 2022-02-16  9:12                       ` Barry Song
  2022-02-16  9:19                         ` Song Bao Hua (Barry Song)
  0 siblings, 1 reply; 27+ messages in thread
From: Barry Song @ 2022-02-16  9:12 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Srikar Dronamraju, Yicong Yang, Peter Zijlstra, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Tim Chen, LKML, LAK,
	Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, prime.zeng, Jonathan Cameron, ego,
	Linuxarm, Barry Song, Guodong Xu

On Tue, Feb 8, 2022 at 6:42 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Feb 8, 2022 at 4:14 AM Gautham R. Shenoy <gautham.shenoy@amd.com> wrote:
> >
> >
> > On Fri, Feb 04, 2022 at 11:28:25PM +1300, Barry Song wrote:
> >
> > > > We already figured out that there are no idle CPUs in this cluster. So dont
> > > > we gain performance by picking a idle CPU/core in the neighbouring cluster.
> > > > If there are no idle CPU/core in the neighbouring cluster, then it does make
> > > > sense to fallback on the current cluster.
> > >
> > > What you suggested is exactly the approach we have tried at the first beginning
> > > during debugging. but we didn't gain performance according to benchmark, we
> > > were actually losing. that is why we added this line to stop ping-pong:
> > >          /* Don't ping-pong tasks in and out cluster frequently */
> > >          if (cpus_share_resources(target, prev_cpu))
> > >             return target;
> > >
> > > If we delete this, we are seeing a big loss of tbench while system
> > > load is medium
> > > and above.
> >
> > Thanks for clarifying this Barry. Indeed, if the workload is sensitive
> > to data ping-ponging across L2 clusters, this heuristic makes sense. I
> > was thinking of workloads that require lower tail latency, in which
> > case exploring the larger LLC would have made more sense, assuming
> > that the larger LLC has an idle core/CPU.
> >
> > In the absence of any hints from the workload, like something that
> > Peter had previous suggested
> > (https://lore.kernel.org/lkml/YVwnsrZWrnWHaoqN@hirez.programming.kicks-ass.net/),
> > optimizing for cache-access seems to be the right thing to do.
>
> Thanks, gautham.
>
> Yep. Peter mentioned some hints like SCHED_BATCH and SCHED_IDLE.
> To me, the case we are discussing seems to be more complicated than
> applying some scheduling policy on separate tasks by SCHED_BATCH
> or IDLE.
>
> For example, in case we have a process, and this process has 20 threads.
> thread0-9 might care about cache-coherence latency and want to avoid
> ping-ponging, and thread10-thread19 might want to have tail-latency
> as small as possible. So we need some way to tell kernel, "hey, bro, please
> try to keep thread0-9 still as ping-ponging will hurt them while trying your
> best to find idle cpu in a wider range for thread10-19". But it seems
> SCHED_XXX as a scheduler policy hint can't tell kernel how to organize tasks
> into groups, and is also incapable of telling kernel different groups have
> different needs.
>
> So it seems we want some special cgroups to organize tasks and we can apply
> some special hints on each different group. for example, putting thread0-9
> in a cgroup and thread10-19 in another, then:
> 1. apply "COMMUNCATION-SENSITVE" on the 1st group
> 2. apply "TAIL-LATENCY-SENTIVE" on the 2nd one.
> I am not quite sure how to do this and if this can find its way into
> the mainline.
>
> On the other hand, for this particular patch, the most controversial
> part is those
> two lines to avoid ping-ponging, and I am seeing dropping this can hurt workload
> like tbench only when system load is high, so I wonder if the approach[1] from
> Chen Yu and Tim can somehow resolve the problem alternatively, thus we can
> avoid the controversial part.
> since their patch can also shrink the scanning range while llc load is high.
>
> [1] https://lore.kernel.org/lkml/20220207034013.599214-1-yu.c.chen@intel.com/

Yicong's testing shows the patch from Chen Yu and Tim can somehow resolve the
problem and make sure there is no performance regression for tbench
while load is
high after we remove the code to avoid ping-pong:

5.17-rc1: vanilla
rc1 + chenyu: vanilla + chenyu's LLC overload patch
rc1+chenyu+cls: vanilla + chenyu's  patch + my this patchset
rc1+chenyu+cls-pingpong: vanilla + chenyu's patch + my this patchset -
the code avoiding ping-pong
rc1+cls: vanilla + my this patchset

tbench running on numa 0 &1:
                            5.17-rc1          rc1 + chenyu
rc1+chenyu+cls     rc1+chenyu+cls-pingpong  rc1+cls
Hmean     1        320.01 (   0.00%)      318.03 *  -0.62%*
357.15 *  11.61%*      375.43 *  17.32%*      378.44 *  18.26%*
Hmean     2        643.85 (   0.00%)      637.74 *  -0.95%*
714.36 *  10.95%*      745.82 *  15.84%*      752.52 *  16.88%*
Hmean     4       1287.36 (   0.00%)     1285.20 *  -0.17%*
1431.35 *  11.18%*     1481.71 *  15.10%*     1505.62 *  16.95%*
Hmean     8       2564.60 (   0.00%)     2551.02 *  -0.53%*
2812.74 *   9.68%*     2921.51 *  13.92%*     2955.29 *  15.23%*
Hmean     16      5195.69 (   0.00%)     5163.39 *  -0.62%*
5583.28 *   7.46%*     5726.08 *  10.21%*     5814.74 *  11.91%*
Hmean     32      9769.16 (   0.00%)     9815.63 *   0.48%*
10518.35 *   7.67%*    10852.89 *  11.09%*    10872.63 *  11.30%*
Hmean     64     15952.50 (   0.00%)    15780.41 *  -1.08%*
10608.36 * -33.50%*    17503.42 *   9.72%*    17281.98 *   8.33%*
Hmean     128    13113.77 (   0.00%)    12000.12 *  -8.49%*
13095.50 *  -0.14%*    13991.90 *   6.70%*    13895.20 *   5.96%*
Hmean     256    10997.59 (   0.00%)    12229.20 *  11.20%*
11902.60 *   8.23%*    12214.29 *  11.06%*    11244.69 *   2.25%*
Hmean     512    14623.60 (   0.00%)    15863.25 *   8.48%*
14103.38 *  -3.56%*    16422.56 *  12.30%*    15526.25 *   6.17%*

tbench running on numa 0 only:

                            5.17-rc1          rc1 + chenyu
rc1+chenyu+cls     rc1+chenyu+cls-pingpong   rc1+cls
Hmean     1        324.73 (   0.00%)      330.96 *   1.92%*
358.97 *  10.54%*      376.05 *  15.80%*      378.01 *  16.41%*
Hmean     2        645.36 (   0.00%)      643.13 *  -0.35%*
710.78 *  10.14%*      744.34 *  15.34%*      754.63 *  16.93%*
Hmean     4       1302.09 (   0.00%)     1297.11 *  -0.38%*
1425.22 *   9.46%*     1484.92 *  14.04%*     1507.54 *  15.78%*
Hmean     8       2612.03 (   0.00%)     2623.60 *   0.44%*
2843.15 *   8.85%*     2937.81 *  12.47%*     2982.57 *  14.19%*
Hmean     16      5307.12 (   0.00%)     5304.14 *  -0.06%*
5610.46 *   5.72%*     5763.24 *   8.59%*     5886.66 *  10.92%*
Hmean     32      9354.22 (   0.00%)     9738.21 *   4.11%*
9360.21 *   0.06%*     9699.05 *   3.69%*     9908.13 *   5.92%*
Hmean     64      7240.35 (   0.00%)     7210.75 *  -0.41%*
6992.70 *  -3.42%*     7321.52 *   1.12%*     7278.78 *   0.53%*
Hmean     128     6186.40 (   0.00%)     6314.89 *   2.08%*
6166.44 *  -0.32%*     6279.85 *   1.51%*     6187.85 (   0.02%)
Hmean     256     9231.40 (   0.00%)     9469.26 *   2.58%*
9134.42 *  -1.05%*     9322.88 *   0.99%*     9448.61 *   2.35%*
Hmean     512     8907.13 (   0.00%)     9130.46 *   2.51%*
9023.87 *   1.31%*     9276.19 *   4.14%*     9397.22 *   5.50%*

as you can see rc1+chenyu+cls-pingpong still shows similar improvement
like rc1+cls, in some
cases(256, 512 threads on numa0&1), it is even much better.

Thanks
Barry

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-02-16  9:12                       ` Barry Song
@ 2022-02-16  9:19                         ` Song Bao Hua (Barry Song)
  2022-02-16 10:00                           ` Yicong Yang
  0 siblings, 1 reply; 27+ messages in thread
From: Song Bao Hua (Barry Song) @ 2022-02-16  9:19 UTC (permalink / raw)
  To: Barry Song, Gautham R. Shenoy
  Cc: Srikar Dronamraju, yangyicong, Peter Zijlstra, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Tim Chen, LKML, LAK,
	Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, Zengtao (B),
	Jonathan Cameron, ego, Linuxarm, Guodong Xu



> -----Original Message-----
> From: Barry Song [mailto:21cnbao@gmail.com]
> Sent: Wednesday, February 16, 2022 10:13 PM
> To: Gautham R. Shenoy <gautham.shenoy@amd.com>
> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>; yangyicong
> <yangyicong@huawei.com>; Peter Zijlstra <peterz@infradead.org>; Ingo Molnar
> <mingo@redhat.com>; Juri Lelli <juri.lelli@redhat.com>; Vincent Guittot
> <vincent.guittot@linaro.org>; Tim Chen <tim.c.chen@linux.intel.com>; LKML
> <linux-kernel@vger.kernel.org>; LAK <linux-arm-kernel@lists.infradead.org>;
> Dietmar Eggemann <dietmar.eggemann@arm.com>; Steven Rostedt
> <rostedt@goodmis.org>; Ben Segall <bsegall@google.com>; Daniel Bristot de
> Oliveira <bristot@redhat.com>; Zengtao (B) <prime.zeng@hisilicon.com>;
> Jonathan Cameron <jonathan.cameron@huawei.com>; ego@linux.vnet.ibm.com;
> Linuxarm <linuxarm@huawei.com>; Song Bao Hua (Barry Song)
> <song.bao.hua@hisilicon.com>; Guodong Xu <guodong.xu@linaro.org>
> Subject: Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in
> wake-up path
> 
> On Tue, Feb 8, 2022 at 6:42 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Tue, Feb 8, 2022 at 4:14 AM Gautham R. Shenoy <gautham.shenoy@amd.com>
> wrote:
> > >
> > >
> > > On Fri, Feb 04, 2022 at 11:28:25PM +1300, Barry Song wrote:
> > >
> > > > > We already figured out that there are no idle CPUs in this cluster.
> So dont
> > > > > we gain performance by picking a idle CPU/core in the neighbouring cluster.
> > > > > If there are no idle CPU/core in the neighbouring cluster, then it does
> make
> > > > > sense to fallback on the current cluster.
> > > >
> > > > What you suggested is exactly the approach we have tried at the first
> beginning
> > > > during debugging. but we didn't gain performance according to benchmark,
> we
> > > > were actually losing. that is why we added this line to stop ping-pong:
> > > >          /* Don't ping-pong tasks in and out cluster frequently */
> > > >          if (cpus_share_resources(target, prev_cpu))
> > > >             return target;
> > > >
> > > > If we delete this, we are seeing a big loss of tbench while system
> > > > load is medium
> > > > and above.
> > >
> > > Thanks for clarifying this Barry. Indeed, if the workload is sensitive
> > > to data ping-ponging across L2 clusters, this heuristic makes sense. I
> > > was thinking of workloads that require lower tail latency, in which
> > > case exploring the larger LLC would have made more sense, assuming
> > > that the larger LLC has an idle core/CPU.
> > >
> > > In the absence of any hints from the workload, like something that
> > > Peter had previous suggested
> > >
> (https://lore.kernel.org/lkml/YVwnsrZWrnWHaoqN@hirez.programming.kicks-ass
> .net/),
> > > optimizing for cache-access seems to be the right thing to do.
> >
> > Thanks, gautham.
> >
> > Yep. Peter mentioned some hints like SCHED_BATCH and SCHED_IDLE.
> > To me, the case we are discussing seems to be more complicated than
> > applying some scheduling policy on separate tasks by SCHED_BATCH
> > or IDLE.
> >
> > For example, in case we have a process, and this process has 20 threads.
> > thread0-9 might care about cache-coherence latency and want to avoid
> > ping-ponging, and thread10-thread19 might want to have tail-latency
> > as small as possible. So we need some way to tell kernel, "hey, bro, please
> > try to keep thread0-9 still as ping-ponging will hurt them while trying your
> > best to find idle cpu in a wider range for thread10-19". But it seems
> > SCHED_XXX as a scheduler policy hint can't tell kernel how to organize tasks
> > into groups, and is also incapable of telling kernel different groups have
> > different needs.
> >
> > So it seems we want some special cgroups to organize tasks and we can apply
> > some special hints on each different group. for example, putting thread0-9
> > in a cgroup and thread10-19 in another, then:
> > 1. apply "COMMUNCATION-SENSITVE" on the 1st group
> > 2. apply "TAIL-LATENCY-SENTIVE" on the 2nd one.
> > I am not quite sure how to do this and if this can find its way into
> > the mainline.
> >
> > On the other hand, for this particular patch, the most controversial
> > part is those
> > two lines to avoid ping-ponging, and I am seeing dropping this can hurt workload
> > like tbench only when system load is high, so I wonder if the approach[1]
> from
> > Chen Yu and Tim can somehow resolve the problem alternatively, thus we can
> > avoid the controversial part.
> > since their patch can also shrink the scanning range while llc load is high.
> >
> > [1]
> https://lore.kernel.org/lkml/20220207034013.599214-1-yu.c.chen@intel.com/
> 
> Yicong's testing shows the patch from Chen Yu and Tim can somehow resolve the
> problem and make sure there is no performance regression for tbench
> while load is
> high after we remove the code to avoid ping-pong:
> 
> 5.17-rc1: vanilla
> rc1 + chenyu: vanilla + chenyu's LLC overload patch
> rc1+chenyu+cls: vanilla + chenyu's  patch + my this patchset
> rc1+chenyu+cls-pingpong: vanilla + chenyu's patch + my this patchset -
> the code avoiding ping-pong
> rc1+cls: vanilla + my this patchset
> 
> tbench running on numa 0 &1:
>                             5.17-rc1          rc1 + chenyu
> rc1+chenyu+cls     rc1+chenyu+cls-pingpong  rc1+cls
> Hmean     1        320.01 (   0.00%)      318.03 *  -0.62%*
> 357.15 *  11.61%*      375.43 *  17.32%*      378.44 *  18.26%*
> Hmean     2        643.85 (   0.00%)      637.74 *  -0.95%*
> 714.36 *  10.95%*      745.82 *  15.84%*      752.52 *  16.88%*
> Hmean     4       1287.36 (   0.00%)     1285.20 *  -0.17%*
> 1431.35 *  11.18%*     1481.71 *  15.10%*     1505.62 *  16.95%*
> Hmean     8       2564.60 (   0.00%)     2551.02 *  -0.53%*
> 2812.74 *   9.68%*     2921.51 *  13.92%*     2955.29 *  15.23%*
> Hmean     16      5195.69 (   0.00%)     5163.39 *  -0.62%*
> 5583.28 *   7.46%*     5726.08 *  10.21%*     5814.74 *  11.91%*
> Hmean     32      9769.16 (   0.00%)     9815.63 *   0.48%*
> 10518.35 *   7.67%*    10852.89 *  11.09%*    10872.63 *  11.30%*
> Hmean     64     15952.50 (   0.00%)    15780.41 *  -1.08%*
> 10608.36 * -33.50%*    17503.42 *   9.72%*    17281.98 *   8.33%*
> Hmean     128    13113.77 (   0.00%)    12000.12 *  -8.49%*
> 13095.50 *  -0.14%*    13991.90 *   6.70%*    13895.20 *   5.96%*
> Hmean     256    10997.59 (   0.00%)    12229.20 *  11.20%*
> 11902.60 *   8.23%*    12214.29 *  11.06%*    11244.69 *   2.25%*
> Hmean     512    14623.60 (   0.00%)    15863.25 *   8.48%*
> 14103.38 *  -3.56%*    16422.56 *  12.30%*    15526.25 *   6.17%*
> 
> tbench running on numa 0 only:
> 
>                             5.17-rc1          rc1 + chenyu
> rc1+chenyu+cls     rc1+chenyu+cls-pingpong   rc1+cls
> Hmean     1        324.73 (   0.00%)      330.96 *   1.92%*
> 358.97 *  10.54%*      376.05 *  15.80%*      378.01 *  16.41%*
> Hmean     2        645.36 (   0.00%)      643.13 *  -0.35%*
> 710.78 *  10.14%*      744.34 *  15.34%*      754.63 *  16.93%*
> Hmean     4       1302.09 (   0.00%)     1297.11 *  -0.38%*
> 1425.22 *   9.46%*     1484.92 *  14.04%*     1507.54 *  15.78%*
> Hmean     8       2612.03 (   0.00%)     2623.60 *   0.44%*
> 2843.15 *   8.85%*     2937.81 *  12.47%*     2982.57 *  14.19%*
> Hmean     16      5307.12 (   0.00%)     5304.14 *  -0.06%*
> 5610.46 *   5.72%*     5763.24 *   8.59%*     5886.66 *  10.92%*
> Hmean     32      9354.22 (   0.00%)     9738.21 *   4.11%*
> 9360.21 *   0.06%*     9699.05 *   3.69%*     9908.13 *   5.92%*
> Hmean     64      7240.35 (   0.00%)     7210.75 *  -0.41%*
> 6992.70 *  -3.42%*     7321.52 *   1.12%*     7278.78 *   0.53%*
> Hmean     128     6186.40 (   0.00%)     6314.89 *   2.08%*
> 6166.44 *  -0.32%*     6279.85 *   1.51%*     6187.85 (   0.02%)
> Hmean     256     9231.40 (   0.00%)     9469.26 *   2.58%*
> 9134.42 *  -1.05%*     9322.88 *   0.99%*     9448.61 *   2.35%*
> Hmean     512     8907.13 (   0.00%)     9130.46 *   2.51%*
> 9023.87 *   1.31%*     9276.19 *   4.14%*     9397.22 *   5.50%*
> 

Sorry, it seems the format is broken. Let me re-post the data.

 5.17-rc1: vanilla
 rc1 + chenyu: vanilla + chenyu's LLC overload patch
 rc1+chenyu+cls: vanilla + chenyu's  patch + my this patchset
 rc1+chenyu+cls-pingpong: vanilla + chenyu's patch + my this patchset - the code avoiding ping-pong
 rc1+cls: vanilla + my this patchset

tbench running on numa 0&1:
                            5.17-rc1          rc1 + chenyu          rc1+chenyu+cls     rc1+chenyu+cls-pingpong  rc1+cls
Hmean     1        320.01 (   0.00%)      318.03 *  -0.62%*      357.15 *  11.61%*      375.43 *  17.32%*      378.44 *  18.26%*
Hmean     2        643.85 (   0.00%)      637.74 *  -0.95%*      714.36 *  10.95%*      745.82 *  15.84%*      752.52 *  16.88%*
Hmean     4       1287.36 (   0.00%)     1285.20 *  -0.17%*     1431.35 *  11.18%*     1481.71 *  15.10%*     1505.62 *  16.95%*
Hmean     8       2564.60 (   0.00%)     2551.02 *  -0.53%*     2812.74 *   9.68%*     2921.51 *  13.92%*     2955.29 *  15.23%*
Hmean     16      5195.69 (   0.00%)     5163.39 *  -0.62%*     5583.28 *   7.46%*     5726.08 *  10.21%*     5814.74 *  11.91%*
Hmean     32      9769.16 (   0.00%)     9815.63 *   0.48%*    10518.35 *   7.67%*    10852.89 *  11.09%*    10872.63 *  11.30%*
Hmean     64     15952.50 (   0.00%)    15780.41 *  -1.08%*    10608.36 * -33.50%*    17503.42 *   9.72%*    17281.98 *   8.33%*
Hmean     128    13113.77 (   0.00%)    12000.12 *  -8.49%*    13095.50 *  -0.14%*    13991.90 *   6.70%*    13895.20 *   5.96%*
Hmean     256    10997.59 (   0.00%)    12229.20 *  11.20%*    11902.60 *   8.23%*    12214.29 *  11.06%*    11244.69 *   2.25%*
Hmean     512    14623.60 (   0.00%)    15863.25 *   8.48%*    14103.38 *  -3.56%*    16422.56 *  12.30%*    15526.25 *   6.17%*

tbench running on numa 0 only:
                            5.17-rc1          rc1 + chenyu          rc1+chenyu+cls     rc1+chenyu+cls-pingpong   rc1+cls
Hmean     1        324.73 (   0.00%)      330.96 *   1.92%*      358.97 *  10.54%*      376.05 *  15.80%*      378.01 *  16.41%*
Hmean     2        645.36 (   0.00%)      643.13 *  -0.35%*      710.78 *  10.14%*      744.34 *  15.34%*      754.63 *  16.93%*
Hmean     4       1302.09 (   0.00%)     1297.11 *  -0.38%*     1425.22 *   9.46%*     1484.92 *  14.04%*     1507.54 *  15.78%*
Hmean     8       2612.03 (   0.00%)     2623.60 *   0.44%*     2843.15 *   8.85%*     2937.81 *  12.47%*     2982.57 *  14.19%*
Hmean     16      5307.12 (   0.00%)     5304.14 *  -0.06%*     5610.46 *   5.72%*     5763.24 *   8.59%*     5886.66 *  10.92%*
Hmean     32      9354.22 (   0.00%)     9738.21 *   4.11%*     9360.21 *   0.06%*     9699.05 *   3.69%*     9908.13 *   5.92%*
Hmean     64      7240.35 (   0.00%)     7210.75 *  -0.41%*     6992.70 *  -3.42%*     7321.52 *   1.12%*     7278.78 *   0.53%*
Hmean     128     6186.40 (   0.00%)     6314.89 *   2.08%*     6166.44 *  -0.32%*     6279.85 *   1.51%*     6187.85 (   0.02%)
Hmean     256     9231.40 (   0.00%)     9469.26 *   2.58%*     9134.42 *  -1.05%*     9322.88 *   0.99%*     9448.61 *   2.35%*
Hmean     512     8907.13 (   0.00%)     9130.46 *   2.51%*     9023.87 *   1.31%*     9276.19 *   4.14%*     9397.22 *   5.50%*

> like rc1+cls, in some
> cases(256, 512 threads on numa0&1), it is even much better.
> 
> Thanks
> Barry

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-02-16  9:19                         ` Song Bao Hua (Barry Song)
@ 2022-02-16 10:00                           ` Yicong Yang
  2022-02-17 18:00                             ` Tim Chen
  0 siblings, 1 reply; 27+ messages in thread
From: Yicong Yang @ 2022-02-16 10:00 UTC (permalink / raw)
  To: Song Bao Hua (Barry Song), Barry Song, Gautham R. Shenoy
  Cc: yangyicong, Srikar Dronamraju, Peter Zijlstra, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Tim Chen, LKML, LAK,
	Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Daniel Bristot de Oliveira, Zengtao (B),
	Jonathan Cameron, ego, Linuxarm, Guodong Xu, Chen Yu

On 2022/2/16 17:19, Song Bao Hua (Barry Song) wrote:
> 
> 
>> -----Original Message-----
>> From: Barry Song [mailto:21cnbao@gmail.com]
>> Sent: Wednesday, February 16, 2022 10:13 PM
>> To: Gautham R. Shenoy <gautham.shenoy@amd.com>
>> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>; yangyicong
>> <yangyicong@huawei.com>; Peter Zijlstra <peterz@infradead.org>; Ingo Molnar
>> <mingo@redhat.com>; Juri Lelli <juri.lelli@redhat.com>; Vincent Guittot
>> <vincent.guittot@linaro.org>; Tim Chen <tim.c.chen@linux.intel.com>; LKML
>> <linux-kernel@vger.kernel.org>; LAK <linux-arm-kernel@lists.infradead.org>;
>> Dietmar Eggemann <dietmar.eggemann@arm.com>; Steven Rostedt
>> <rostedt@goodmis.org>; Ben Segall <bsegall@google.com>; Daniel Bristot de
>> Oliveira <bristot@redhat.com>; Zengtao (B) <prime.zeng@hisilicon.com>;
>> Jonathan Cameron <jonathan.cameron@huawei.com>; ego@linux.vnet.ibm.com;
>> Linuxarm <linuxarm@huawei.com>; Song Bao Hua (Barry Song)
>> <song.bao.hua@hisilicon.com>; Guodong Xu <guodong.xu@linaro.org>
>> Subject: Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in
>> wake-up path
>>
>> On Tue, Feb 8, 2022 at 6:42 PM Barry Song <21cnbao@gmail.com> wrote:
>>>
>>> On Tue, Feb 8, 2022 at 4:14 AM Gautham R. Shenoy <gautham.shenoy@amd.com>
>> wrote:
>>>>
>>>>
>>>> On Fri, Feb 04, 2022 at 11:28:25PM +1300, Barry Song wrote:
>>>>
>>>>>> We already figured out that there are no idle CPUs in this cluster.
>> So dont
>>>>>> we gain performance by picking a idle CPU/core in the neighbouring cluster.
>>>>>> If there are no idle CPU/core in the neighbouring cluster, then it does
>> make
>>>>>> sense to fallback on the current cluster.
>>>>>
>>>>> What you suggested is exactly the approach we have tried at the first
>> beginning
>>>>> during debugging. but we didn't gain performance according to benchmark,
>> we
>>>>> were actually losing. that is why we added this line to stop ping-pong:
>>>>>          /* Don't ping-pong tasks in and out cluster frequently */
>>>>>          if (cpus_share_resources(target, prev_cpu))
>>>>>             return target;
>>>>>
>>>>> If we delete this, we are seeing a big loss of tbench while system
>>>>> load is medium
>>>>> and above.
>>>>
>>>> Thanks for clarifying this Barry. Indeed, if the workload is sensitive
>>>> to data ping-ponging across L2 clusters, this heuristic makes sense. I
>>>> was thinking of workloads that require lower tail latency, in which
>>>> case exploring the larger LLC would have made more sense, assuming
>>>> that the larger LLC has an idle core/CPU.
>>>>
>>>> In the absence of any hints from the workload, like something that
>>>> Peter had previous suggested
>>>>
>> (https://lore.kernel.org/lkml/YVwnsrZWrnWHaoqN@hirez.programming.kicks-ass
>> .net/),
>>>> optimizing for cache-access seems to be the right thing to do.
>>>
>>> Thanks, gautham.
>>>
>>> Yep. Peter mentioned some hints like SCHED_BATCH and SCHED_IDLE.
>>> To me, the case we are discussing seems to be more complicated than
>>> applying some scheduling policy on separate tasks by SCHED_BATCH
>>> or IDLE.
>>>
>>> For example, in case we have a process, and this process has 20 threads.
>>> thread0-9 might care about cache-coherence latency and want to avoid
>>> ping-ponging, and thread10-thread19 might want to have tail-latency
>>> as small as possible. So we need some way to tell kernel, "hey, bro, please
>>> try to keep thread0-9 still as ping-ponging will hurt them while trying your
>>> best to find idle cpu in a wider range for thread10-19". But it seems
>>> SCHED_XXX as a scheduler policy hint can't tell kernel how to organize tasks
>>> into groups, and is also incapable of telling kernel different groups have
>>> different needs.
>>>
>>> So it seems we want some special cgroups to organize tasks and we can apply
>>> some special hints on each different group. for example, putting thread0-9
>>> in a cgroup and thread10-19 in another, then:
>>> 1. apply "COMMUNCATION-SENSITVE" on the 1st group
>>> 2. apply "TAIL-LATENCY-SENTIVE" on the 2nd one.
>>> I am not quite sure how to do this and if this can find its way into
>>> the mainline.
>>>
>>> On the other hand, for this particular patch, the most controversial
>>> part is those
>>> two lines to avoid ping-ponging, and I am seeing dropping this can hurt workload
>>> like tbench only when system load is high, so I wonder if the approach[1]
>> from
>>> Chen Yu and Tim can somehow resolve the problem alternatively, thus we can
>>> avoid the controversial part.
>>> since their patch can also shrink the scanning range while llc load is high.
>>>
>>> [1]
>> https://lore.kernel.org/lkml/20220207034013.599214-1-yu.c.chen@intel.com/
>>
>> Yicong's testing shows the patch from Chen Yu and Tim can somehow resolve the
>> problem and make sure there is no performance regression for tbench
>> while load is
>> high after we remove the code to avoid ping-pong:
>>
>> 5.17-rc1: vanilla
>> rc1 + chenyu: vanilla + chenyu's LLC overload patch
>> rc1+chenyu+cls: vanilla + chenyu's  patch + my this patchset
>> rc1+chenyu+cls-pingpong: vanilla + chenyu's patch + my this patchset -
>> the code avoiding ping-pong
>> rc1+cls: vanilla + my this patchset
>>
>> tbench running on numa 0 &1:
>>                             5.17-rc1          rc1 + chenyu
>> rc1+chenyu+cls     rc1+chenyu+cls-pingpong  rc1+cls
>> Hmean     1        320.01 (   0.00%)      318.03 *  -0.62%*
>> 357.15 *  11.61%*      375.43 *  17.32%*      378.44 *  18.26%*
>> Hmean     2        643.85 (   0.00%)      637.74 *  -0.95%*
>> 714.36 *  10.95%*      745.82 *  15.84%*      752.52 *  16.88%*
>> Hmean     4       1287.36 (   0.00%)     1285.20 *  -0.17%*
>> 1431.35 *  11.18%*     1481.71 *  15.10%*     1505.62 *  16.95%*
>> Hmean     8       2564.60 (   0.00%)     2551.02 *  -0.53%*
>> 2812.74 *   9.68%*     2921.51 *  13.92%*     2955.29 *  15.23%*
>> Hmean     16      5195.69 (   0.00%)     5163.39 *  -0.62%*
>> 5583.28 *   7.46%*     5726.08 *  10.21%*     5814.74 *  11.91%*
>> Hmean     32      9769.16 (   0.00%)     9815.63 *   0.48%*
>> 10518.35 *   7.67%*    10852.89 *  11.09%*    10872.63 *  11.30%*
>> Hmean     64     15952.50 (   0.00%)    15780.41 *  -1.08%*
>> 10608.36 * -33.50%*    17503.42 *   9.72%*    17281.98 *   8.33%*
>> Hmean     128    13113.77 (   0.00%)    12000.12 *  -8.49%*
>> 13095.50 *  -0.14%*    13991.90 *   6.70%*    13895.20 *   5.96%*
>> Hmean     256    10997.59 (   0.00%)    12229.20 *  11.20%*
>> 11902.60 *   8.23%*    12214.29 *  11.06%*    11244.69 *   2.25%*
>> Hmean     512    14623.60 (   0.00%)    15863.25 *   8.48%*
>> 14103.38 *  -3.56%*    16422.56 *  12.30%*    15526.25 *   6.17%*
>>
>> tbench running on numa 0 only:
>>
>>                             5.17-rc1          rc1 + chenyu
>> rc1+chenyu+cls     rc1+chenyu+cls-pingpong   rc1+cls
>> Hmean     1        324.73 (   0.00%)      330.96 *   1.92%*
>> 358.97 *  10.54%*      376.05 *  15.80%*      378.01 *  16.41%*
>> Hmean     2        645.36 (   0.00%)      643.13 *  -0.35%*
>> 710.78 *  10.14%*      744.34 *  15.34%*      754.63 *  16.93%*
>> Hmean     4       1302.09 (   0.00%)     1297.11 *  -0.38%*
>> 1425.22 *   9.46%*     1484.92 *  14.04%*     1507.54 *  15.78%*
>> Hmean     8       2612.03 (   0.00%)     2623.60 *   0.44%*
>> 2843.15 *   8.85%*     2937.81 *  12.47%*     2982.57 *  14.19%*
>> Hmean     16      5307.12 (   0.00%)     5304.14 *  -0.06%*
>> 5610.46 *   5.72%*     5763.24 *   8.59%*     5886.66 *  10.92%*
>> Hmean     32      9354.22 (   0.00%)     9738.21 *   4.11%*
>> 9360.21 *   0.06%*     9699.05 *   3.69%*     9908.13 *   5.92%*
>> Hmean     64      7240.35 (   0.00%)     7210.75 *  -0.41%*
>> 6992.70 *  -3.42%*     7321.52 *   1.12%*     7278.78 *   0.53%*
>> Hmean     128     6186.40 (   0.00%)     6314.89 *   2.08%*
>> 6166.44 *  -0.32%*     6279.85 *   1.51%*     6187.85 (   0.02%)
>> Hmean     256     9231.40 (   0.00%)     9469.26 *   2.58%*
>> 9134.42 *  -1.05%*     9322.88 *   0.99%*     9448.61 *   2.35%*
>> Hmean     512     8907.13 (   0.00%)     9130.46 *   2.51%*
>> 9023.87 *   1.31%*     9276.19 *   4.14%*     9397.22 *   5.50%*
>>
> 
> Sorry, it seems the format is broken. Let me re-post the data.
> 
>  5.17-rc1: vanilla
>  rc1 + chenyu: vanilla + chenyu's LLC overload patch
>  rc1+chenyu+cls: vanilla + chenyu's  patch + my this patchset
>  rc1+chenyu+cls-pingpong: vanilla + chenyu's patch + my this patchset - the code avoiding ping-pong
>  rc1+cls: vanilla + my this patchset
> 
> tbench running on numa 0&1:
>                             5.17-rc1          rc1 + chenyu          rc1+chenyu+cls     rc1+chenyu+cls-pingpong  rc1+cls
> Hmean     1        320.01 (   0.00%)      318.03 *  -0.62%*      357.15 *  11.61%*      375.43 *  17.32%*      378.44 *  18.26%*
> Hmean     2        643.85 (   0.00%)      637.74 *  -0.95%*      714.36 *  10.95%*      745.82 *  15.84%*      752.52 *  16.88%*
> Hmean     4       1287.36 (   0.00%)     1285.20 *  -0.17%*     1431.35 *  11.18%*     1481.71 *  15.10%*     1505.62 *  16.95%*
> Hmean     8       2564.60 (   0.00%)     2551.02 *  -0.53%*     2812.74 *   9.68%*     2921.51 *  13.92%*     2955.29 *  15.23%*
> Hmean     16      5195.69 (   0.00%)     5163.39 *  -0.62%*     5583.28 *   7.46%*     5726.08 *  10.21%*     5814.74 *  11.91%*
> Hmean     32      9769.16 (   0.00%)     9815.63 *   0.48%*    10518.35 *   7.67%*    10852.89 *  11.09%*    10872.63 *  11.30%*
> Hmean     64     15952.50 (   0.00%)    15780.41 *  -1.08%*    10608.36 * -33.50%*    17503.42 *   9.72%*    17281.98 *   8.33%*
> Hmean     128    13113.77 (   0.00%)    12000.12 *  -8.49%*    13095.50 *  -0.14%*    13991.90 *   6.70%*    13895.20 *   5.96%*
> Hmean     256    10997.59 (   0.00%)    12229.20 *  11.20%*    11902.60 *   8.23%*    12214.29 *  11.06%*    11244.69 *   2.25%*
> Hmean     512    14623.60 (   0.00%)    15863.25 *   8.48%*    14103.38 *  -3.56%*    16422.56 *  12.30%*    15526.25 *   6.17%*
> 

Yes I think it'll also benefit for the cluster's conditon.

But 128 threads seems like a weired point that Chen's patch on 5.17-rc1 (without this series) causes degradation,
which in Chen's tbench test it doesn't cause that much when the 2 * cpu number == threads[*]:

case            	load    	baseline(std%)	compare%( std%)
loopback        	thread-224	 1.00 (  0.17)	 +2.30 (  0.10)

[*] https://lore.kernel.org/lkml/20220207034013.599214-1-yu.c.chen@intel.com/

> tbench running on numa 0 only:
>                             5.17-rc1          rc1 + chenyu          rc1+chenyu+cls     rc1+chenyu+cls-pingpong   rc1+cls
> Hmean     1        324.73 (   0.00%)      330.96 *   1.92%*      358.97 *  10.54%*      376.05 *  15.80%*      378.01 *  16.41%*
> Hmean     2        645.36 (   0.00%)      643.13 *  -0.35%*      710.78 *  10.14%*      744.34 *  15.34%*      754.63 *  16.93%*
> Hmean     4       1302.09 (   0.00%)     1297.11 *  -0.38%*     1425.22 *   9.46%*     1484.92 *  14.04%*     1507.54 *  15.78%*
> Hmean     8       2612.03 (   0.00%)     2623.60 *   0.44%*     2843.15 *   8.85%*     2937.81 *  12.47%*     2982.57 *  14.19%*
> Hmean     16      5307.12 (   0.00%)     5304.14 *  -0.06%*     5610.46 *   5.72%*     5763.24 *   8.59%*     5886.66 *  10.92%*
> Hmean     32      9354.22 (   0.00%)     9738.21 *   4.11%*     9360.21 *   0.06%*     9699.05 *   3.69%*     9908.13 *   5.92%*
> Hmean     64      7240.35 (   0.00%)     7210.75 *  -0.41%*     6992.70 *  -3.42%*     7321.52 *   1.12%*     7278.78 *   0.53%*
> Hmean     128     6186.40 (   0.00%)     6314.89 *   2.08%*     6166.44 *  -0.32%*     6279.85 *   1.51%*     6187.85 (   0.02%)
> Hmean     256     9231.40 (   0.00%)     9469.26 *   2.58%*     9134.42 *  -1.05%*     9322.88 *   0.99%*     9448.61 *   2.35%*
> Hmean     512     8907.13 (   0.00%)     9130.46 *   2.51%*     9023.87 *   1.31%*     9276.19 *   4.14%*     9397.22 *   5.50%*
> 
>> like rc1+cls, in some
>> cases(256, 512 threads on numa0&1), it is even much better.
>>
>> Thanks
>> Barry

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-02-16 10:00                           ` Yicong Yang
@ 2022-02-17 18:00                             ` Tim Chen
  0 siblings, 0 replies; 27+ messages in thread
From: Tim Chen @ 2022-02-17 18:00 UTC (permalink / raw)
  To: Yicong Yang, Song Bao Hua (Barry Song), Barry Song, Gautham R. Shenoy
  Cc: yangyicong, Srikar Dronamraju, Peter Zijlstra, Ingo Molnar,
	Juri Lelli, Vincent Guittot, LKML, LAK, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira,
	Zengtao (B),
	Jonathan Cameron, ego, Linuxarm, Guodong Xu, Chen Yu

On Wed, 2022-02-16 at 18:00 +0800, Yicong Yang wrote:
> On 2022/2/16 17:19, Song Bao Hua (Barry Song) wrote:
> > 
> > tbench running on numa 0&1:
> >                             5.17-rc1          rc1 + chenyu          rc1+chenyu+cls     rc1+chenyu+cls-pingpong  rc1+cls
> > Hmean     1        320.01 (   0.00%)      318.03 *  -0.62%*      357.15 *  11.61%*      375.43 *  17.32%*      378.44 *  18.26%*
> > Hmean     2        643.85 (   0.00%)      637.74 *  -0.95%*      714.36 *  10.95%*      745.82 *  15.84%*      752.52 *  16.88%*
> > Hmean     4       1287.36 (   0.00%)     1285.20 *  -0.17%*     1431.35 *  11.18%*     1481.71 *  15.10%*     1505.62 *  16.95%*
> > Hmean     8       2564.60 (   0.00%)     2551.02 *  -0.53%*     2812.74 *   9.68%*     2921.51 *  13.92%*     2955.29 *  15.23%*
> > Hmean     16      5195.69 (   0.00%)     5163.39 *  -0.62%*     5583.28 *   7.46%*     5726.08 *  10.21%*     5814.74 *  11.91%*
> > Hmean     32      9769.16 (   0.00%)     9815.63 *   0.48%*    10518.35 *   7.67%*    10852.89 *  11.09%*    10872.63 *  11.30%*
> > Hmean     64     15952.50 (   0.00%)    15780.41 *  -1.08%*    10608.36 * -33.50%*    17503.42 *   9.72%*    17281.98 *   8.33%*
> > Hmean     128    13113.77 (   0.00%)    12000.12 *  -8.49%*    13095.50 *  -0.14%*    13991.90 *   6.70%*    13895.20 *   5.96%*
> > Hmean     256    10997.59 (   0.00%)    12229.20 *  11.20%*    11902.60 *   8.23%*    12214.29 *  11.06%*    11244.69 *   2.25%*
> > Hmean     512    14623.60 (   0.00%)    15863.25 *   8.48%*    14103.38 *  -3.56%*    16422.56 *  12.30%*    15526.25 *   6.17%*
> > 
> 
> Yes I think it'll also benefit for the cluster's conditon.
> 
> But 128 threads seems like a weired point that Chen's patch on 5.17-rc1 (without this series) causes degradation,
> which in Chen's tbench test it doesn't cause that much when the 2 * cpu number == threads[*]:
> 

From the data, it seems like Chen Yu's patch benefits the overloaded condition (as expected) while
the cluster scheduling has benefit most at the low end (also expected).  It is nice that
by combining these two approaches we can get the most benefit.
 
Chen Yu's patch has a hard transition to stop search for idle CPU at about 85% utilization.
So we may be hitting that knee and we may benefit from not stopping search completely
but reducing number of CPUs searched, as Peter pointed out.

Tim 

> case            	load    	baseline(std%)	compare%( std%)
> loopback        	thread-224	 1.00 (  0.17)	 +2.30 (  0.10)
> 
> [*] https://lore.kernel.org/lkml/20220207034013.599214-1-yu.c.chen@intel.com/
> 
> > tbench running on numa 0 only:
> >                             5.17-rc1          rc1 + chenyu          rc1+chenyu+cls     rc1+chenyu+cls-pingpong   rc1+cls
> > Hmean     1        324.73 (   0.00%)      330.96 *   1.92%*      358.97 *  10.54%*      376.05 *  15.80%*      378.01 *  16.41%*
> > Hmean     2        645.36 (   0.00%)      643.13 *  -0.35%*      710.78 *  10.14%*      744.34 *  15.34%*      754.63 *  16.93%*
> > Hmean     4       1302.09 (   0.00%)     1297.11 *  -0.38%*     1425.22 *   9.46%*     1484.92 *  14.04%*     1507.54 *  15.78%*
> > Hmean     8       2612.03 (   0.00%)     2623.60 *   0.44%*     2843.15 *   8.85%*     2937.81 *  12.47%*     2982.57 *  14.19%*
> > Hmean     16      5307.12 (   0.00%)     5304.14 *  -0.06%*     5610.46 *   5.72%*     5763.24 *   8.59%*     5886.66 *  10.92%*
> > Hmean     32      9354.22 (   0.00%)     9738.21 *   4.11%*     9360.21 *   0.06%*     9699.05 *   3.69%*     9908.13 *   5.92%*
> > Hmean     64      7240.35 (   0.00%)     7210.75 *  -0.41%*     6992.70 *  -3.42%*     7321.52 *   1.12%*     7278.78 *   0.53%*
> > Hmean     128     6186.40 (   0.00%)     6314.89 *   2.08%*     6166.44 *  -0.32%*     6279.85 *   1.51%*     6187.85 (   0.02%)
> > Hmean     256     9231.40 (   0.00%)     9469.26 *   2.58%*     9134.42 *  -1.05%*     9322.88 *   0.99%*     9448.61 *   2.35%*
> > Hmean     512     8907.13 (   0.00%)     9130.46 *   2.51%*     9023.87 *   1.31%*     9276.19 *   4.14%*     9397.22 *   5.50%*
> > 
> > > like rc1+cls, in some
> > > cases(256, 512 threads on numa0&1), it is even much better.
> > > 
> > > Thanks
> > > Barry


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2022-02-17 18:00 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-26  8:09 [PATCH v2 0/2] sched/fair: Wake task within the cluster when possible Yicong Yang
2022-01-26  8:09 ` [PATCH v2 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API Yicong Yang
2022-01-27 15:26   ` Gautham R. Shenoy
2022-01-26  8:09 ` [PATCH v2 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
2022-01-27  1:14   ` Tim Chen
2022-01-27  2:02     ` Yicong Yang
2022-01-27  2:30       ` Tim Chen
2022-01-27  2:36         ` Tim Chen
2022-01-27  3:05           ` Yicong Yang
2022-01-27 15:41   ` Gautham R. Shenoy
2022-01-27 20:21     ` Barry Song
2022-01-28  7:13       ` Srikar Dronamraju
2022-01-27 18:40         ` Barry Song
2022-02-01  9:38           ` Srikar Dronamraju
2022-02-01 20:20             ` Barry Song
2022-02-04  7:33               ` Srikar Dronamraju
2022-02-04 10:28                 ` Barry Song
2022-02-04 10:49                   ` Barry Song
2022-02-04 17:41                     ` Tim Chen
2022-02-05 17:16                       ` Chen Yu
2022-02-06  0:26                         ` Barry Song
2022-02-07 15:14                   ` Gautham R. Shenoy
2022-02-08  5:42                     ` Barry Song
2022-02-16  9:12                       ` Barry Song
2022-02-16  9:19                         ` Song Bao Hua (Barry Song)
2022-02-16 10:00                           ` Yicong Yang
2022-02-17 18:00                             ` Tim Chen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).