* [PATCH v3 0/2] sched/fair: Wake task within the cluster when possible
@ 2022-06-08  9:57 Yicong Yang
  2022-06-08  9:57 ` [PATCH v3 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API Yicong Yang
  2022-06-08  9:57 ` [PATCH v3 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
From: Yicong Yang @ 2022-06-08  9:57 UTC
  To: peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen,
	gautham.shenoy, linux-kernel, linux-arm-kernel
  Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
	yangyicong, jonathan.cameron, ego, srikar, linuxarm, 21cnbao,
	guodong.xu, hesham.almatary, john.garry, shenyang39

This is the follow-up work to support the cluster scheduler. We have
previously added the cluster level to the scheduler for both ARM64[1]
and x86[2] to support load balancing between clusters, which brings
more memory bandwidth and decreases cache contention. This patchset,
on the other hand, takes care of the wake-up path by giving CPUs
within the same cluster a try before scanning the whole LLC, to
benefit tasks that communicate with each other.

[1] 778c558f49a2 ("sched: Add cluster scheduler level in core and related Kconfig for ARM64")
[2] 66558b730f25 ("sched: Add cluster scheduler level for x86")
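
To illustrate the intended scan order, consider the following small
standalone userspace sketch (the topology constants are hypothetical,
e.g. 4 CPUs per cluster as on Kunpeng920; this is not kernel code):

	#include <stdio.h>

	#define CPUS_PER_CLUSTER	4	/* hypothetical cluster size */
	#define LLC_CPUS		16	/* hypothetical LLC span */

	/* Print the order in which CPUs would be tried for an idle one. */
	static void scan_order(int target)
	{
		int first = target / CPUS_PER_CLUSTER * CPUS_PER_CLUSTER;
		int cpu;

		/* 1. Try the target's cluster first ... */
		for (cpu = first; cpu < first + CPUS_PER_CLUSTER; cpu++)
			printf("%d ", cpu);

		/* 2. ... then fall back to the rest of the LLC. */
		for (cpu = 0; cpu < LLC_CPUS; cpu++)
			if (cpu < first || cpu >= first + CPUS_PER_CLUSTER)
				printf("%d ", cpu);
		printf("\n");
	}

	int main(void)
	{
		scan_order(5);	/* prints: 4 5 6 7 0 1 2 3 8 ... 15 */
		return 0;
	}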

Changes since v2:
- leverage SIS_PROP to suspend redundant scanning when LLC is overloaded
- remove the ping-pong suppression
- address the comment from Tim, thanks.
Link: https://lore.kernel.org/lkml/20220126080947.4529-1-yangyicong@hisilicon.com/

Changes since v1:
- re-collect the performance data based on v5.17-rc1
- rename cpus_share_cluster to cpus_share_resources per Vincent and Gautham, thanks!
Link: https://lore.kernel.org/lkml/20211215041149.73171-1-yangyicong@hisilicon.com/

Barry Song (2):
  sched: Add per_cpu cluster domain info and cpus_share_resources API
  sched/fair: Scan cluster before scanning LLC in wake-up path

 include/linux/sched/sd_flags.h |  7 ++++++
 include/linux/sched/topology.h |  8 ++++++-
 kernel/sched/core.c            | 12 ++++++++++
 kernel/sched/fair.c            | 43 +++++++++++++++++++++++++++++++---
 kernel/sched/sched.h           |  2 ++
 kernel/sched/topology.c        | 15 ++++++++++++
 6 files changed, 83 insertions(+), 4 deletions(-)

-- 
2.24.0



* [PATCH v3 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API
  2022-06-08  9:57 [PATCH v3 0/2] sched/fair: Wake task within the cluster when possible Yicong Yang
@ 2022-06-08  9:57 ` Yicong Yang
  2022-06-15 13:29   ` K Prateek Nayak
  2022-06-08  9:57 ` [PATCH v3 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
From: Yicong Yang @ 2022-06-08  9:57 UTC
  To: peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen,
	gautham.shenoy, linux-kernel, linux-arm-kernel
  Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
	yangyicong, jonathan.cameron, ego, srikar, linuxarm, 21cnbao,
	guodong.xu, hesham.almatary, john.garry, shenyang39

From: Barry Song <song.bao.hua@hisilicon.com>

Add per-CPU cluster domain info and a cpus_share_resources() API.
This is preparation for optimizing select_idle_cpu() on platforms
with a cluster scheduler level.
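
As an illustration (a sketch of the intended use, mirroring the caller
added in patch 2, not part of this patch), the API is meant as a
drop-in for cpus_share_cache() in the wake-up path:

	/*
	 * Prefer prev if it shares the closest resources with target
	 * (the cluster on cluster machines, the LLC otherwise) and is
	 * idle.
	 */
	if (prev != target && cpus_share_resources(prev, target) &&
	    available_idle_cpu(prev))
		return prev;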

Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
---
 include/linux/sched/sd_flags.h |  7 +++++++
 include/linux/sched/topology.h |  8 +++++++-
 kernel/sched/core.c            | 12 ++++++++++++
 kernel/sched/sched.h           |  2 ++
 kernel/sched/topology.c        | 15 +++++++++++++++
 5 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
index 57bde66d95f7..42ed454e8b18 100644
--- a/include/linux/sched/sd_flags.h
+++ b/include/linux/sched/sd_flags.h
@@ -109,6 +109,13 @@ SD_FLAG(SD_ASYM_CPUCAPACITY_FULL, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
  */
 SD_FLAG(SD_SHARE_CPUCAPACITY, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
 
+/*
+ * Domain members share CPU cluster (LLC tags or L2 cache)
+ *
+ * NEEDS_GROUPS: Clusters are shared between groups.
+ */
+SD_FLAG(SD_CLUSTER, SDF_NEEDS_GROUPS)
+
 /*
  * Domain members share CPU package resources (i.e. caches)
  *
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 56cffe42abbc..df489a1db6b7 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -45,7 +45,7 @@ static inline int cpu_smt_flags(void)
 #ifdef CONFIG_SCHED_CLUSTER
 static inline int cpu_cluster_flags(void)
 {
-	return SD_SHARE_PKG_RESOURCES;
+	return SD_CLUSTER | SD_SHARE_PKG_RESOURCES;
 }
 #endif
 
@@ -178,6 +178,7 @@ cpumask_var_t *alloc_sched_domains(unsigned int ndoms);
 void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
 
 bool cpus_share_cache(int this_cpu, int that_cpu);
+bool cpus_share_resources(int this_cpu, int that_cpu);
 
 typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
 typedef int (*sched_domain_flags_f)(void);
@@ -231,6 +232,11 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu)
 	return true;
 }
 
+static inline bool cpus_share_resources(int this_cpu, int that_cpu)
+{
+	return true;
+}
+
 #endif	/* !CONFIG_SMP */
 
 #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bfa7452ca92e..79a6f012b0cf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3808,6 +3808,18 @@ bool cpus_share_cache(int this_cpu, int that_cpu)
 	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
 }
 
+/*
+ * Whether CPUs share cache resources, i.e. the LLC on non-cluster
+ * machines, and the LLC tag or L2 cache on machines with clusters.
+ */
+bool cpus_share_resources(int this_cpu, int that_cpu)
+{
+	if (this_cpu == that_cpu)
+		return true;
+
+	return per_cpu(sd_share_id, this_cpu) == per_cpu(sd_share_id, that_cpu);
+}
+
 static inline bool ttwu_queue_cond(int cpu, int wake_flags)
 {
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 01259611beb9..b9bcfcf8d14d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1753,7 +1753,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
+DECLARE_PER_CPU(int, sd_share_id);
 DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 05b6c2ad90b9..0595827d481d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -664,6 +664,8 @@ static void destroy_sched_domains(struct sched_domain *sd)
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
+DEFINE_PER_CPU(int, sd_share_id);
+DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
 DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
@@ -689,6 +691,18 @@ static void update_top_cache_domain(int cpu)
 	per_cpu(sd_llc_id, cpu) = id;
 	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
 
+	sd = lowest_flag_domain(cpu, SD_CLUSTER);
+	if (sd)
+		id = cpumask_first(sched_domain_span(sd));
+	rcu_assign_pointer(per_cpu(sd_cluster, cpu), sd);
+
+	/*
+	 * This assignment should be placed after sd_llc_id as
+	 * we want this id to equal the cluster id on cluster
+	 * machines but the LLC id on non-cluster machines.
+	 */
+	per_cpu(sd_share_id, cpu) = id;
+
 	sd = lowest_flag_domain(cpu, SD_NUMA);
 	rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
 
@@ -1532,6 +1546,7 @@ static struct cpumask		***sched_domains_numa_masks;
  */
 #define TOPOLOGY_SD_FLAGS		\
 	(SD_SHARE_CPUCAPACITY	|	\
+	 SD_CLUSTER		|	\
 	 SD_SHARE_PKG_RESOURCES |	\
 	 SD_NUMA		|	\
 	 SD_ASYM_PACKING)
-- 
2.24.0



* [PATCH v3 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-06-08  9:57 [PATCH v3 0/2] sched/fair: Wake task within the cluster when possible Yicong Yang
  2022-06-08  9:57 ` [PATCH v3 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API Yicong Yang
@ 2022-06-08  9:57 ` Yicong Yang
  2022-06-09  9:54   ` kernel test robot
  2022-06-09 10:14   ` kernel test robot
From: Yicong Yang @ 2022-06-08  9:57 UTC
  To: peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen,
	gautham.shenoy, linux-kernel, linux-arm-kernel
  Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
	yangyicong, jonathan.cameron, ego, srikar, linuxarm, 21cnbao,
	guodong.xu, hesham.almatary, john.garry, shenyang39

From: Barry Song <song.bao.hua@hisilicon.com>

For platforms with clusters like Kunpeng920, CPUs within the same cluster
have lower latency when synchronizing and accessing shared resources like
cache. Thus, this patch tries to find an idle CPU within the cluster of the
target CPU before scanning the whole LLC, to gain lower latency.

Note that neither Kunpeng920 nor x86 Jacobsville supports SMT, so this
patch doesn't consider SMT for the moment.

Testing has been done on Kunpeng920 by pinning tasks to one NUMA node and
to two NUMA nodes. On Kunpeng920, each NUMA node has 8 clusters and each
cluster has 4 CPUs.

With this patch, we noticed an improvement in tbench both within one NUMA
node and across two NUMA nodes.

On NUMA node 0:
                            5.19-rc1                patched
Hmean     1        350.27 (   0.00%)      406.88 *  16.16%*
Hmean     2        702.01 (   0.00%)      808.22 *  15.13%*
Hmean     4       1405.14 (   0.00%)     1614.34 *  14.89%*
Hmean     8       2830.53 (   0.00%)     3169.02 *  11.96%*
Hmean     16      5597.95 (   0.00%)     6224.20 *  11.19%*
Hmean     32     10537.38 (   0.00%)    10524.97 *  -0.12%*
Hmean     64      8366.04 (   0.00%)     8437.41 *   0.85%*
Hmean     128     7060.87 (   0.00%)     7150.25 *   1.27%*

On NUMA nodes 0-1:
                            5.19-rc1                patched
Hmean     1        346.11 (   0.00%)      408.47 *  18.02%*
Hmean     2        693.34 (   0.00%)      805.78 *  16.22%*
Hmean     4       1384.96 (   0.00%)     1602.49 *  15.71%*
Hmean     8       2699.45 (   0.00%)     3069.98 *  13.73%*
Hmean     16      5327.11 (   0.00%)     5688.19 *   6.78%*
Hmean     32     10019.10 (   0.00%)    11862.56 *  18.40%*
Hmean     64     13850.57 (   0.00%)    17748.54 *  28.14%*
Hmean     128    12498.25 (   0.00%)    15541.59 *  24.35%*
Hmean     256    11195.77 (   0.00%)    13854.06 *  23.74%*

Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
---
 kernel/sched/fair.c | 43 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 40 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 77b2048a9326..f0496b93449c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6327,6 +6327,39 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 
 #endif /* CONFIG_SCHED_SMT */
 
+#ifdef CONFIG_SCHED_CLUSTER
+/*
+ * Scan the cluster domain for idle CPUs and clear the cluster cpumask after scanning
+ */
+static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus,
+			       int target, int *nr)
+{
+	struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target));
+	int cpu, idle_cpu;
+
+	/* TODO: Support SMT systems with cluster topology */
+	if (!sched_smt_active() && sd) {
+		for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) {
+			if (!--*nr)
+				break;
+
+			idle_cpu = __select_idle_cpu(cpu, p);
+			if ((unsigned int)idle_cpu < nr_cpumask_bits)
+				return idle_cpu;
+		}
+
+		cpumask_andnot(cpus, cpus, sched_domain_span(sd));
+	}
+
+	return -1;
+}
+#else
+static inline int scan_cluster(struct task_struct *p, int prev_cpu, int target, int *nr)
+{
+	return -1;
+}
+#endif
+
 /*
  * Scan the LLC domain for idle CPUs; this is dynamically regulated by
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
@@ -6375,6 +6408,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		time = cpu_clock(this);
 	}
 
+	idle_cpu = scan_cluster(p, cpus, target, &nr);
+	if ((unsigned int)idle_cpu < nr_cpumask_bits)
+		return idle_cpu;
+
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
@@ -6382,7 +6419,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 				return i;
 
 		} else {
-			if (!--nr)
+			if (--nr <= 0)
 				return -1;
 			idle_cpu = __select_idle_cpu(cpu, p);
 			if ((unsigned int)idle_cpu < nr_cpumask_bits)
@@ -6481,7 +6518,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	/*
 	 * If the previous CPU is cache affine and idle, don't be stupid:
 	 */
-	if (prev != target && cpus_share_cache(prev, target) &&
+	if (prev != target && cpus_share_resources(prev, target) &&
 	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
 	    asym_fits_capacity(task_util, prev))
 		return prev;
@@ -6507,7 +6544,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	p->recent_used_cpu = prev;
 	if (recent_used_cpu != prev &&
 	    recent_used_cpu != target &&
-	    cpus_share_cache(recent_used_cpu, target) &&
+	    cpus_share_resources(recent_used_cpu, target) &&
 	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
 	    cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
 	    asym_fits_capacity(task_util, recent_used_cpu)) {
-- 
2.24.0



* Re: [PATCH v3 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-06-08  9:57 ` [PATCH v3 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
@ 2022-06-09  9:54   ` kernel test robot
  2022-06-09 10:14   ` kernel test robot
From: kernel test robot @ 2022-06-09  9:54 UTC
  To: Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot,
	tim.c.chen, gautham.shenoy, linux-kernel, linux-arm-kernel
  Cc: llvm, kbuild-all, dietmar.eggemann, rostedt, bsegall, bristot,
	prime.zeng, yangyicong, jonathan.cameron, ego, srikar, linuxarm,
	21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39

Hi Yicong,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tip/sched/core]
[also build test WARNING on linus/master v5.19-rc1 next-20220609]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Yicong-Yang/sched-fair-Wake-task-within-the-cluster-when-possible/20220608-181847
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 991d8d8142cad94f9c5c05db25e67fa83d6f772a
config: x86_64-randconfig-a005 (https://download.01.org/0day-ci/archive/20220609/202206091721.rhB7mm5c-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project b92436efcb7813fc481b30f2593a4907568d917a)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/f2b15e8641f351783c1d47bc654ace164300b7f1
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Yicong-Yang/sched-fair-Wake-task-within-the-cluster-when-possible/20220608-181847
        git checkout f2b15e8641f351783c1d47bc654ace164300b7f1
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash kernel/sched/

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   kernel/sched/fair.c:5512:6: warning: no previous prototype for function 'init_cfs_bandwidth' [-Wmissing-prototypes]
   void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
        ^
   kernel/sched/fair.c:5512:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
   ^
   static 
>> kernel/sched/fair.c:6381:29: warning: incompatible pointer to integer conversion passing 'struct cpumask *' to parameter of type 'int' [-Wint-conversion]
           idle_cpu = scan_cluster(p, cpus, target, &nr);
                                      ^~~~
   kernel/sched/fair.c:6327:59: note: passing argument to parameter 'prev_cpu' here
   static inline int scan_cluster(struct task_struct *p, int prev_cpu, int target, int *nr)
                                                             ^
   kernel/sched/fair.c:11734:6: warning: no previous prototype for function 'free_fair_sched_group' [-Wmissing-prototypes]
   void free_fair_sched_group(struct task_group *tg) { }
        ^
   kernel/sched/fair.c:11734:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   void free_fair_sched_group(struct task_group *tg) { }
   ^
   static 
   kernel/sched/fair.c:11736:5: warning: no previous prototype for function 'alloc_fair_sched_group' [-Wmissing-prototypes]
   int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
       ^
   kernel/sched/fair.c:11736:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
   ^
   static 
   kernel/sched/fair.c:11741:6: warning: no previous prototype for function 'online_fair_sched_group' [-Wmissing-prototypes]
   void online_fair_sched_group(struct task_group *tg) { }
        ^
   kernel/sched/fair.c:11741:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   void online_fair_sched_group(struct task_group *tg) { }
   ^
   static 
   kernel/sched/fair.c:11743:6: warning: no previous prototype for function 'unregister_fair_sched_group' [-Wmissing-prototypes]
   void unregister_fair_sched_group(struct task_group *tg) { }
        ^
   kernel/sched/fair.c:11743:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   void unregister_fair_sched_group(struct task_group *tg) { }
   ^
   static 
   kernel/sched/fair.c:489:20: warning: unused function 'list_del_leaf_cfs_rq' [-Wunused-function]
   static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
                      ^
   kernel/sched/fair.c:510:19: warning: unused function 'tg_is_idle' [-Wunused-function]
   static inline int tg_is_idle(struct task_group *tg)
                     ^
   kernel/sched/fair.c:5493:20: warning: unused function 'sync_throttle' [-Wunused-function]
   static inline void sync_throttle(struct task_group *tg, int cpu) {}
                      ^
   kernel/sched/fair.c:5518:37: warning: unused function 'tg_cfs_bandwidth' [-Wunused-function]
   static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
                                       ^
   kernel/sched/fair.c:5522:20: warning: unused function 'destroy_cfs_bandwidth' [-Wunused-function]
   static inline void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
                      ^
   11 warnings generated.


vim +6381 kernel/sched/fair.c

  6332	
  6333	/*
  6334	 * Scan the LLC domain for idle CPUs; this is dynamically regulated by
  6335	 * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
  6336	 * average idle time for this rq (as found in rq->avg_idle).
  6337	 */
  6338	static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
  6339	{
  6340		struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
  6341		int i, cpu, idle_cpu = -1, nr = INT_MAX;
  6342		struct rq *this_rq = this_rq();
  6343		int this = smp_processor_id();
  6344		struct sched_domain *this_sd;
  6345		u64 time = 0;
  6346	
  6347		this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
  6348		if (!this_sd)
  6349			return -1;
  6350	
  6351		cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
  6352	
  6353		if (sched_feat(SIS_PROP) && !has_idle_core) {
  6354			u64 avg_cost, avg_idle, span_avg;
  6355			unsigned long now = jiffies;
  6356	
  6357			/*
  6358			 * If we're busy, the assumption that the last idle period
  6359			 * predicts the future is flawed; age away the remaining
  6360			 * predicted idle time.
  6361			 */
  6362			if (unlikely(this_rq->wake_stamp < now)) {
  6363				while (this_rq->wake_stamp < now && this_rq->wake_avg_idle) {
  6364					this_rq->wake_stamp++;
  6365					this_rq->wake_avg_idle >>= 1;
  6366				}
  6367			}
  6368	
  6369			avg_idle = this_rq->wake_avg_idle;
  6370			avg_cost = this_sd->avg_scan_cost + 1;
  6371	
  6372			span_avg = sd->span_weight * avg_idle;
  6373			if (span_avg > 4*avg_cost)
  6374				nr = div_u64(span_avg, avg_cost);
  6375			else
  6376				nr = 4;
  6377	
  6378			time = cpu_clock(this);
  6379		}
  6380	
> 6381		idle_cpu = scan_cluster(p, cpus, target, &nr);
  6382		if ((unsigned int)idle_cpu < nr_cpumask_bits)
  6383			return idle_cpu;
  6384	
  6385		for_each_cpu_wrap(cpu, cpus, target + 1) {
  6386			if (has_idle_core) {
  6387				i = select_idle_core(p, cpu, cpus, &idle_cpu);
  6388				if ((unsigned int)i < nr_cpumask_bits)
  6389					return i;
  6390	
  6391			} else {
  6392				if (--nr <= 0)
  6393					return -1;
  6394				idle_cpu = __select_idle_cpu(cpu, p);
  6395				if ((unsigned int)idle_cpu < nr_cpumask_bits)
  6396					break;
  6397			}
  6398		}
  6399	
  6400		if (has_idle_core)
  6401			set_idle_cores(target, false);
  6402	
  6403		if (sched_feat(SIS_PROP) && !has_idle_core) {
  6404			time = cpu_clock(this) - time;
  6405	
  6406			/*
  6407			 * Account for the scan cost of wakeups against the average
  6408			 * idle time.
  6409			 */
  6410			this_rq->wake_avg_idle -= min(this_rq->wake_avg_idle, time);
  6411	
  6412			update_avg(&this_sd->avg_scan_cost, time);
  6413		}
  6414	
  6415		return idle_cpu;
  6416	}
  6417	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp


* Re: [PATCH v3 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-06-08  9:57 ` [PATCH v3 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
  2022-06-09  9:54   ` kernel test robot
@ 2022-06-09 10:14   ` kernel test robot
  2022-06-09 11:25     ` Yicong Yang
From: kernel test robot @ 2022-06-09 10:14 UTC
  To: Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot,
	tim.c.chen, gautham.shenoy, linux-kernel, linux-arm-kernel
  Cc: kbuild-all, dietmar.eggemann, rostedt, bsegall, bristot,
	prime.zeng, yangyicong, jonathan.cameron, ego, srikar, linuxarm,
	21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39

Hi Yicong,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tip/sched/core]
[also build test WARNING on linus/master v5.19-rc1 next-20220609]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Yicong-Yang/sched-fair-Wake-task-within-the-cluster-when-possible/20220608-181847
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 991d8d8142cad94f9c5c05db25e67fa83d6f772a
config: x86_64-randconfig-a006 (https://download.01.org/0day-ci/archive/20220609/202206091846.fm1bYjWk-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-3) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/f2b15e8641f351783c1d47bc654ace164300b7f1
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Yicong-Yang/sched-fair-Wake-task-within-the-cluster-when-possible/20220608-181847
        git checkout f2b15e8641f351783c1d47bc654ace164300b7f1
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash kernel/sched/

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   kernel/sched/fair.c: In function 'select_idle_cpu':
>> kernel/sched/fair.c:6381:36: warning: passing argument 2 of 'scan_cluster' makes integer from pointer without a cast [-Wint-conversion]
    6381 |         idle_cpu = scan_cluster(p, cpus, target, &nr);
         |                                    ^~~~
         |                                    |
         |                                    struct cpumask *
   kernel/sched/fair.c:6327:59: note: expected 'int' but argument is of type 'struct cpumask *'
    6327 | static inline int scan_cluster(struct task_struct *p, int prev_cpu, int target, int *nr)
         |                                                       ~~~~^~~~~~~~
   kernel/sched/fair.c: At top level:
   kernel/sched/fair.c:11114:6: warning: no previous prototype for 'task_vruntime_update' [-Wmissing-prototypes]
   11114 | void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi)
         |      ^~~~~~~~~~~~~~~~~~~~


vim +/scan_cluster +6381 kernel/sched/fair.c

  6332	
  6333	/*
  6334	 * Scan the LLC domain for idle CPUs; this is dynamically regulated by
  6335	 * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
  6336	 * average idle time for this rq (as found in rq->avg_idle).
  6337	 */
  6338	static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
  6339	{
  6340		struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
  6341		int i, cpu, idle_cpu = -1, nr = INT_MAX;
  6342		struct rq *this_rq = this_rq();
  6343		int this = smp_processor_id();
  6344		struct sched_domain *this_sd;
  6345		u64 time = 0;
  6346	
  6347		this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
  6348		if (!this_sd)
  6349			return -1;
  6350	
  6351		cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
  6352	
  6353		if (sched_feat(SIS_PROP) && !has_idle_core) {
  6354			u64 avg_cost, avg_idle, span_avg;
  6355			unsigned long now = jiffies;
  6356	
  6357			/*
  6358			 * If we're busy, the assumption that the last idle period
  6359			 * predicts the future is flawed; age away the remaining
  6360			 * predicted idle time.
  6361			 */
  6362			if (unlikely(this_rq->wake_stamp < now)) {
  6363				while (this_rq->wake_stamp < now && this_rq->wake_avg_idle) {
  6364					this_rq->wake_stamp++;
  6365					this_rq->wake_avg_idle >>= 1;
  6366				}
  6367			}
  6368	
  6369			avg_idle = this_rq->wake_avg_idle;
  6370			avg_cost = this_sd->avg_scan_cost + 1;
  6371	
  6372			span_avg = sd->span_weight * avg_idle;
  6373			if (span_avg > 4*avg_cost)
  6374				nr = div_u64(span_avg, avg_cost);
  6375			else
  6376				nr = 4;
  6377	
  6378			time = cpu_clock(this);
  6379		}
  6380	
> 6381		idle_cpu = scan_cluster(p, cpus, target, &nr);
  6382		if ((unsigned int)idle_cpu < nr_cpumask_bits)
  6383			return idle_cpu;
  6384	
  6385		for_each_cpu_wrap(cpu, cpus, target + 1) {
  6386			if (has_idle_core) {
  6387				i = select_idle_core(p, cpu, cpus, &idle_cpu);
  6388				if ((unsigned int)i < nr_cpumask_bits)
  6389					return i;
  6390	
  6391			} else {
  6392				if (--nr <= 0)
  6393					return -1;
  6394				idle_cpu = __select_idle_cpu(cpu, p);
  6395				if ((unsigned int)idle_cpu < nr_cpumask_bits)
  6396					break;
  6397			}
  6398		}
  6399	
  6400		if (has_idle_core)
  6401			set_idle_cores(target, false);
  6402	
  6403		if (sched_feat(SIS_PROP) && !has_idle_core) {
  6404			time = cpu_clock(this) - time;
  6405	
  6406			/*
  6407			 * Account for the scan cost of wakeups against the average
  6408			 * idle time.
  6409			 */
  6410			this_rq->wake_avg_idle -= min(this_rq->wake_avg_idle, time);
  6411	
  6412			update_avg(&this_sd->avg_scan_cost, time);
  6413		}
  6414	
  6415		return idle_cpu;
  6416	}
  6417	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp


* Re: [PATCH v3 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-06-09 10:14   ` kernel test robot
@ 2022-06-09 11:25     ` Yicong Yang
From: Yicong Yang @ 2022-06-09 11:25 UTC
  To: kernel test robot, Yicong Yang, peterz, mingo, juri.lelli,
	vincent.guittot, tim.c.chen, gautham.shenoy, linux-kernel,
	linux-arm-kernel
  Cc: kbuild-all, dietmar.eggemann, rostedt, bsegall, bristot,
	prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao,
	guodong.xu, hesham.almatary, john.garry, shenyang39

On 2022/6/9 18:14, kernel test robot wrote:
> Hi Yicong,
> 
> Thank you for the patch! Perhaps something to improve:
> 
> [auto build test WARNING on tip/sched/core]
> [also build test WARNING on linus/master v5.19-rc1 next-20220609]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Yicong-Yang/sched-fair-Wake-task-within-the-cluster-when-possible/20220608-181847
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 991d8d8142cad94f9c5c05db25e67fa83d6f772a
> config: x86_64-randconfig-a006 (https://download.01.org/0day-ci/archive/20220609/202206091846.fm1bYjWk-lkp@intel.com/config)
> compiler: gcc-11 (Debian 11.3.0-3) 11.3.0
> reproduce (this is a W=1 build):
>         # https://github.com/intel-lab-lkp/linux/commit/f2b15e8641f351783c1d47bc654ace164300b7f1
>         git remote add linux-review https://github.com/intel-lab-lkp/linux
>         git fetch --no-tags linux-review Yicong-Yang/sched-fair-Wake-task-within-the-cluster-when-possible/20220608-181847
>         git checkout f2b15e8641f351783c1d47bc654ace164300b7f1
>         # save the config file
>         mkdir build_dir && cp config build_dir/.config
>         make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash kernel/sched/
> 
> If you fix the issue, kindly add following tag where applicable
> Reported-by: kernel test robot <lkp@intel.com>
> 
> All warnings (new ones prefixed by >>):
> 
>    kernel/sched/fair.c: In function 'select_idle_cpu':
>>> kernel/sched/fair.c:6381:36: warning: passing argument 2 of 'scan_cluster' makes integer from pointer without a cast [-Wint-conversion]

I didn't change the scan_cluster() stub correspondingly, which leads to this warning. Thanks for catching this; it will be fixed in v4.
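
The !CONFIG_SCHED_CLUSTER stub needs to take the same arguments as the
real function, i.e. something like:

	static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus,
				       int target, int *nr)
	{
		return -1;
	}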

>     6381 |         idle_cpu = scan_cluster(p, cpus, target, &nr);
>          |                                    ^~~~
>          |                                    |
>          |                                    struct cpumask *
>    kernel/sched/fair.c:6327:59: note: expected 'int' but argument is of type 'struct cpumask *'
>     6327 | static inline int scan_cluster(struct task_struct *p, int prev_cpu, int target, int *nr)
>          |                                                       ~~~~^~~~~~~~
>    kernel/sched/fair.c: At top level:
>    kernel/sched/fair.c:11114:6: warning: no previous prototype for 'task_vruntime_update' [-Wmissing-prototypes]
>    11114 | void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi)
>          |      ^~~~~~~~~~~~~~~~~~~~
> 
> 
> vim +/scan_cluster +6381 kernel/sched/fair.c
> 
>   6332	
>   6333	/*
>   6334	 * Scan the LLC domain for idle CPUs; this is dynamically regulated by
>   6335	 * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
>   6336	 * average idle time for this rq (as found in rq->avg_idle).
>   6337	 */
>   6338	static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
>   6339	{
>   6340		struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>   6341		int i, cpu, idle_cpu = -1, nr = INT_MAX;
>   6342		struct rq *this_rq = this_rq();
>   6343		int this = smp_processor_id();
>   6344		struct sched_domain *this_sd;
>   6345		u64 time = 0;
>   6346	
>   6347		this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>   6348		if (!this_sd)
>   6349			return -1;
>   6350	
>   6351		cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>   6352	
>   6353		if (sched_feat(SIS_PROP) && !has_idle_core) {
>   6354			u64 avg_cost, avg_idle, span_avg;
>   6355			unsigned long now = jiffies;
>   6356	
>   6357			/*
>   6358			 * If we're busy, the assumption that the last idle period
>   6359			 * predicts the future is flawed; age away the remaining
>   6360			 * predicted idle time.
>   6361			 */
>   6362			if (unlikely(this_rq->wake_stamp < now)) {
>   6363				while (this_rq->wake_stamp < now && this_rq->wake_avg_idle) {
>   6364					this_rq->wake_stamp++;
>   6365					this_rq->wake_avg_idle >>= 1;
>   6366				}
>   6367			}
>   6368	
>   6369			avg_idle = this_rq->wake_avg_idle;
>   6370			avg_cost = this_sd->avg_scan_cost + 1;
>   6371	
>   6372			span_avg = sd->span_weight * avg_idle;
>   6373			if (span_avg > 4*avg_cost)
>   6374				nr = div_u64(span_avg, avg_cost);
>   6375			else
>   6376				nr = 4;
>   6377	
>   6378			time = cpu_clock(this);
>   6379		}
>   6380	
>> 6381		idle_cpu = scan_cluster(p, cpus, target, &nr);
>   6382		if ((unsigned int)idle_cpu < nr_cpumask_bits)
>   6383			return idle_cpu;
>   6384	
>   6385		for_each_cpu_wrap(cpu, cpus, target + 1) {
>   6386			if (has_idle_core) {
>   6387				i = select_idle_core(p, cpu, cpus, &idle_cpu);
>   6388				if ((unsigned int)i < nr_cpumask_bits)
>   6389					return i;
>   6390	
>   6391			} else {
>   6392				if (--nr <= 0)
>   6393					return -1;
>   6394				idle_cpu = __select_idle_cpu(cpu, p);
>   6395				if ((unsigned int)idle_cpu < nr_cpumask_bits)
>   6396					break;
>   6397			}
>   6398		}
>   6399	
>   6400		if (has_idle_core)
>   6401			set_idle_cores(target, false);
>   6402	
>   6403		if (sched_feat(SIS_PROP) && !has_idle_core) {
>   6404			time = cpu_clock(this) - time;
>   6405	
>   6406			/*
>   6407			 * Account for the scan cost of wakeups against the average
>   6408			 * idle time.
>   6409			 */
>   6410			this_rq->wake_avg_idle -= min(this_rq->wake_avg_idle, time);
>   6411	
>   6412			update_avg(&this_sd->avg_scan_cost, time);
>   6413		}
>   6414	
>   6415		return idle_cpu;
>   6416	}
>   6417	
> 


* Re: [PATCH v3 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API
  2022-06-08  9:57 ` [PATCH v3 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API Yicong Yang
@ 2022-06-15 13:29   ` K Prateek Nayak
From: K Prateek Nayak @ 2022-06-15 13:29 UTC
  To: Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot,
	tim.c.chen, gautham.shenoy, linux-kernel, linux-arm-kernel
  Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
	jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu,
	hesham.almatary, john.garry, shenyang39

Hello Yicong,

We are observing a serious regression in tbench with this patch
series applied. The issue doesn't seem to be related to the actual
functionality of the patch, but to how the patch changes the per-CPU
variable layout.
Discussed below are the results from running tbench on a dual
socket Zen3 (2 x 64C/128T) configured in different NPS modes.

NPS (Nodes Per Socket) modes are used to logically divide a single
socket into multiple NUMA regions.
Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
    Total 2 NUMA nodes in the dual-socket machine.

    Node 0: 0-63,   128-191
    Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
    Total 4 NUMA nodes exist across the 2 sockets.
   
    Node 0: 0-31,   128-159
    Node 1: 32-63,  160-191
    Node 2: 64-95,  192-223
    Node 3: 96-127, 224-255

NPS4: Each socket is logically divided into 4 NUMA regions.
    Total 8 NUMA nodes exist across the 2 sockets.
   
    Node 0: 0-15,    128-143
    Node 1: 16-31,   144-159
    Node 2: 32-47,   160-175
    Node 3: 48-63,   176-191
    Node 4: 64-79,   192-207
    Node 5: 80-95,   208-223
    Node 6: 96-111,  224-239
    Node 7: 112-127, 240-255

Benchmark Results:

Kernel versions:
- tip:      5.19.0-rc2 tip sched/core
- cluster:  5.19.0-rc2 tip sched/core + both the patches of the series

When we started testing, the tip was at:
commit: f3dd3f674555 "sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle"

* - Data points of concern

~~~~~~
tbench
~~~~~~

NPS1

Clients:       tip                     cluster
    1    444.41 (0.00 pct)       439.27 (-1.15 pct)
    2    879.23 (0.00 pct)       831.49 (-5.42 pct)     *
    4    1648.83 (0.00 pct)      1608.07 (-2.47 pct)
    8    3263.81 (0.00 pct)      3086.81 (-5.42 pct)    *
   16    6011.19 (0.00 pct)      5360.28 (-10.82 pct)   *
   32    12058.31 (0.00 pct)     8769.08 (-27.27 pct)   *
   64    21258.21 (0.00 pct)     19021.09 (-10.52 pct)  *
  128    30795.27 (0.00 pct)     30861.34 (0.21 pct)
  256    25138.43 (0.00 pct)     24711.90 (-1.69 pct)
  512    51287.93 (0.00 pct)     51855.55 (1.10 pct)
 1024    53176.97 (0.00 pct)     52554.55 (-1.17 pct)

NPS2

Clients:       tip                     cluster
    1    445.45 (0.00 pct)       441.75 (-0.83 pct)
    2    869.24 (0.00 pct)       845.61 (-2.71 pct)
    4    1644.28 (0.00 pct)      1586.49 (-3.51 pct)
    8    3120.83 (0.00 pct)      2967.01 (-4.92 pct) 	*
   16    5972.29 (0.00 pct)      5208.58 (-12.78 pct)   *
   32    11776.38 (0.00 pct)     10229.53 (-13.13 pct)  *
   64    20933.15 (0.00 pct)     17033.45 (-18.62 pct)  *
  128    32195.00 (0.00 pct)     29507.85 (-8.34 pct)   *
  256    24641.52 (0.00 pct)     27225.00 (10.48 pct)
  512    50806.96 (0.00 pct)     51377.50 (1.12 pct)
 1024    51993.96 (0.00 pct)     50773.35 (-2.34 pct)

NPS4

Clients:      tip                   cluster
    1    442.10 (0.00 pct)       435.06 (-1.59 pct)
    2    870.94 (0.00 pct)       858.64 (-1.41 pct)
    4    1615.30 (0.00 pct)      1607.27 (-0.49 pct)
    8    3195.95 (0.00 pct)      3020.63 (-5.48 pct)    *
   16    5937.41 (0.00 pct)      5719.87 (-3.66 pct)
   32    11800.41 (0.00 pct)     11229.65 (-4.83 pct)	*
   64    20844.71 (0.00 pct)     20432.79 (-1.97 pct)
  128    31003.62 (0.00 pct)     29441.20 (-5.03 pct)   *
  256    27476.37 (0.00 pct)     25857.30 (-5.89 pct)   * [Know to have run to run variance]
  512    52276.72 (0.00 pct)     51659.16 (-1.18 pct)
 1024    51372.10 (0.00 pct)     51026.87 (-0.67 pct)

Note: tbench results for 256 workers are known to have
run-to-run variation on the test machine. Any regression
seen for that data point can be safely ignored.

The behavior is consistent for both the tip and the patched kernel
across multiple runs of tbench.

~~~~~~~~~~~~~~~~~~~~
Analysis done so far
~~~~~~~~~~~~~~~~~~~~

To root cause this issue quicker, we have focused on the 8 to 64
client data points with the machine running in NPS1 mode.

- Even with the HW prefetcher disabled, the behavior remains
  consistent, signifying the HW prefetcher is not the cause of the
  problem.

- Bisecting:

When we ran the tests with only Patch 1 of the series, the
regression was visible and the numbers were worse.

Clients:       tip                     cluster              Patch 1 Only
    8    3263.81 (0.00 pct)      3086.81 (-5.42 pct)     3018.63 (-7.51 pct)
   16    6011.19 (0.00 pct)      5360.28 (-10.82 pct)    4869.26 (-18.99 pct)
   32    12058.31 (0.00 pct)     8769.08 (-27.27 pct)    8159.60 (-32.33 pct)
   64    21258.21 (0.00 pct)     19021.09 (-10.52 pct)   13161.92 (-38.08 pct)

We further bisected the hunks to narrow down the cause to the per-CPU
variable declarations.

On 6/8/2022 3:27 PM, Yicong Yang wrote:
>
> [..snip..]
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 01259611beb9..b9bcfcf8d14d 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1753,7 +1753,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>  DECLARE_PER_CPU(int, sd_llc_size);
>  DECLARE_PER_CPU(int, sd_llc_id);
> +DECLARE_PER_CPU(int, sd_share_id);
>  DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);

The main reason for the regression seems to be the above declarations.
The regression seems to go away if we do one of the following:

- Declare sd_share_id and sd_cluster using DECLARE_PER_CPU_READ_MOSTLY()
  instead of DECLARE_PER_CPU() and change the corresponding definitions
  below to DEFINE_PER_CPU_READ_MOSTLY() (see the sketch after this list).

  Clients:       tip                     Patch 1           Patch 1 (READ_MOSTLY)
    8      3255.69 (0.00 pct)      3018.63 (-7.28 pct)     3237.33 (-0.56 pct)
   16      6092.67 (0.00 pct)      4869.26 (-20.08 pct)    5914.53 (-2.92 pct)
   32      11156.56 (0.00 pct)     8159.60 (-26.86 pct)    11536.05 (3.40 pct)
   64      21019.97 (0.00 pct)     13161.92 (-37.38 pct)   21162.33 (0.67 pct)

- Convert sd_share_id and sd_cluster to static arrays.
  
  Clients:       tip                    Patch 1            Patch 1 (Static Array)
    8      3255.69 (0.00 pct)      3018.63 (-7.28 pct)     3203.27 (-1.61 pct)
   16      6092.67 (0.00 pct)      4869.26 (-20.08 pct)    6198.35 (1.73 pct)
   32      11156.56 (0.00 pct)     8159.60 (-26.86 pct)    11385.76 (2.05 pct)
   64      21019.97 (0.00 pct)     13161.92 (-37.38 pct)   21919.80 (4.28 pct)

- Move the declarations of sd_share_id and sd_cluster to the top

  Clients:       tip                    Patch 1            Patch 1 (Declaration on Top)
    8      3255.69 (0.00 pct)      3018.63 (-7.28 pct)     3072.30 (-5.63 pct)
   16      6092.67 (0.00 pct)      4869.26 (-20.08 pct)    5586.59 (-8.30 pct)
   32      11156.56 (0.00 pct)     8159.60 (-26.86 pct)    11184.17 (0.24 pct)
   64      21019.97 (0.00 pct)     13161.92 (-37.38 pct)   20289.70 (-3.47 pct)
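
For reference, the READ_MOSTLY variant mentioned above amounts to the
following (sketch only): the READ_MOSTLY macros place the variables in
a separate per-CPU section, so they are less likely to share cache
lines with frequently written per-CPU data.

	/* kernel/sched/sched.h */
	DECLARE_PER_CPU_READ_MOSTLY(int, sd_share_id);
	DECLARE_PER_CPU_READ_MOSTLY(struct sched_domain __rcu *, sd_cluster);

	/* kernel/sched/topology.c */
	DEFINE_PER_CPU_READ_MOSTLY(int, sd_share_id);
	DEFINE_PER_CPU_READ_MOSTLY(struct sched_domain __rcu *, sd_cluster);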

Unfortunately, none of these are complete solutions. For example, using
DECLARE_PER_CPU_READ_MOSTLY() with both patches applied reduces the regression
but doesn't eliminate it entirely:

   Clients:     tip                     cluster           cluster (READ_MOSTLY)  
    1      444.41 (0.00 pct)       439.27 (-1.15 pct)      435.95 (-1.90 pct)
    2      879.23 (0.00 pct)       831.49 (-5.42 pct)      842.09 (-4.22 pct)
    4      1648.83 (0.00 pct)      1608.07 (-2.47 pct)     1598.77 (-3.03 pct)
    8      3263.81 (0.00 pct)      3086.81 (-5.42 pct)     3090.86 (-5.29 pct)	*
   16      6011.19 (0.00 pct)      5360.28 (-10.82 pct)    5360.28 (-10.82 pct)	*
   32      12058.31 (0.00 pct)     8769.08 (-27.27 pct)    11083.66 (-8.08 pct)	*
   64      21258.21 (0.00 pct)     19021.09 (-10.52 pct)   20984.30 (-1.28 pct)
  128      30795.27 (0.00 pct)     30861.34 (0.21 pct)     30735.20 (-0.19 pct)
  256      25138.43 (0.00 pct)     24711.90 (-1.69 pct)    24021.21 (-4.44 pct)
  512      51287.93 (0.00 pct)     51855.55 (1.10 pct)     51672.73 (0.75 pct)
 1024      53176.97 (0.00 pct)     52554.55 (-1.17 pct)    52620.02 (-1.04 pct)

We are still trying to root cause the underlying issue that
brought about such a drastic regression in tbench performance.

>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 05b6c2ad90b9..0595827d481d 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -664,6 +664,8 @@ static void destroy_sched_domains(struct sched_domain *sd)
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>  DEFINE_PER_CPU(int, sd_llc_size);
>  DEFINE_PER_CPU(int, sd_llc_id);
> +DEFINE_PER_CPU(int, sd_share_id);
> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
>  DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> @@ -689,6 +691,18 @@ static void update_top_cache_domain(int cpu)
>
> [..snip..]
>

We would like some time to investigate this issue and root cause
this regression.
--
Thanks and Regards,
Prateek

