* [PATCH v3 0/2] sched/fair: Wake task within the cluster when possible
@ 2022-06-08 9:57 Yicong Yang
2022-06-08 9:57 ` [PATCH v3 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API Yicong Yang
2022-06-08 9:57 ` [PATCH v3 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
0 siblings, 2 replies; 7+ messages in thread
From: Yicong Yang @ 2022-06-08 9:57 UTC (permalink / raw)
To: peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen,
gautham.shenoy, linux-kernel, linux-arm-kernel
Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
yangyicong, jonathan.cameron, ego, srikar, linuxarm, 21cnbao,
guodong.xu, hesham.almatary, john.garry, shenyang39
This is follow-up work to support the cluster scheduler. We previously
added a cluster level to the scheduler for both ARM64 [1] and x86 [2] to
support load balancing between clusters, increasing memory bandwidth and
reducing cache contention. This patchset, in turn, takes care of the
wake-up path by giving CPUs within the same cluster a try before scanning
the whole LLC, to benefit tasks that communicate with each other.
[1] 778c558f49a2 ("sched: Add cluster scheduler level in core and related Kconfig for ARM64")
[2] 66558b730f25 ("sched: Add cluster scheduler level for x86")
Changes since v2:
- leverage SIS_PROP to suspend redundant scanning when the LLC is overloaded
- remove the ping-pong suppression
- address the comments from Tim, thanks!
Link: https://lore.kernel.org/lkml/20220126080947.4529-1-yangyicong@hisilicon.com/
Changes since v1:
- re-collect the performance data based on v5.17-rc1
- rename cpus_share_cluster to cpus_share_resources per Vincent and Gautham, thanks!
Link: https://lore.kernel.org/lkml/20211215041149.73171-1-yangyicong@hisilicon.com/
Barry Song (2):
sched: Add per_cpu cluster domain info and cpus_share_resources API
sched/fair: Scan cluster before scanning LLC in wake-up path
include/linux/sched/sd_flags.h | 7 ++++++
include/linux/sched/topology.h | 8 ++++++-
kernel/sched/core.c | 12 ++++++++++
kernel/sched/fair.c | 43 +++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 2 ++
kernel/sched/topology.c | 15 ++++++++++++
6 files changed, 83 insertions(+), 4 deletions(-)
--
2.24.0
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH v3 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API
2022-06-08 9:57 [PATCH v3 0/2] sched/fair: Wake task within the cluster when possible Yicong Yang
@ 2022-06-08 9:57 ` Yicong Yang
2022-06-15 13:29 ` K Prateek Nayak
2022-06-08 9:57 ` [PATCH v3 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
1 sibling, 1 reply; 7+ messages in thread
From: Yicong Yang @ 2022-06-08 9:57 UTC (permalink / raw)
To: peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen,
gautham.shenoy, linux-kernel, linux-arm-kernel
Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
yangyicong, jonathan.cameron, ego, srikar, linuxarm, 21cnbao,
guodong.xu, hesham.almatary, john.garry, shenyang39
From: Barry Song <song.bao.hua@hisilicon.com>
Add per-CPU cluster domain info and a cpus_share_resources() API.
This is preparation for optimizing select_idle_cpu() on platforms
with a cluster scheduler level.
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
---
include/linux/sched/sd_flags.h | 7 +++++++
include/linux/sched/topology.h | 8 +++++++-
kernel/sched/core.c | 12 ++++++++++++
kernel/sched/sched.h | 2 ++
kernel/sched/topology.c | 15 +++++++++++++++
5 files changed, 43 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
index 57bde66d95f7..42ed454e8b18 100644
--- a/include/linux/sched/sd_flags.h
+++ b/include/linux/sched/sd_flags.h
@@ -109,6 +109,13 @@ SD_FLAG(SD_ASYM_CPUCAPACITY_FULL, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
*/
SD_FLAG(SD_SHARE_CPUCAPACITY, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
+/*
+ * Domain members share CPU cluster (LLC tags or L2 cache)
+ *
+ * NEEDS_GROUPS: Clusters are shared between groups.
+ */
+SD_FLAG(SD_CLUSTER, SDF_NEEDS_GROUPS)
+
/*
* Domain members share CPU package resources (i.e. caches)
*
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 56cffe42abbc..df489a1db6b7 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -45,7 +45,7 @@ static inline int cpu_smt_flags(void)
#ifdef CONFIG_SCHED_CLUSTER
static inline int cpu_cluster_flags(void)
{
- return SD_SHARE_PKG_RESOURCES;
+ return SD_CLUSTER | SD_SHARE_PKG_RESOURCES;
}
#endif
@@ -178,6 +178,7 @@ cpumask_var_t *alloc_sched_domains(unsigned int ndoms);
void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
bool cpus_share_cache(int this_cpu, int that_cpu);
+bool cpus_share_resources(int this_cpu, int that_cpu);
typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
typedef int (*sched_domain_flags_f)(void);
@@ -231,6 +232,11 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu)
return true;
}
+static inline bool cpus_share_resources(int this_cpu, int that_cpu)
+{
+ return true;
+}
+
#endif /* !CONFIG_SMP */
#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bfa7452ca92e..79a6f012b0cf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3808,6 +3808,18 @@ bool cpus_share_cache(int this_cpu, int that_cpu)
return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
}
+/*
+ * Whether CPUs share cache resources, which means the LLC on non-cluster
+ * machines and the LLC tag or L2 cache on machines with clusters.
+ */
+bool cpus_share_resources(int this_cpu, int that_cpu)
+{
+ if (this_cpu == that_cpu)
+ return true;
+
+ return per_cpu(sd_share_id, this_cpu) == per_cpu(sd_share_id, that_cpu);
+}
+
static inline bool ttwu_queue_cond(int cpu, int wake_flags)
{
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 01259611beb9..b9bcfcf8d14d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1753,7 +1753,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
DECLARE_PER_CPU(int, sd_llc_size);
DECLARE_PER_CPU(int, sd_llc_id);
+DECLARE_PER_CPU(int, sd_share_id);
DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 05b6c2ad90b9..0595827d481d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -664,6 +664,8 @@ static void destroy_sched_domains(struct sched_domain *sd)
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
DEFINE_PER_CPU(int, sd_llc_size);
DEFINE_PER_CPU(int, sd_llc_id);
+DEFINE_PER_CPU(int, sd_share_id);
+DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
@@ -689,6 +691,18 @@ static void update_top_cache_domain(int cpu)
per_cpu(sd_llc_id, cpu) = id;
rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
+ sd = lowest_flag_domain(cpu, SD_CLUSTER);
+ if (sd)
+ id = cpumask_first(sched_domain_span(sd));
+ rcu_assign_pointer(per_cpu(sd_cluster, cpu), sd);
+
+ /*
+ * This assignment should be placed after sd_llc_id as
+ * we want this id to equal the cluster id on cluster machines
+ * but the LLC id on non-cluster machines.
+ */
+ per_cpu(sd_share_id, cpu) = id;
+
sd = lowest_flag_domain(cpu, SD_NUMA);
rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
@@ -1532,6 +1546,7 @@ static struct cpumask ***sched_domains_numa_masks;
*/
#define TOPOLOGY_SD_FLAGS \
(SD_SHARE_CPUCAPACITY | \
+ SD_CLUSTER | \
SD_SHARE_PKG_RESOURCES | \
SD_NUMA | \
SD_ASYM_PACKING)
--
2.24.0
* [PATCH v3 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
2022-06-08 9:57 [PATCH v3 0/2] sched/fair: Wake task within the cluster when possible Yicong Yang
2022-06-08 9:57 ` [PATCH v3 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API Yicong Yang
@ 2022-06-08 9:57 ` Yicong Yang
2022-06-09 9:54 ` kernel test robot
2022-06-09 10:14 ` kernel test robot
1 sibling, 2 replies; 7+ messages in thread
From: Yicong Yang @ 2022-06-08 9:57 UTC (permalink / raw)
To: peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen,
gautham.shenoy, linux-kernel, linux-arm-kernel
Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
yangyicong, jonathan.cameron, ego, srikar, linuxarm, 21cnbao,
guodong.xu, hesham.almatary, john.garry, shenyang39
From: Barry Song <song.bao.hua@hisilicon.com>
For platforms with clusters, such as Kunpeng920, CPUs within the same
cluster have lower latency when synchronizing and accessing shared
resources like the cache. Thus, this patch tries to find an idle CPU
within the cluster of the target CPU before scanning the whole LLC,
to gain lower latency.
Note that neither Kunpeng920 nor x86 Jacobsville supports SMT, so this
patch does not consider SMT for the moment.
Testing has been done on Kunpeng920 by pinning tasks to one NUMA node
and to two NUMA nodes. On Kunpeng920, each NUMA node has 8 clusters and
each cluster has 4 CPUs. With this patch, we observed improvements on
tbench both within one NUMA node and across two NUMA nodes.
On NUMA node 0:
5.19-rc1 patched
Hmean 1 350.27 ( 0.00%) 406.88 * 16.16%*
Hmean 2 702.01 ( 0.00%) 808.22 * 15.13%*
Hmean 4 1405.14 ( 0.00%) 1614.34 * 14.89%*
Hmean 8 2830.53 ( 0.00%) 3169.02 * 11.96%*
Hmean 16 5597.95 ( 0.00%) 6224.20 * 11.19%*
Hmean 32 10537.38 ( 0.00%) 10524.97 * -0.12%*
Hmean 64 8366.04 ( 0.00%) 8437.41 * 0.85%*
Hmean 128 7060.87 ( 0.00%) 7150.25 * 1.27%*
On NUMA nodes 0-1:
5.19-rc1 patched
Hmean 1 346.11 ( 0.00%) 408.47 * 18.02%*
Hmean 2 693.34 ( 0.00%) 805.78 * 16.22%*
Hmean 4 1384.96 ( 0.00%) 1602.49 * 15.71%*
Hmean 8 2699.45 ( 0.00%) 3069.98 * 13.73%*
Hmean 16 5327.11 ( 0.00%) 5688.19 * 6.78%*
Hmean 32 10019.10 ( 0.00%) 11862.56 * 18.40%*
Hmean 64 13850.57 ( 0.00%) 17748.54 * 28.14%*
Hmean 128 12498.25 ( 0.00%) 15541.59 * 24.35%*
Hmean 256 11195.77 ( 0.00%) 13854.06 * 23.74%*
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
---
kernel/sched/fair.c | 43 ++++++++++++++++++++++++++++++++++++++++---
1 file changed, 40 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 77b2048a9326..f0496b93449c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6327,6 +6327,39 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
#endif /* CONFIG_SCHED_SMT */
+#ifdef CONFIG_SCHED_CLUSTER
+/*
+ * Scan the cluster domain for idle CPUs and clear the cluster cpumask after scanning
+ */
+static inline int scan_cluster(struct task_struct *p, struct cpumask *cpus,
+ int target, int *nr)
+{
+ struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target));
+ int cpu, idle_cpu;
+
+ /* TODO: Support SMT system with cluster topology */
+ if (!sched_smt_active() && sd) {
+ for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) {
+ if (!--*nr)
+ break;
+
+ idle_cpu = __select_idle_cpu(cpu, p);
+ if ((unsigned int)idle_cpu < nr_cpumask_bits)
+ return idle_cpu;
+ }
+
+ cpumask_andnot(cpus, cpus, sched_domain_span(sd));
+ }
+
+ return -1;
+}
+#else
+static inline int scan_cluster(struct task_struct *p, int prev_cpu, int target, int *nr)
+{
+ return -1;
+}
+#endif
+
/*
* Scan the LLC domain for idle CPUs; this is dynamically regulated by
* comparing the average scan cost (tracked in sd->avg_scan_cost) against the
@@ -6375,6 +6408,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
time = cpu_clock(this);
}
+ idle_cpu = scan_cluster(p, cpus, target, &nr);
+ if ((unsigned int)idle_cpu < nr_cpumask_bits)
+ return idle_cpu;
+
for_each_cpu_wrap(cpu, cpus, target + 1) {
if (has_idle_core) {
i = select_idle_core(p, cpu, cpus, &idle_cpu);
@@ -6382,7 +6419,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
return i;
} else {
- if (!--nr)
+ if (--nr <= 0)
return -1;
idle_cpu = __select_idle_cpu(cpu, p);
if ((unsigned int)idle_cpu < nr_cpumask_bits)
@@ -6481,7 +6518,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
/*
* If the previous CPU is cache affine and idle, don't be stupid:
*/
- if (prev != target && cpus_share_cache(prev, target) &&
+ if (prev != target && cpus_share_resources(prev, target) &&
(available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
asym_fits_capacity(task_util, prev))
return prev;
@@ -6507,7 +6544,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
p->recent_used_cpu = prev;
if (recent_used_cpu != prev &&
recent_used_cpu != target &&
- cpus_share_cache(recent_used_cpu, target) &&
+ cpus_share_resources(recent_used_cpu, target) &&
(available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
asym_fits_capacity(task_util, recent_used_cpu)) {
--
2.24.0
* Re: [PATCH v3 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
2022-06-08 9:57 ` [PATCH v3 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
@ 2022-06-09 9:54 ` kernel test robot
2022-06-09 10:14 ` kernel test robot
1 sibling, 0 replies; 7+ messages in thread
From: kernel test robot @ 2022-06-09 9:54 UTC (permalink / raw)
To: Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot,
tim.c.chen, gautham.shenoy, linux-kernel, linux-arm-kernel
Cc: llvm, kbuild-all, dietmar.eggemann, rostedt, bsegall, bristot,
prime.zeng, yangyicong, jonathan.cameron, ego, srikar, linuxarm,
21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39
Hi Yicong,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on tip/sched/core]
[also build test WARNING on linus/master v5.19-rc1 next-20220609]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/intel-lab-lkp/linux/commits/Yicong-Yang/sched-fair-Wake-task-within-the-cluster-when-possible/20220608-181847
base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 991d8d8142cad94f9c5c05db25e67fa83d6f772a
config: x86_64-randconfig-a005 (https://download.01.org/0day-ci/archive/20220609/202206091721.rhB7mm5c-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project b92436efcb7813fc481b30f2593a4907568d917a)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/f2b15e8641f351783c1d47bc654ace164300b7f1
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Yicong-Yang/sched-fair-Wake-task-within-the-cluster-when-possible/20220608-181847
git checkout f2b15e8641f351783c1d47bc654ace164300b7f1
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash kernel/sched/
If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>
All warnings (new ones prefixed by >>):
kernel/sched/fair.c:5512:6: warning: no previous prototype for function 'init_cfs_bandwidth' [-Wmissing-prototypes]
void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
^
kernel/sched/fair.c:5512:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
^
static
>> kernel/sched/fair.c:6381:29: warning: incompatible pointer to integer conversion passing 'struct cpumask *' to parameter of type 'int' [-Wint-conversion]
idle_cpu = scan_cluster(p, cpus, target, &nr);
^~~~
kernel/sched/fair.c:6327:59: note: passing argument to parameter 'prev_cpu' here
static inline int scan_cluster(struct task_struct *p, int prev_cpu, int target, int *nr)
^
kernel/sched/fair.c:11734:6: warning: no previous prototype for function 'free_fair_sched_group' [-Wmissing-prototypes]
void free_fair_sched_group(struct task_group *tg) { }
^
kernel/sched/fair.c:11734:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void free_fair_sched_group(struct task_group *tg) { }
^
static
kernel/sched/fair.c:11736:5: warning: no previous prototype for function 'alloc_fair_sched_group' [-Wmissing-prototypes]
int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
^
kernel/sched/fair.c:11736:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
^
static
kernel/sched/fair.c:11741:6: warning: no previous prototype for function 'online_fair_sched_group' [-Wmissing-prototypes]
void online_fair_sched_group(struct task_group *tg) { }
^
kernel/sched/fair.c:11741:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void online_fair_sched_group(struct task_group *tg) { }
^
static
kernel/sched/fair.c:11743:6: warning: no previous prototype for function 'unregister_fair_sched_group' [-Wmissing-prototypes]
void unregister_fair_sched_group(struct task_group *tg) { }
^
kernel/sched/fair.c:11743:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void unregister_fair_sched_group(struct task_group *tg) { }
^
static
kernel/sched/fair.c:489:20: warning: unused function 'list_del_leaf_cfs_rq' [-Wunused-function]
static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
^
kernel/sched/fair.c:510:19: warning: unused function 'tg_is_idle' [-Wunused-function]
static inline int tg_is_idle(struct task_group *tg)
^
kernel/sched/fair.c:5493:20: warning: unused function 'sync_throttle' [-Wunused-function]
static inline void sync_throttle(struct task_group *tg, int cpu) {}
^
kernel/sched/fair.c:5518:37: warning: unused function 'tg_cfs_bandwidth' [-Wunused-function]
static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
^
kernel/sched/fair.c:5522:20: warning: unused function 'destroy_cfs_bandwidth' [-Wunused-function]
static inline void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
^
11 warnings generated.
vim +6381 kernel/sched/fair.c
6332
6333 /*
6334 * Scan the LLC domain for idle CPUs; this is dynamically regulated by
6335 * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
6336 * average idle time for this rq (as found in rq->avg_idle).
6337 */
6338 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
6339 {
6340 struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
6341 int i, cpu, idle_cpu = -1, nr = INT_MAX;
6342 struct rq *this_rq = this_rq();
6343 int this = smp_processor_id();
6344 struct sched_domain *this_sd;
6345 u64 time = 0;
6346
6347 this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
6348 if (!this_sd)
6349 return -1;
6350
6351 cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
6352
6353 if (sched_feat(SIS_PROP) && !has_idle_core) {
6354 u64 avg_cost, avg_idle, span_avg;
6355 unsigned long now = jiffies;
6356
6357 /*
6358 * If we're busy, the assumption that the last idle period
6359 * predicts the future is flawed; age away the remaining
6360 * predicted idle time.
6361 */
6362 if (unlikely(this_rq->wake_stamp < now)) {
6363 while (this_rq->wake_stamp < now && this_rq->wake_avg_idle) {
6364 this_rq->wake_stamp++;
6365 this_rq->wake_avg_idle >>= 1;
6366 }
6367 }
6368
6369 avg_idle = this_rq->wake_avg_idle;
6370 avg_cost = this_sd->avg_scan_cost + 1;
6371
6372 span_avg = sd->span_weight * avg_idle;
6373 if (span_avg > 4*avg_cost)
6374 nr = div_u64(span_avg, avg_cost);
6375 else
6376 nr = 4;
6377
6378 time = cpu_clock(this);
6379 }
6380
> 6381 idle_cpu = scan_cluster(p, cpus, target, &nr);
6382 if ((unsigned int)idle_cpu < nr_cpumask_bits)
6383 return idle_cpu;
6384
6385 for_each_cpu_wrap(cpu, cpus, target + 1) {
6386 if (has_idle_core) {
6387 i = select_idle_core(p, cpu, cpus, &idle_cpu);
6388 if ((unsigned int)i < nr_cpumask_bits)
6389 return i;
6390
6391 } else {
6392 if (--nr <= 0)
6393 return -1;
6394 idle_cpu = __select_idle_cpu(cpu, p);
6395 if ((unsigned int)idle_cpu < nr_cpumask_bits)
6396 break;
6397 }
6398 }
6399
6400 if (has_idle_core)
6401 set_idle_cores(target, false);
6402
6403 if (sched_feat(SIS_PROP) && !has_idle_core) {
6404 time = cpu_clock(this) - time;
6405
6406 /*
6407 * Account for the scan cost of wakeups against the average
6408 * idle time.
6409 */
6410 this_rq->wake_avg_idle -= min(this_rq->wake_avg_idle, time);
6411
6412 update_avg(&this_sd->avg_scan_cost, time);
6413 }
6414
6415 return idle_cpu;
6416 }
6417
--
0-DAY CI Kernel Test Service
https://01.org/lkp
* Re: [PATCH v3 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
2022-06-08 9:57 ` [PATCH v3 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
2022-06-09 9:54 ` kernel test robot
@ 2022-06-09 10:14 ` kernel test robot
2022-06-09 11:25 ` Yicong Yang
1 sibling, 1 reply; 7+ messages in thread
From: kernel test robot @ 2022-06-09 10:14 UTC (permalink / raw)
To: Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot,
tim.c.chen, gautham.shenoy, linux-kernel, linux-arm-kernel
Cc: kbuild-all, dietmar.eggemann, rostedt, bsegall, bristot,
prime.zeng, yangyicong, jonathan.cameron, ego, srikar, linuxarm,
21cnbao, guodong.xu, hesham.almatary, john.garry, shenyang39
Hi Yicong,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on tip/sched/core]
[also build test WARNING on linus/master v5.19-rc1 next-20220609]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/intel-lab-lkp/linux/commits/Yicong-Yang/sched-fair-Wake-task-within-the-cluster-when-possible/20220608-181847
base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 991d8d8142cad94f9c5c05db25e67fa83d6f772a
config: x86_64-randconfig-a006 (https://download.01.org/0day-ci/archive/20220609/202206091846.fm1bYjWk-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-3) 11.3.0
reproduce (this is a W=1 build):
# https://github.com/intel-lab-lkp/linux/commit/f2b15e8641f351783c1d47bc654ace164300b7f1
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Yicong-Yang/sched-fair-Wake-task-within-the-cluster-when-possible/20220608-181847
git checkout f2b15e8641f351783c1d47bc654ace164300b7f1
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash kernel/sched/
If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>
All warnings (new ones prefixed by >>):
kernel/sched/fair.c: In function 'select_idle_cpu':
>> kernel/sched/fair.c:6381:36: warning: passing argument 2 of 'scan_cluster' makes integer from pointer without a cast [-Wint-conversion]
6381 | idle_cpu = scan_cluster(p, cpus, target, &nr);
| ^~~~
| |
| struct cpumask *
kernel/sched/fair.c:6327:59: note: expected 'int' but argument is of type 'struct cpumask *'
6327 | static inline int scan_cluster(struct task_struct *p, int prev_cpu, int target, int *nr)
| ~~~~^~~~~~~~
kernel/sched/fair.c: At top level:
kernel/sched/fair.c:11114:6: warning: no previous prototype for 'task_vruntime_update' [-Wmissing-prototypes]
11114 | void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi)
| ^~~~~~~~~~~~~~~~~~~~
vim +/scan_cluster +6381 kernel/sched/fair.c
6332
6333 /*
6334 * Scan the LLC domain for idle CPUs; this is dynamically regulated by
6335 * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
6336 * average idle time for this rq (as found in rq->avg_idle).
6337 */
6338 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
6339 {
6340 struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
6341 int i, cpu, idle_cpu = -1, nr = INT_MAX;
6342 struct rq *this_rq = this_rq();
6343 int this = smp_processor_id();
6344 struct sched_domain *this_sd;
6345 u64 time = 0;
6346
6347 this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
6348 if (!this_sd)
6349 return -1;
6350
6351 cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
6352
6353 if (sched_feat(SIS_PROP) && !has_idle_core) {
6354 u64 avg_cost, avg_idle, span_avg;
6355 unsigned long now = jiffies;
6356
6357 /*
6358 * If we're busy, the assumption that the last idle period
6359 * predicts the future is flawed; age away the remaining
6360 * predicted idle time.
6361 */
6362 if (unlikely(this_rq->wake_stamp < now)) {
6363 while (this_rq->wake_stamp < now && this_rq->wake_avg_idle) {
6364 this_rq->wake_stamp++;
6365 this_rq->wake_avg_idle >>= 1;
6366 }
6367 }
6368
6369 avg_idle = this_rq->wake_avg_idle;
6370 avg_cost = this_sd->avg_scan_cost + 1;
6371
6372 span_avg = sd->span_weight * avg_idle;
6373 if (span_avg > 4*avg_cost)
6374 nr = div_u64(span_avg, avg_cost);
6375 else
6376 nr = 4;
6377
6378 time = cpu_clock(this);
6379 }
6380
> 6381 idle_cpu = scan_cluster(p, cpus, target, &nr);
6382 if ((unsigned int)idle_cpu < nr_cpumask_bits)
6383 return idle_cpu;
6384
6385 for_each_cpu_wrap(cpu, cpus, target + 1) {
6386 if (has_idle_core) {
6387 i = select_idle_core(p, cpu, cpus, &idle_cpu);
6388 if ((unsigned int)i < nr_cpumask_bits)
6389 return i;
6390
6391 } else {
6392 if (--nr <= 0)
6393 return -1;
6394 idle_cpu = __select_idle_cpu(cpu, p);
6395 if ((unsigned int)idle_cpu < nr_cpumask_bits)
6396 break;
6397 }
6398 }
6399
6400 if (has_idle_core)
6401 set_idle_cores(target, false);
6402
6403 if (sched_feat(SIS_PROP) && !has_idle_core) {
6404 time = cpu_clock(this) - time;
6405
6406 /*
6407 * Account for the scan cost of wakeups against the average
6408 * idle time.
6409 */
6410 this_rq->wake_avg_idle -= min(this_rq->wake_avg_idle, time);
6411
6412 update_avg(&this_sd->avg_scan_cost, time);
6413 }
6414
6415 return idle_cpu;
6416 }
6417
--
0-DAY CI Kernel Test Service
https://01.org/lkp
* Re: [PATCH v3 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
2022-06-09 10:14 ` kernel test robot
@ 2022-06-09 11:25 ` Yicong Yang
0 siblings, 0 replies; 7+ messages in thread
From: Yicong Yang @ 2022-06-09 11:25 UTC (permalink / raw)
To: kernel test robot, Yicong Yang, peterz, mingo, juri.lelli,
vincent.guittot, tim.c.chen, gautham.shenoy, linux-kernel,
linux-arm-kernel
Cc: kbuild-all, dietmar.eggemann, rostedt, bsegall, bristot,
prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao,
guodong.xu, hesham.almatary, john.garry, shenyang39
On 2022/6/9 18:14, kernel test robot wrote:
> Hi Yicong,
>
> Thank you for the patch! Perhaps something to improve:
>
> [auto build test WARNING on tip/sched/core]
> [also build test WARNING on linus/master v5.19-rc1 next-20220609]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Yicong-Yang/sched-fair-Wake-task-within-the-cluster-when-possible/20220608-181847
> base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 991d8d8142cad94f9c5c05db25e67fa83d6f772a
> config: x86_64-randconfig-a006 (https://download.01.org/0day-ci/archive/20220609/202206091846.fm1bYjWk-lkp@intel.com/config)
> compiler: gcc-11 (Debian 11.3.0-3) 11.3.0
> reproduce (this is a W=1 build):
> # https://github.com/intel-lab-lkp/linux/commit/f2b15e8641f351783c1d47bc654ace164300b7f1
> git remote add linux-review https://github.com/intel-lab-lkp/linux
> git fetch --no-tags linux-review Yicong-Yang/sched-fair-Wake-task-within-the-cluster-when-possible/20220608-181847
> git checkout f2b15e8641f351783c1d47bc654ace164300b7f1
> # save the config file
> mkdir build_dir && cp config build_dir/.config
> make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash kernel/sched/
>
> If you fix the issue, kindly add following tag where applicable
> Reported-by: kernel test robot <lkp@intel.com>
>
> All warnings (new ones prefixed by >>):
>
> kernel/sched/fair.c: In function 'select_idle_cpu':
>>> kernel/sched/fair.c:6381:36: warning: passing argument 2 of 'scan_cluster' makes integer from pointer without a cast [-Wint-conversion]
I didn't change the scan_cluster() stub correspondingly, which leads to this warning. Thanks for catching this; it will be fixed in v4.
> 6381 | idle_cpu = scan_cluster(p, cpus, target, &nr);
> | ^~~~
> | |
> | struct cpumask *
> kernel/sched/fair.c:6327:59: note: expected 'int' but argument is of type 'struct cpumask *'
> 6327 | static inline int scan_cluster(struct task_struct *p, int prev_cpu, int target, int *nr)
> | ~~~~^~~~~~~~
> kernel/sched/fair.c: At top level:
> kernel/sched/fair.c:11114:6: warning: no previous prototype for 'task_vruntime_update' [-Wmissing-prototypes]
> 11114 | void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi)
> | ^~~~~~~~~~~~~~~~~~~~
>
>
> vim +/scan_cluster +6381 kernel/sched/fair.c
>
> 6332
> 6333 /*
> 6334 * Scan the LLC domain for idle CPUs; this is dynamically regulated by
> 6335 * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
> 6336 * average idle time for this rq (as found in rq->avg_idle).
> 6337 */
> 6338 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
> 6339 {
> 6340 struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> 6341 int i, cpu, idle_cpu = -1, nr = INT_MAX;
> 6342 struct rq *this_rq = this_rq();
> 6343 int this = smp_processor_id();
> 6344 struct sched_domain *this_sd;
> 6345 u64 time = 0;
> 6346
> 6347 this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> 6348 if (!this_sd)
> 6349 return -1;
> 6350
> 6351 cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> 6352
> 6353 if (sched_feat(SIS_PROP) && !has_idle_core) {
> 6354 u64 avg_cost, avg_idle, span_avg;
> 6355 unsigned long now = jiffies;
> 6356
> 6357 /*
> 6358 * If we're busy, the assumption that the last idle period
> 6359 * predicts the future is flawed; age away the remaining
> 6360 * predicted idle time.
> 6361 */
> 6362 if (unlikely(this_rq->wake_stamp < now)) {
> 6363 while (this_rq->wake_stamp < now && this_rq->wake_avg_idle) {
> 6364 this_rq->wake_stamp++;
> 6365 this_rq->wake_avg_idle >>= 1;
> 6366 }
> 6367 }
> 6368
> 6369 avg_idle = this_rq->wake_avg_idle;
> 6370 avg_cost = this_sd->avg_scan_cost + 1;
> 6371
> 6372 span_avg = sd->span_weight * avg_idle;
> 6373 if (span_avg > 4*avg_cost)
> 6374 nr = div_u64(span_avg, avg_cost);
> 6375 else
> 6376 nr = 4;
> 6377
> 6378 time = cpu_clock(this);
> 6379 }
> 6380
>> 6381 idle_cpu = scan_cluster(p, cpus, target, &nr);
> 6382 if ((unsigned int)idle_cpu < nr_cpumask_bits)
> 6383 return idle_cpu;
> 6384
> 6385 for_each_cpu_wrap(cpu, cpus, target + 1) {
> 6386 if (has_idle_core) {
> 6387 i = select_idle_core(p, cpu, cpus, &idle_cpu);
> 6388 if ((unsigned int)i < nr_cpumask_bits)
> 6389 return i;
> 6390
> 6391 } else {
> 6392 if (--nr <= 0)
> 6393 return -1;
> 6394 idle_cpu = __select_idle_cpu(cpu, p);
> 6395 if ((unsigned int)idle_cpu < nr_cpumask_bits)
> 6396 break;
> 6397 }
> 6398 }
> 6399
> 6400 if (has_idle_core)
> 6401 set_idle_cores(target, false);
> 6402
> 6403 if (sched_feat(SIS_PROP) && !has_idle_core) {
> 6404 time = cpu_clock(this) - time;
> 6405
> 6406 /*
> 6407 * Account for the scan cost of wakeups against the average
> 6408 * idle time.
> 6409 */
> 6410 this_rq->wake_avg_idle -= min(this_rq->wake_avg_idle, time);
> 6411
> 6412 update_avg(&this_sd->avg_scan_cost, time);
> 6413 }
> 6414
> 6415 return idle_cpu;
> 6416 }
> 6417
>
* Re: [PATCH v3 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API
2022-06-08 9:57 ` [PATCH v3 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API Yicong Yang
@ 2022-06-15 13:29 ` K Prateek Nayak
0 siblings, 0 replies; 7+ messages in thread
From: K Prateek Nayak @ 2022-06-15 13:29 UTC (permalink / raw)
To: Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot,
tim.c.chen, gautham.shenoy, linux-kernel, linux-arm-kernel
Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu,
hesham.almatary, john.garry, shenyang39
Hello Yicong,
We are observing a serious regression with tbench with this patch
series applied. The issue does not seem to be related to the actual
functionality of the patches, but rather to how they change the
per-CPU variable layout.
Discussed below are the results from running tbench on a dual
socket Zen3 (2 x 64C/128T) system configured in different NPS modes.
NPS modes are used to logically divide a single socket into
multiple NUMA regions.
Following is the NUMA configuration for each NPS mode on the system:
NPS1: Each socket is a NUMA node.
Total 2 NUMA nodes in the dual socket machine.
Node 0: 0-63, 128-191
Node 1: 64-127, 192-255
NPS2: Each socket is further logically divided into 2 NUMA regions.
Total 4 NUMA nodes exist over 2 sockets.
Node 0: 0-31, 128-159
Node 1: 32-63, 160-191
Node 2: 64-95, 192-223
Node 3: 96-127, 224-255
NPS4: Each socket is logically divided into 4 NUMA regions.
Total 8 NUMA nodes exist over 2 sockets.
Node 0: 0-15, 128-143
Node 1: 16-31, 144-159
Node 2: 32-47, 160-175
Node 3: 48-63, 176-191
Node 4: 64-79, 192-207
Node 5: 80-95, 208-223
Node 6: 96-111, 224-239
Node 7: 112-127, 240-255
Benchmark Results:
Kernel versions:
- tip: 5.19.0-rc2 tip sched/core
- cluster: 5.19.0-rc2 tip sched/core + both the patches of the series
When we started testing, the tip was at:
commit: f3dd3f674555 "sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle"
* - Data points of concern
~~~~~~
tbench
~~~~~~
NPS1
Clients: tip cluster
1 444.41 (0.00 pct) 439.27 (-1.15 pct)
2 879.23 (0.00 pct) 831.49 (-5.42 pct) *
4 1648.83 (0.00 pct) 1608.07 (-2.47 pct)
8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) *
16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) *
32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) *
64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) *
128 30795.27 (0.00 pct) 30861.34 (0.21 pct)
256 25138.43 (0.00 pct) 24711.90 (-1.69 pct)
512 51287.93 (0.00 pct) 51855.55 (1.10 pct)
1024 53176.97 (0.00 pct) 52554.55 (-1.17 pct)
NPS2
Clients: tip cluster
1 445.45 (0.00 pct) 441.75 (-0.83 pct)
2 869.24 (0.00 pct) 845.61 (-2.71 pct)
4 1644.28 (0.00 pct) 1586.49 (-3.51 pct)
8 3120.83 (0.00 pct) 2967.01 (-4.92 pct) *
16 5972.29 (0.00 pct) 5208.58 (-12.78 pct) *
32 11776.38 (0.00 pct) 10229.53 (-13.13 pct) *
64 20933.15 (0.00 pct) 17033.45 (-18.62 pct) *
128 32195.00 (0.00 pct) 29507.85 (-8.34 pct) *
256 24641.52 (0.00 pct) 27225.00 (10.48 pct)
512 50806.96 (0.00 pct) 51377.50 (1.12 pct)
1024 51993.96 (0.00 pct) 50773.35 (-2.34 pct)
NPS4
Clients: tip cluster
1 442.10 (0.00 pct) 435.06 (-1.59 pct)
2 870.94 (0.00 pct) 858.64 (-1.41 pct)
4 1615.30 (0.00 pct) 1607.27 (-0.49 pct)
8 3195.95 (0.00 pct) 3020.63 (-5.48 pct) *
16 5937.41 (0.00 pct) 5719.87 (-3.66 pct)
32 11800.41 (0.00 pct) 11229.65 (-4.83 pct) *
64 20844.71 (0.00 pct) 20432.79 (-1.97 pct)
128 31003.62 (0.00 pct) 29441.20 (-5.03 pct) *
256 27476.37 (0.00 pct) 25857.30 (-5.89 pct) * [Known to have run-to-run variance]
512 52276.72 (0.00 pct) 51659.16 (-1.18 pct)
1024 51372.10 (0.00 pct) 51026.87 (-0.67 pct)
Note: tbench results for 256 workers are known to have
run-to-run variation on the test machine. Any regression
seen for that data point can be safely ignored.
The behavior is consistent for both the tip and patched kernels
across multiple runs of tbench.
~~~~~~~~~~~~~~~~~~~~
Analysis done so far
~~~~~~~~~~~~~~~~~~~~
To root-cause this issue more quickly, we have focused on the 8 to 64
client data points with the machine running in NPS1 mode.
- Even with the HW prefetcher disabled, the behavior remains consistent,
signifying that the HW prefetcher is not the cause of the problem.
- Bisecting:
When we ran the tests with only Patch 1 of the series applied,
the regression was still visible and the numbers were worse.
Clients: tip cluster Patch 1 Only
8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3018.63 (-7.51 pct)
16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 4869.26 (-18.99 pct)
32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 8159.60 (-32.33 pct)
64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 13161.92 (-38.08 pct)
We further bisected the hunks and narrowed down the cause to the
per-CPU variable declarations.
On 6/8/2022 3:27 PM, Yicong Yang wrote:
>
> [..snip..]
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 01259611beb9..b9bcfcf8d14d 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1753,7 +1753,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> DECLARE_PER_CPU(int, sd_llc_size);
> DECLARE_PER_CPU(int, sd_llc_id);
> +DECLARE_PER_CPU(int, sd_share_id);
> DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
The main reason for the regression seems to be the above declarations.
The regression seems to go away if we do one of the following:
- Declare sd_share_id and sd_cluster using DECLARE_PER_CPU_READ_MOSTLY()
instead of DECLARE_PER_CPU(), and change the corresponding definitions
below to DEFINE_PER_CPU_READ_MOSTLY().
Clients: tip Patch 1 Patch 1 (READ_MOSTLY)
8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3237.33 (-0.56 pct)
16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 5914.53 (-2.92 pct)
32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11536.05 (3.40 pct)
64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 21162.33 (0.67 pct)
- Convert sd_share_id and sd_cluster to static arrays.
Clients: tip Patch 1 Patch 1 (Static Array)
8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3203.27 (-1.61 pct)
16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 6198.35 (1.73 pct)
32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11385.76 (2.05 pct)
64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 21919.80 (4.28 pct)
- Move the declarations of sd_share_id and sd_cluster to the top
Clients: tip Patch 1 Patch 1 (Declaration on Top)
8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3072.30 (-5.63 pct)
16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 5586.59 (-8.30 pct)
32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11184.17 (0.24 pct)
64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 20289.70 (-3.47 pct)
Unfortunately, none of these are complete solutions. For example, using
DECLARE_PER_CPU_READ_MOSTLY() with both patches applied reduces the regression
but doesn't eliminate it entirely:
Clients: tip cluster cluster (READ_MOSTLY)
1 444.41 (0.00 pct) 439.27 (-1.15 pct) 435.95 (-1.90 pct)
2 879.23 (0.00 pct) 831.49 (-5.42 pct) 842.09 (-4.22 pct)
4 1648.83 (0.00 pct) 1608.07 (-2.47 pct) 1598.77 (-3.03 pct)
8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3090.86 (-5.29 pct) *
16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 5360.28 (-10.82 pct) *
32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 11083.66 (-8.08 pct) *
64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 20984.30 (-1.28 pct)
128 30795.27 (0.00 pct) 30861.34 (0.21 pct) 30735.20 (-0.19 pct)
256 25138.43 (0.00 pct) 24711.90 (-1.69 pct) 24021.21 (-4.44 pct)
512 51287.93 (0.00 pct) 51855.55 (1.10 pct) 51672.73 (0.75 pct)
1024 53176.97 (0.00 pct) 52554.55 (-1.17 pct) 52620.02 (-1.04 pct)
We are still trying to root-cause the underlying issue that
brought about such a drastic regression in tbench performance.
> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 05b6c2ad90b9..0595827d481d 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -664,6 +664,8 @@ static void destroy_sched_domains(struct sched_domain *sd)
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> DEFINE_PER_CPU(int, sd_llc_size);
> DEFINE_PER_CPU(int, sd_llc_id);
> +DEFINE_PER_CPU(int, sd_share_id);
> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> @@ -689,6 +691,18 @@ static void update_top_cache_domain(int cpu)
>
> [..snip..]
>
We would like some time to investigate this issue and root-cause
this regression.
--
Thanks and Regards,
Prateek
2022-06-08 9:57 [PATCH v3 0/2] sched/fair: Wake task within the cluster when possible Yicong Yang
2022-06-08 9:57 ` [PATCH v3 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API Yicong Yang
2022-06-15 13:29 ` K Prateek Nayak
2022-06-08 9:57 ` [PATCH v3 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
2022-06-09 9:54 ` kernel test robot
2022-06-09 10:14 ` kernel test robot
2022-06-09 11:25 ` Yicong Yang