* [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS
@ 2022-10-19 12:28 Abel Wu
2022-10-19 12:28 ` [PATCH v6 1/4] sched/fair: Skip core update if task pending Abel Wu
` (6 more replies)
0 siblings, 7 replies; 18+ messages in thread
From: Abel Wu @ 2022-10-19 12:28 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Mel Gorman, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider
Cc: Josh Don, Chen Yu, Tim Chen, K Prateek Nayak, Gautham R . Shenoy,
Aubrey Li, Qais Yousef, Juri Lelli, Rik van Riel, Yicong Yang,
Barry Song, linux-kernel, Abel Wu
This patchset tries to improve SIS scan efficiency by recording the
idle cpus of each LLC in a cpumask, which is then used as the target
cpuset of the domain scan. The cpus are recorded at CORE granule to
avoid tasks being stacked on the same core.
v5 -> v6:
- Rename SIS_FILTER to SIS_CORE as it can only be activated when
SMT is enabled and better describes the behavior of CORE granule
update & load delivery.
- Removed the part about limited scan for idle cores, since it might
be better to open another thread to discuss strategies such as
limited or scaled depth. But kept the full scan for idle cores when
the LLC is overloaded, because SIS_CORE can greatly reduce the
overhead of a full scan in that case.
- Removed the sd_is_busy state, which indicates that an LLC is fully
busy so the SIS domain scan can be safely skipped. I would prefer
to leave this to SIS_UTIL.
- The filter generation mechanism is replaced by in-place updates
during domain scan to better deal with partial scan failures.
- Collect Reviewed-bys from Tim Chen
v4 -> v5:
- Add limited scan for idle cores when overloaded, suggested by Mel
- Split out several patches since they are irrelevant to this scope
- Add quick check on ttwu_pending before core update
- Wrap the filter into SIS_FILTER feature, suggested by Chen Yu
- Move the main filter logic to the idle path, because the newidle
balance can bail out early if rq->avg_idle is small enough and
lose chances to update the filter.
v3 -> v4:
- Update filter in load_balance rather than in the tick
- Now the filter contains unoccupied cpus rather than overloaded ones
- Added mechanisms to deal with the false positive cases
v2 -> v3:
- Removed sched-idle balance feature and focus on SIS
- Take non-CFS tasks into consideration
- Several fixes/improvement suggested by Josh Don
v1 -> v2:
- Several optimizations on sched-idle balancing
- Ignore asym topos in can_migrate_task
- Add more benchmarks including SIS efficiency
- Re-organize patch as suggested by Mel Gorman
Abel Wu (4):
sched/fair: Skip core update if task pending
sched/fair: Ignore SIS_UTIL when has_idle_core
sched/fair: Introduce SIS_CORE
sched/fair: Deal with SIS scan failures
include/linux/sched/topology.h | 15 ++++
kernel/sched/fair.c | 122 +++++++++++++++++++++++++++++----
kernel/sched/features.h | 7 ++
kernel/sched/sched.h | 3 +
kernel/sched/topology.c | 8 ++-
5 files changed, 141 insertions(+), 14 deletions(-)
--
2.37.3
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v6 1/4] sched/fair: Skip core update if task pending
2022-10-19 12:28 [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS Abel Wu
@ 2022-10-19 12:28 ` Abel Wu
2022-10-19 12:28 ` [PATCH v6 2/4] sched/fair: Ignore SIS_UTIL when has_idle_core Abel Wu
` (5 subsequent siblings)
6 siblings, 0 replies; 18+ messages in thread
From: Abel Wu @ 2022-10-19 12:28 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Mel Gorman, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider
Cc: Josh Don, Chen Yu, Tim Chen, K Prateek Nayak, Gautham R . Shenoy,
Aubrey Li, Qais Yousef, Juri Lelli, Rik van Riel, Yicong Yang,
Barry Song, linux-kernel, Abel Wu
The function __update_idle_core() assumes this cpu is idle, so it
only checks its siblings to decide whether the resident core is idle,
updating the has_idle_cores hint if necessary. The problem is that
this cpu might not be idle any more by that moment, which makes the
hint misleading.
It's not practical to make this check everywhere in the idle path,
but checking just before the core update makes the has_idle_core
hint more reliable at negligible cost.
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
---
kernel/sched/fair.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5ffec4370602..e7f82fa92c5b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6294,6 +6294,9 @@ void __update_idle_core(struct rq *rq)
int core = cpu_of(rq);
int cpu;
+ if (rq->ttwu_pending)
+ return;
+
rcu_read_lock();
if (test_idle_cores(core))
goto unlock;
--
2.37.3
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH v6 2/4] sched/fair: Ignore SIS_UTIL when has_idle_core
2022-10-19 12:28 [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS Abel Wu
2022-10-19 12:28 ` [PATCH v6 1/4] sched/fair: Skip core update if task pending Abel Wu
@ 2022-10-19 12:28 ` Abel Wu
2022-10-19 12:28 ` [PATCH v6 3/4] sched/fair: Introduce SIS_CORE Abel Wu
` (4 subsequent siblings)
6 siblings, 0 replies; 18+ messages in thread
From: Abel Wu @ 2022-10-19 12:28 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Mel Gorman, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider
Cc: Josh Don, Chen Yu, Tim Chen, K Prateek Nayak, Gautham R . Shenoy,
Aubrey Li, Qais Yousef, Juri Lelli, Rik van Riel, Yicong Yang,
Barry Song, linux-kernel, Abel Wu
When SIS_UTIL is enabled, the SIS domain scan is skipped if the LLC
is overloaded, even if the has_idle_core hint is true. Since idle
load balancing is only triggered at tick boundaries, idle cores can
stay cold for a whole tick period while some other cpus might be
overloaded.
Give SIS a chance to scan for idle cores when the hint suggests the
effort is worthwhile.
Benchmark
=========
All of the benchmarks are done inside a normal cpu cgroup in a clean
environment with cpu turbo disabled, and the test machines are:
A) A dual socket Intel Xeon(R) Platinum 8260 machine with SNC
disabled, so there are 2 NUMA nodes each of which has 24C/48T.
Each NUMA node shares an LLC.
B) A dual socket AMD EPYC 7Y83 64-Core Processor machine with NPS1
enabled, so there are 2 NUMA nodes each of which has 64C/128T.
Each NUMA node contains several LLCs of 16 cpus each.
Based on tip sched/core fb04563d1cae (v5.19.0).
Results
=======
hackbench-process-pipes
(A) vanilla patched
Amean 1 0.2767 ( 0.00%) 0.2540 ( 8.19%)
Amean 4 0.6080 ( 0.00%) 0.6220 ( -2.30%)
Amean 7 0.7923 ( 0.00%) 0.8020 ( -1.22%)
Amean 12 1.3917 ( 0.00%) 1.1823 ( 15.04%)
Amean 21 3.6747 ( 0.00%) 2.7717 ( 24.57%)
Amean 30 6.7070 ( 0.00%) 5.1200 * 23.66%*
Amean 48 9.3537 ( 0.00%) 8.5890 * 8.18%*
Amean 79 11.6627 ( 0.00%) 11.2580 ( 3.47%)
Amean 110 13.4473 ( 0.00%) 13.1283 ( 2.37%)
Amean 141 16.4747 ( 0.00%) 15.5967 * 5.33%*
Amean 172 19.0000 ( 0.00%) 18.1153 * 4.66%*
Amean 203 21.4200 ( 0.00%) 21.1340 ( 1.34%)
Amean 234 24.2250 ( 0.00%) 23.8227 ( 1.66%)
Amean 265 27.2400 ( 0.00%) 26.8293 ( 1.51%)
Amean 296 30.6937 ( 0.00%) 29.5800 * 3.63%*
(B)
Amean 1 0.3543 ( 0.00%) 0.3650 ( -3.01%)
Amean 4 0.4623 ( 0.00%) 0.4837 ( -4.61%)
Amean 7 0.5117 ( 0.00%) 0.4997 ( 2.35%)
Amean 12 0.5707 ( 0.00%) 0.5863 ( -2.75%)
Amean 21 0.9717 ( 0.00%) 0.8930 * 8.10%*
Amean 30 1.4423 ( 0.00%) 1.2530 ( 13.13%)
Amean 48 2.3520 ( 0.00%) 1.9743 * 16.06%*
Amean 79 5.7193 ( 0.00%) 3.4933 * 38.92%*
Amean 110 6.9893 ( 0.00%) 5.5963 * 19.93%*
Amean 141 9.1103 ( 0.00%) 7.6550 ( 15.97%)
Amean 172 10.2490 ( 0.00%) 8.8323 * 13.82%*
Amean 203 11.3727 ( 0.00%) 10.8683 ( 4.43%)
Amean 234 12.7627 ( 0.00%) 11.8683 ( 7.01%)
Amean 265 13.8947 ( 0.00%) 13.4717 ( 3.04%)
Amean 296 14.1093 ( 0.00%) 13.8130 ( 2.10%)
The results can be approximately divided into 3 sections:
- busy, e.g. <12 groups on A and <21 groups on B
- overloaded, e.g. 12~48 groups on A and 21~172 groups on B
- saturated, the rest
For the busy part the results are neutral with slight wins or losses.
This is probably because idle cpus are still not hard to find, so the
effort spent locating an idle core brings limited benefit that can
easily be negated by the cost of a full scan.
For the overloaded but not yet saturated part, a great improvement
can be seen, as cpu resources are better exploited by more actively
putting idle cores to work. But once all cpus are fully saturated,
scanning for idle cores doesn't help much.
One concern with the full scan is that its cost grows with LLC size,
but the test results still look positive. One possible reason is the
low SIS success rate (<2%), so the effort paid does trade for
efficiency.
tbench4 Throughput
(A) vanilla patched
Hmean 1 275.61 ( 0.00%) 280.53 * 1.78%*
Hmean 2 541.28 ( 0.00%) 561.94 * 3.82%*
Hmean 4 1102.62 ( 0.00%) 1109.14 * 0.59%*
Hmean 8 2149.58 ( 0.00%) 2229.39 * 3.71%*
Hmean 16 4305.40 ( 0.00%) 4383.06 * 1.80%*
Hmean 32 7088.36 ( 0.00%) 7124.14 * 0.50%*
Hmean 64 8609.16 ( 0.00%) 8815.41 * 2.40%*
Hmean 128 19304.92 ( 0.00%) 19519.35 * 1.11%*
Hmean 256 19147.04 ( 0.00%) 19392.24 * 1.28%*
Hmean 384 18970.86 ( 0.00%) 19201.07 * 1.21%*
(B)
Hmean 1 519.62 ( 0.00%) 515.98 * -0.70%*
Hmean 2 1042.92 ( 0.00%) 1031.54 * -1.09%*
Hmean 4 1959.10 ( 0.00%) 1953.44 * -0.29%*
Hmean 8 3842.82 ( 0.00%) 3622.52 * -5.73%*
Hmean 16 6768.50 ( 0.00%) 6545.82 * -3.29%*
Hmean 32 12589.50 ( 0.00%) 13697.73 * 8.80%*
Hmean 64 24797.23 ( 0.00%) 25589.59 * 3.20%*
Hmean 128 38036.66 ( 0.00%) 35667.64 * -6.23%*
Hmean 256 65069.93 ( 0.00%) 65215.85 * 0.22%*
Hmean 512 61147.99 ( 0.00%) 66035.57 * 7.99%*
Hmean 1024 48542.73 ( 0.00%) 53391.64 * 9.99%*
The tbench4 test has a ~40% success rate on the target, prev or
recent cpus, and a ~45% total success rate. The core scan is also
not very frequent, so the benefit this patch brings is limited, but
some gains can still be seen.
netperf
netperf has an almost 100% success rate on the target, prev or
recent cpus, so the domain scan is generally not performed and is
not affected by this patch.
Conclusion
==========
Performing a full scan for idle cores is generally good for making
better use of cpu resources.
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e7f82fa92c5b..7b668e16812e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6436,7 +6436,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
time = cpu_clock(this);
}
- if (sched_feat(SIS_UTIL)) {
+ if (sched_feat(SIS_UTIL) && !has_idle_core) {
sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
if (sd_share) {
/* because !--nr is the condition to stop scan */
--
2.37.3
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH v6 3/4] sched/fair: Introduce SIS_CORE
2022-10-19 12:28 [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS Abel Wu
2022-10-19 12:28 ` [PATCH v6 1/4] sched/fair: Skip core update if task pending Abel Wu
2022-10-19 12:28 ` [PATCH v6 2/4] sched/fair: Ignore SIS_UTIL when has_idle_core Abel Wu
@ 2022-10-19 12:28 ` Abel Wu
2022-10-21 4:03 ` Chen Yu
2022-10-19 12:28 ` [PATCH v6 4/4] sched/fair: Deal with SIS scan failures Abel Wu
` (3 subsequent siblings)
6 siblings, 1 reply; 18+ messages in thread
From: Abel Wu @ 2022-10-19 12:28 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Mel Gorman, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider
Cc: Josh Don, Chen Yu, Tim Chen, K Prateek Nayak, Gautham R . Shenoy,
Aubrey Li, Qais Yousef, Juri Lelli, Rik van Riel, Yicong Yang,
Barry Song, linux-kernel, Abel Wu
The wakeup fastpath for fair tasks, select_idle_sibling() aka SIS,
plays an important role in maximizing the usage of cpu resources and
can greatly affect overall performance of the system.
SIS tries to find an idle cpu inside the target LLC on which to
place the woken-up task. Cache-hot cpus are preferred, but if none
of them is idle, a domain scan is fired to check the other cpus of
that LLC. The scan is currently linear, which works well under light
pressure since lots of idle cpus are usually available.
But things change. LLCs are getting bigger in modern and future
machines, and players like cloud service providers are continuously
trying to use all kinds of resources more efficiently to reduce TCO.
In either case, locating an idle cpu is no longer as easy as before,
so a linear scan doesn't fit well.
Features like SIS_{UTIL,PROP} already exist to deal with this
scalability issue by limiting the scan depth, but it would be better
if the way the scan works could be improved as well. This is exactly
what SIS_CORE is for.
When SIS_CORE is enabled, a cpumask containing the idle cpus of each
LLC is maintained and stored in the LLC shared domain. The idle cpus
are recorded at CORE granule, so in theory at most one cpu of each
core can be set in the mask. The ideas behind this are:
- Recording idle cpus narrows down the SIS scan, so we can avoid
touching runqueues that must be in a busy state; runqueues are
among the hottest data in the system. And because all the possibly
idle cpus are in the mask, the has_idle_core hint still works.
- The rule of CORE-granule update helps spread load out to
different cores, making better use of cpu capacity.
A major concern is the accuracy of the idle cpumask. A cpu present
in the mask might no longer be idle; such a cpu is called a false
positive. False positives can negate much of the benefit this
feature brings. The strategy against them is introduced in the next
patch.
Another concern is the overhead of accessing the LLC-shared cpumask,
which can be more severe in large LLCs. But a perf stat on the cache
miss rate during hackbench doesn't show an obvious difference.
This patch records cpus at CORE granule for each LLC when they go
idle, with the cpumask stored in the LLC shared domain. The false
positive cpus are cleared when the SIS domain scan fails.
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
include/linux/sched/topology.h | 15 ++++++++++
kernel/sched/fair.c | 51 +++++++++++++++++++++++++++++++---
kernel/sched/features.h | 7 +++++
kernel/sched/topology.c | 8 +++++-
4 files changed, 76 insertions(+), 5 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 816df6cc444e..ac2162f33ada 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -82,6 +82,16 @@ struct sched_domain_shared {
atomic_t nr_busy_cpus;
int has_idle_cores;
int nr_idle_scan;
+
+ /*
+ * Used by sched feature SIS_CORE to record idle cpus at core
+ * granule to improve efficiency of SIS domain scan.
+ *
+ * NOTE: this field is variable length. (Allocated dynamically
+ * by attaching extra space to the end of the structure,
+ * depending on how many CPUs the kernel has booted up with)
+ */
+ unsigned long icpus[];
};
struct sched_domain {
@@ -167,6 +177,11 @@ static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
return to_cpumask(sd->span);
}
+static inline struct cpumask *sched_domain_icpus(struct sched_domain *sd)
+{
+ return to_cpumask(sd->shared->icpus);
+}
+
extern void partition_sched_domains_locked(int ndoms_new,
cpumask_var_t doms_new[],
struct sched_domain_attr *dattr_new);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7b668e16812e..3aa699e9d4af 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6282,6 +6282,33 @@ static inline bool test_idle_cores(int cpu)
return false;
}
+/*
+ * To honor the rule of CORE granule update, set this cpu to the LLC idle
+ * cpumask only if there is no cpu of this core showed up in the cpumask.
+ */
+static void update_idle_cpu(int cpu)
+{
+ struct sched_domain_shared *sds;
+
+ if (!sched_feat(SIS_CORE))
+ return;
+
+ sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+ if (sds) {
+ struct cpumask *icpus = to_cpumask(sds->icpus);
+
+ /*
+ * This is racy against clearing in select_idle_cpu(),
+ * and can lead to idle cpus miss the chance to be set to
+ * the idle cpumask, thus the idle cpus are temporarily
+ * out of reach in SIS domain scan. But it should be rare
+ * and we still have ILB to kick them working.
+ */
+ if (!cpumask_intersects(cpu_smt_mask(cpu), icpus))
+ cpumask_set_cpu(cpu, icpus);
+ }
+}
+
/*
* Scans the local SMT mask to see if the entire core is idle, and records this
* information in sd_llc_shared->has_idle_cores.
@@ -6298,6 +6325,7 @@ void __update_idle_core(struct rq *rq)
return;
rcu_read_lock();
+ update_idle_cpu(core);
if (test_idle_cores(core))
goto unlock;
@@ -6343,7 +6371,13 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
if (idle)
return core;
- cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
+ /*
+ * It is unlikely that more than one cpu of a core show up
+ * in the @cpus if SIS_CORE enabled.
+ */
+ if (!sched_feat(SIS_CORE))
+ cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
+
return -1;
}
@@ -6394,7 +6428,7 @@ static inline int select_idle_smt(struct task_struct *p, int target)
*/
static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
{
- struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
+ struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask), *icpus = NULL;
int i, cpu, idle_cpu = -1, nr = INT_MAX;
struct sched_domain_shared *sd_share;
struct rq *this_rq = this_rq();
@@ -6402,8 +6436,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
struct sched_domain *this_sd = NULL;
u64 time = 0;
- cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
-
if (sched_feat(SIS_PROP) && !has_idle_core) {
u64 avg_cost, avg_idle, span_avg;
unsigned long now = jiffies;
@@ -6447,6 +6479,11 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
}
}
+ if (sched_feat(SIS_CORE) && sched_smt_active())
+ icpus = sched_domain_icpus(sd);
+
+ cpumask_and(cpus, icpus ? icpus : sched_domain_span(sd), p->cpus_ptr);
+
for_each_cpu_wrap(cpu, cpus, target + 1) {
if (has_idle_core) {
i = select_idle_core(p, cpu, cpus, &idle_cpu);
@@ -6465,6 +6502,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
if (has_idle_core)
set_idle_cores(target, false);
+ if (icpus && idle_cpu == -1) {
+ /* Reset the idle cpu mask if a full scan fails */
+ if (nr > 0)
+ cpumask_clear(icpus);
+ }
+
if (sched_feat(SIS_PROP) && this_sd && !has_idle_core) {
time = cpu_clock(this) - time;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..bf3cae94caa6 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -63,6 +63,13 @@ SCHED_FEAT(TTWU_QUEUE, true)
SCHED_FEAT(SIS_PROP, false)
SCHED_FEAT(SIS_UTIL, true)
+/*
+ * Record idle cpus at core granule for each LLC to improve efficiency of
+ * SIS domain scan. Combine with the above features of limiting scan depth
+ * to better deal with the scalability issue.
+ */
+SCHED_FEAT(SIS_CORE, true)
+
/*
* Issue a WARN when we do multiple update_rq_clock() calls
* in a single rq->lock section. Default disabled because the
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..a2bb0091c10d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1641,6 +1641,12 @@ sd_init(struct sched_domain_topology_level *tl,
sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
atomic_inc(&sd->shared->ref);
atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+
+ /*
+ * This will temporarily break the rule of CORE granule,
+ * but will be fixed after SIS scan failures.
+ */
+ cpumask_copy(sched_domain_icpus(sd), sd_span);
}
sd->private = sdd;
@@ -2106,7 +2112,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
*per_cpu_ptr(sdd->sd, j) = sd;
- sds = kzalloc_node(sizeof(struct sched_domain_shared),
+ sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
GFP_KERNEL, cpu_to_node(j));
if (!sds)
return -ENOMEM;
--
2.37.3
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH v6 4/4] sched/fair: Deal with SIS scan failures
2022-10-19 12:28 [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS Abel Wu
` (2 preceding siblings ...)
2022-10-19 12:28 ` [PATCH v6 3/4] sched/fair: Introduce SIS_CORE Abel Wu
@ 2022-10-19 12:28 ` Abel Wu
2022-11-04 7:29 ` [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS Abel Wu
` (2 subsequent siblings)
6 siblings, 0 replies; 18+ messages in thread
From: Abel Wu @ 2022-10-19 12:28 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Mel Gorman, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider
Cc: Josh Don, Chen Yu, Tim Chen, K Prateek Nayak, Gautham R . Shenoy,
Aubrey Li, Qais Yousef, Juri Lelli, Rik van Riel, Yicong Yang,
Barry Song, linux-kernel, Abel Wu
When SIS_CORE is active, idle cpus are recorded into a cpumask at
CORE granule, and the SIS domain scan tries to locate an idle cpu
in this cpumask. So the quality of the cpumask matters. There are
generally two types of cpus that need to be dealt with:
- False positives: cpus that are present in the idle cpumask but
are actually busy. This can be caused by tasks being enqueued on
these cpus without them being removed from the cpumask
immediately. To solve this, the cpumask needs to be shrunk when
poor quality of the mask is observed.
- True negatives: cpus that are idle but do not show up in the idle
cpumask. This is due to the rule of CORE-granule update, under
which idle cpus are ignored when their SMT siblings are already
present in the idle cpumask.
According to the nature of the two types of cpus and some
heuristics, the strategies against SIS scan failures are as follows:
- It can be predicted that if a scan fails, the next scan from the
same cpu will probably fail too. So mark these cpus fail-prone;
a scan from a fail-prone cpu is the time to shrink the cpumask.
- All true negative cpus are SMT siblings of the false positive
cpus, and can naturally be treated as fallbacks for them. So a
fail-prone scan should also check the SMT siblings to see if any
true negative cpu exists.
The number of false positive cpus removed during one scan is not
explicitly constrained, but it is implicitly constrained by the load
of the LLC. If the LLC is under heavy pressure, both the weight of
the idle cpumask and the scan depth are reduced, and so is the
number of cpus removed.
To sum up, this patch marks a cpu fail-prone if a scan from it
failed last time, so the next scan from it will also check the SMT
siblings to see if any true negative cpus are available. A false
positive cpu is removed during a fail-prone scan if its whole core
is busy.
Benchmark
=========
All of the benchmarks are done inside a normal cpu cgroup in a clean
environment with cpu turbo disabled, and the test machines are:
A) A dual socket Intel Xeon(R) Platinum 8260 machine with SNC
disabled, so there are 2 NUMA nodes each of which has 24C/48T.
Each NUMA node shares an LLC.
B) A dual socket AMD EPYC 7Y83 64-Core Processor machine with NPS1
enabled, so there are 2 NUMA nodes each of which has 64C/128T.
Each NUMA node contains several LLCs of 16 cpus each.
The 'baseline' is based on tip sched/core fb04563d1cae (v5.19.0) with
the first 2 patches of this series applied. While 'sis_core' includes
the whole patchset.
Results
=======
hackbench-process-pipes
(A) baseline sis_core
Amean 1 0.2540 ( 0.00%) 0.2533 ( 0.26%)
Amean 4 0.6220 ( 0.00%) 0.5993 ( 3.64%)
Amean 7 0.8020 ( 0.00%) 0.7663 * 4.45%*
Amean 12 1.1823 ( 0.00%) 1.1037 * 6.65%*
Amean 21 2.7717 ( 0.00%) 2.2203 * 19.89%*
Amean 30 5.1200 ( 0.00%) 3.7133 * 27.47%*
Amean 48 8.5890 ( 0.00%) 7.0863 * 17.50%*
Amean 79 11.2580 ( 0.00%) 10.3717 * 7.87%*
Amean 110 13.1283 ( 0.00%) 12.4133 * 5.45%*
Amean 141 15.5967 ( 0.00%) 14.4883 * 7.11%*
Amean 172 18.1153 ( 0.00%) 17.2557 * 4.75%*
Amean 203 21.1340 ( 0.00%) 20.2807 ( 4.04%)
Amean 234 23.8227 ( 0.00%) 22.8510 ( 4.08%)
Amean 265 26.8293 ( 0.00%) 25.7367 * 4.07%*
Amean 296 29.5800 ( 0.00%) 28.7847 ( 2.69%)
(B)
Amean 1 0.3650 ( 0.00%) 0.3510 ( 3.84%)
Amean 4 0.4837 ( 0.00%) 0.4753 ( 1.72%)
Amean 7 0.4997 ( 0.00%) 0.5073 ( -1.53%)
Amean 12 0.5863 ( 0.00%) 0.5807 ( 0.97%)
Amean 21 0.8930 ( 0.00%) 0.8953 ( -0.26%)
Amean 30 1.2530 ( 0.00%) 1.2633 ( -0.82%)
Amean 48 1.9743 ( 0.00%) 1.9023 ( 3.65%)
Amean 79 3.4933 ( 0.00%) 3.2820 ( 6.05%)
Amean 110 5.5963 ( 0.00%) 5.3923 ( 3.65%)
Amean 141 7.6550 ( 0.00%) 6.8633 ( 10.34%)
Amean 172 8.8323 ( 0.00%) 8.2973 * 6.06%*
Amean 203 10.8683 ( 0.00%) 9.5170 * 12.43%*
Amean 234 11.8683 ( 0.00%) 10.6217 ( 10.50%)
Amean 265 13.4717 ( 0.00%) 11.9357 * 11.40%*
Amean 296 13.8130 ( 0.00%) 12.7430 * 7.75%*
The results on machine A can be approximately divided into 3
sections:
- busy, e.g. <21 groups
- overloaded, e.g. 21~48 groups
- saturated, the rest
The case of 296 groups on B has the same number of tasks per cpu as
110 groups on A, and likewise 30 groups on B matches 12 groups on A.
So the sections on A also apply to B, except that B only reaches the
first two. This implies that the feature behaves consistently on
LLCs of different sizes.
For the busy part the results are neutral with slight wins or
losses. It's not hard to find an idle cpu in this case, so SIS_CORE
doesn't outperform the linear scanner, given that maintaining the
cpumask has a cost which negates the slight benefit.
Once load increases, SIS_CORE helps improve throughput quite a lot
by squeezing out the hidden capacity of the cpus. And even under
extreme load pressure, when cpu capacity is almost fully utilized,
there is still some capacity left to exploit.
Although real-world systems are unlikely to be this loaded, long
running workloads like training jobs can also keep cpus busy and
can benefit a lot from this feature.
netperf
(A-udp) baseline sis_core
Hmean send-64 214.34 ( 0.00%) 210.79 * -1.65%*
Hmean send-128 427.90 ( 0.00%) 417.96 * -2.32%*
Hmean send-256 839.65 ( 0.00%) 823.78 * -1.89%*
Hmean send-1024 3207.45 ( 0.00%) 3167.96 * -1.23%*
Hmean send-2048 6097.24 ( 0.00%) 6089.01 ( -0.13%)
Hmean send-3312 9350.83 ( 0.00%) 9299.09 ( -0.55%)
Hmean send-4096 11368.25 ( 0.00%) 11186.44 * -1.60%*
Hmean send-8192 18273.21 ( 0.00%) 18103.81 ( -0.93%)
Hmean send-16384 28207.81 ( 0.00%) 28259.82 ( 0.18%)
(B-udp)
Hmean send-64 249.97 ( 0.00%) 256.99 * 2.81%*
Hmean send-128 500.68 ( 0.00%) 514.44 * 2.75%*
Hmean send-256 991.59 ( 0.00%) 1017.38 * 2.60%*
Hmean send-1024 3913.02 ( 0.00%) 3982.68 * 1.78%*
Hmean send-2048 7627.99 ( 0.00%) 7590.30 ( -0.49%)
Hmean send-3312 11907.07 ( 0.00%) 12114.03 * 1.74%*
Hmean send-4096 14300.09 ( 0.00%) 14753.34 * 3.17%*
Hmean send-8192 24576.21 ( 0.00%) 25431.42 * 3.48%*
Hmean send-16384 42105.89 ( 0.00%) 41813.30 ( -0.69%)
(A-tcp)
Hmean 64 1191.91 ( 0.00%) 1220.47 * 2.40%*
Hmean 128 2318.60 ( 0.00%) 2354.56 ( 1.55%)
Hmean 256 4267.41 ( 0.00%) 4226.72 ( -0.95%)
Hmean 1024 13190.66 ( 0.00%) 13065.91 ( -0.95%)
Hmean 2048 20466.22 ( 0.00%) 20704.66 ( 1.17%)
Hmean 3312 24363.57 ( 0.00%) 24613.99 * 1.03%*
Hmean 4096 26144.44 ( 0.00%) 26204.24 ( 0.23%)
Hmean 8192 30387.77 ( 0.00%) 30703.65 * 1.04%*
Hmean 16384 34942.71 ( 0.00%) 34205.44 * -2.11%*
(B-tcp)
Hmean 64 1971.18 ( 0.00%) 2120.61 * 7.58%*
Hmean 128 3752.96 ( 0.00%) 3995.68 * 6.47%*
Hmean 256 6861.58 ( 0.00%) 7342.93 * 7.02%*
Hmean 1024 21966.06 ( 0.00%) 23725.30 * 8.01%*
Hmean 2048 33949.66 ( 0.00%) 35620.67 * 4.92%*
Hmean 3312 40681.75 ( 0.00%) 41543.26 * 2.12%*
Hmean 4096 44309.70 ( 0.00%) 45390.03 * 2.44%*
Hmean 8192 50909.35 ( 0.00%) 52157.16 * 2.45%*
Hmean 16384 57198.37 ( 0.00%) 57686.96 ( 0.85%)
netperf has an almost 100% success rate on the target, prev or
recent cpus, so the domain scan is generally not performed. This
test shows how much overhead maintaining the idle cpumask adds, and
the neutral result suggests the overhead is acceptable.
tbench4 Throughput
(A) baseline sis_core
Hmean 1 280.53 ( 0.00%) 289.44 * 3.17%*
Hmean 2 561.94 ( 0.00%) 571.46 * 1.69%*
Hmean 4 1109.14 ( 0.00%) 1129.88 * 1.87%*
Hmean 8 2229.39 ( 0.00%) 2266.52 * 1.67%*
Hmean 16 4383.06 ( 0.00%) 4473.48 * 2.06%*
Hmean 32 7124.14 ( 0.00%) 7223.83 * 1.40%*
Hmean 64 8815.41 ( 0.00%) 8770.21 * -0.51%*
Hmean 128 19519.35 ( 0.00%) 20337.24 * 4.19%*
Hmean 256 19392.24 ( 0.00%) 20052.98 * 3.41%*
Hmean 384 19201.07 ( 0.00%) 19563.63 * 1.89%*
(B)
Hmean 1 515.98 ( 0.00%) 499.91 * -3.12%*
Hmean 2 1031.54 ( 0.00%) 1044.38 * 1.24%*
Hmean 4 1953.44 ( 0.00%) 1959.30 * 0.30%*
Hmean 8 3622.52 ( 0.00%) 3773.08 * 4.16%*
Hmean 16 6545.82 ( 0.00%) 6814.46 * 4.10%*
Hmean 32 13697.73 ( 0.00%) 13078.74 * -4.52%*
Hmean 64 25589.59 ( 0.00%) 24576.52 * -3.96%*
Hmean 128 35667.64 ( 0.00%) 37590.20 * 5.39%*
Hmean 256 65215.85 ( 0.00%) 64921.74 * -0.45%*
Hmean 512 66035.57 ( 0.00%) 63812.48 * -3.37%*
Hmean 1024 53391.64 ( 0.00%) 62356.50 * 16.79%*
Like netperf, the tbench4 benchmark also has a high success rate
(~39%) on the cache-hot cpus, with an overall SIS success rate of
~45%. This benchmark runs a fast-idling workload that makes the SIS
idle cpumask quite unstable, which is the worst case for this
feature. Even so, some improvement can still be seen in the results.
Conclusion
==========
The overhead of maintaining the idle cpumasks seems acceptable, and
this cost trades for considerable throughput improvement once the
LLC becomes busier and fewer idle cpus are available.
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
kernel/sched/fair.c | 74 +++++++++++++++++++++++++++++++++++++-------
kernel/sched/sched.h | 3 ++
2 files changed, 65 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3aa699e9d4af..d06d59ac2f05 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6309,6 +6309,16 @@ static void update_idle_cpu(int cpu)
}
}
+static inline bool should_scan_sibling(int cpu)
+{
+ return cmpxchg(&cpu_rq(cpu)->sis_scan_sibling, 1, 0);
+}
+
+static inline void set_scan_sibling(int cpu)
+{
+ WRITE_ONCE(cpu_rq(cpu)->sis_scan_sibling, 1);
+}
+
/*
* Scans the local SMT mask to see if the entire core is idle, and records this
* information in sd_llc_shared->has_idle_cores.
@@ -6384,17 +6394,20 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
/*
* Scan the local SMT mask for idle CPUs.
*/
-static int select_idle_smt(struct task_struct *p, int target)
+static int select_idle_smt(struct task_struct *p, int core, struct cpumask *cpus, int exclude)
{
int cpu;
- for_each_cpu_and(cpu, cpu_smt_mask(target), p->cpus_ptr) {
- if (cpu == target)
+ for_each_cpu_and(cpu, cpu_smt_mask(core), p->cpus_ptr) {
+ if (exclude && cpu == core)
continue;
if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
return cpu;
}
+ if (cpus)
+ cpumask_clear_cpu(core, cpus);
+
return -1;
}
@@ -6409,12 +6422,21 @@ static inline bool test_idle_cores(int cpu)
return false;
}
+static inline bool should_scan_sibling(int cpu)
+{
+ return false;
+}
+
+static inline void set_scan_sibling(int cpu)
+{
+}
+
static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
{
return __select_idle_cpu(core, p);
}
-static inline int select_idle_smt(struct task_struct *p, int target)
+static inline int select_idle_smt(struct task_struct *p, int core, struct cpumask *cpus, int exclude)
{
return -1;
}
@@ -6434,6 +6456,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
struct rq *this_rq = this_rq();
int this = smp_processor_id();
struct sched_domain *this_sd = NULL;
+ bool scan_sibling = false;
u64 time = 0;
if (sched_feat(SIS_PROP) && !has_idle_core) {
@@ -6479,20 +6502,31 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
}
}
- if (sched_feat(SIS_CORE) && sched_smt_active())
+ if (sched_feat(SIS_CORE) && sched_smt_active()) {
+ /*
+ * Due to the nature of idle core scanning, the has_idle_core
+ * hint should also consume the scan_sibling flag even
+ * though it doesn't use the flag when scanning.
+ */
+ scan_sibling = should_scan_sibling(target);
icpus = sched_domain_icpus(sd);
+ }
cpumask_and(cpus, icpus ? icpus : sched_domain_span(sd), p->cpus_ptr);
for_each_cpu_wrap(cpu, cpus, target + 1) {
+ if (!--nr)
+ break;
+
if (has_idle_core) {
i = select_idle_core(p, cpu, cpus, &idle_cpu);
if ((unsigned int)i < nr_cpumask_bits)
return i;
-
+ } else if (scan_sibling) {
+ idle_cpu = select_idle_smt(p, cpu, icpus, 0);
+ if ((unsigned int)idle_cpu < nr_cpumask_bits)
+ break;
} else {
- if (!--nr)
- return -1;
idle_cpu = __select_idle_cpu(cpu, p);
if ((unsigned int)idle_cpu < nr_cpumask_bits)
break;
@@ -6503,9 +6537,25 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
set_idle_cores(target, false);
if (icpus && idle_cpu == -1) {
- /* Reset the idle cpu mask if a full scan fails */
- if (nr > 0)
- cpumask_clear(icpus);
+ if (nr > 0 && (has_idle_core || scan_sibling)) {
+ /*
+ * Reset the idle cpu mask if a full scan fails,
+ * but ignore the !has_idle_core case which should
+ * have already been fixed during scan.
+ */
+ if (has_idle_core)
+ cpumask_clear(icpus);
+ } else {
+ /*
+ * As for partial scan failures, it will probably
+ * fail again next time scanning from the same cpu.
+ * Due to the SIS_CORE rule of CORE granule update,
+ * some idle cpus can be missed in the mask. So it
+ * would be reasonable to scan SMT siblings as well
+ * if the scan is fail-prone.
+ */
+ set_scan_sibling(target);
+ }
}
if (sched_feat(SIS_PROP) && this_sd && !has_idle_core) {
@@ -6657,7 +6707,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
has_idle_core = test_idle_cores(target);
if (!has_idle_core && cpus_share_cache(prev, target)) {
- i = select_idle_smt(p, prev);
+ i = select_idle_smt(p, prev, NULL, 1);
if ((unsigned int)i < nr_cpumask_bits)
return i;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1fc198be1ffd..c7f8ed5021e6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -971,6 +971,9 @@ struct rq {
#ifdef CONFIG_SMP
unsigned int ttwu_pending;
+#ifdef CONFIG_SCHED_SMT
+ int sis_scan_sibling;
+#endif
#endif
u64 nr_switches;
--
2.37.3
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v6 3/4] sched/fair: Introduce SIS_CORE
2022-10-19 12:28 ` [PATCH v6 3/4] sched/fair: Introduce SIS_CORE Abel Wu
@ 2022-10-21 4:03 ` Chen Yu
2022-10-21 4:30 ` Abel Wu
0 siblings, 1 reply; 18+ messages in thread
From: Chen Yu @ 2022-10-21 4:03 UTC (permalink / raw)
To: Abel Wu
Cc: Peter Zijlstra, Ingo Molnar, Mel Gorman, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider, Josh Don, Tim Chen,
K Prateek Nayak, Gautham R . Shenoy, Aubrey Li, Qais Yousef,
Juri Lelli, Rik van Riel, Yicong Yang, Barry Song, linux-kernel
On 2022-10-19 at 20:28:58 +0800, Abel Wu wrote:
[cut]
> A major concern is the accuracy of the idle cpumask. A cpu present
> in the mask might not be idle any more, which is called the false
> positive cpu. Such cpus will negate lots of benefit this feature
> brings. The strategy against the false positives will be introduced
> in next patch.
>
I was thinking that, if patch[3/4] needs [4/4] to fix the false positives,
maybe SIS_CORE could be disabled by default in 3/4 but enabled
in 4/4? So this might facilitate git bisect in case of any regression
check?
[cut]
> + * To honor the rule of CORE granule update, set this cpu to the LLC idle
> + * cpumask only if there is no cpu of this core showed up in the cpumask.
> + */
> +static void update_idle_cpu(int cpu)
> +{
> + struct sched_domain_shared *sds;
> +
> + if (!sched_feat(SIS_CORE))
> + return;
> +
> + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> + if (sds) {
> + struct cpumask *icpus = to_cpumask(sds->icpus);
> +
> + /*
> + * This is racy against clearing in select_idle_cpu(),
> + * and can lead to idle cpus miss the chance to be set to
> + * the idle cpumask, thus the idle cpus are temporarily
> + * out of reach in SIS domain scan. But it should be rare
> + * and we still have ILB to kick them working.
> + */
> + if (!cpumask_intersects(cpu_smt_mask(cpu), icpus))
> + cpumask_set_cpu(cpu, icpus);
Maybe I miss something, here we only set one CPU in the icpus, but
when we reach update_idle_cpu(), all SMT siblings of 'cpu' are idle,
is this intended for 'CORE granule update'?
thanks,
Chenyu
* Re: [PATCH v6 3/4] sched/fair: Introduce SIS_CORE
2022-10-21 4:03 ` Chen Yu
@ 2022-10-21 4:30 ` Abel Wu
2022-10-21 4:34 ` Chen Yu
0 siblings, 1 reply; 18+ messages in thread
From: Abel Wu @ 2022-10-21 4:30 UTC (permalink / raw)
To: Chen Yu
Cc: Peter Zijlstra, Ingo Molnar, Mel Gorman, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider, Josh Don, Tim Chen,
K Prateek Nayak, Gautham R . Shenoy, Aubrey Li, Qais Yousef,
Juri Lelli, Rik van Riel, Yicong Yang, Barry Song, linux-kernel
Hi Chen, thanks for your review!
On 10/21/22 12:03 PM, Chen Yu wrote:
> On 2022-10-19 at 20:28:58 +0800, Abel Wu wrote:
> [cut]
>> A major concern is the accuracy of the idle cpumask. A cpu present
>> in the mask might not be idle any more, which is called the false
>> positive cpu. Such cpus will negate lots of benefit this feature
>> brings. The strategy against the false positives will be introduced
>> in next patch.
>>
> I was thinking that, if patch[3/4] needs [4/4] to fix the false positives,
> maybe SIS_CORE could be disabled by default in 3/4 but enabled
> in 4/4? So this might facilitate git bisect in case of any regression
> check?
Agreed. Will fix in next version.
> [cut]
>> + * To honor the rule of CORE granule update, set this cpu to the LLC idle
>> + * cpumask only if there is no cpu of this core showed up in the cpumask.
>> + */
>> +static void update_idle_cpu(int cpu)
>> +{
>> + struct sched_domain_shared *sds;
>> +
>> + if (!sched_feat(SIS_CORE))
>> + return;
>> +
>> + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
>> + if (sds) {
>> + struct cpumask *icpus = to_cpumask(sds->icpus);
>> +
>> + /*
>> + * This is racy against clearing in select_idle_cpu(),
>> + * and can lead to idle cpus miss the chance to be set to
>> + * the idle cpumask, thus the idle cpus are temporarily
>> + * out of reach in SIS domain scan. But it should be rare
>> + * and we still have ILB to kick them working.
>> + */
>> + if (!cpumask_intersects(cpu_smt_mask(cpu), icpus))
>> + cpumask_set_cpu(cpu, icpus);
> Maybe I miss something, here we only set one CPU in the icpus, but
> when we reach update_idle_cpu(), all SMT siblings of 'cpu' are idle,
> is this intended for 'CORE granule update'?
__update_idle_core() is called by every cpu that is about to go idle
to update has_idle_core if necessary, and update_idle_cpu() is called
before that check.
Thanks,
Abel
* Re: [PATCH v6 3/4] sched/fair: Introduce SIS_CORE
2022-10-21 4:30 ` Abel Wu
@ 2022-10-21 4:34 ` Chen Yu
2022-10-21 9:35 ` Abel Wu
0 siblings, 1 reply; 18+ messages in thread
From: Chen Yu @ 2022-10-21 4:34 UTC (permalink / raw)
To: Abel Wu
Cc: Peter Zijlstra, Ingo Molnar, Mel Gorman, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider, Josh Don, Tim Chen,
K Prateek Nayak, Gautham R . Shenoy, Aubrey Li, Qais Yousef,
Juri Lelli, Rik van Riel, Yicong Yang, Barry Song, linux-kernel
On 2022-10-21 at 12:30:56 +0800, Abel Wu wrote:
> Hi Chen, thanks for your reviewing!
>
> On 10/21/22 12:03 PM, Chen Yu wrote:
> > On 2022-10-19 at 20:28:58 +0800, Abel Wu wrote:
> > [cut]
> > > A major concern is the accuracy of the idle cpumask. A cpu present
> > > in the mask might not be idle any more, which is called the false
> > > positive cpu. Such cpus will negate lots of benefit this feature
> > > brings. The strategy against the false positives will be introduced
> > > in next patch.
> > >
> > I was thinking that, if patch[3/4] needs [4/4] to fix the false positives,
> > maybe SIS_CORE could be disabled by default in 3/4 but enabled
> > in 4/4? So this might facilitate git bisect in case of any regression
> > check?
>
> Agreed. Will fix in next version.
>
> > [cut]
> > > + * To honor the rule of CORE granule update, set this cpu to the LLC idle
> > > + * cpumask only if there is no cpu of this core showed up in the cpumask.
> > > + */
> > > +static void update_idle_cpu(int cpu)
> > > +{
> > > + struct sched_domain_shared *sds;
> > > +
> > > + if (!sched_feat(SIS_CORE))
> > > + return;
> > > +
> > > + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > > + if (sds) {
> > > + struct cpumask *icpus = to_cpumask(sds->icpus);
> > > +
> > > + /*
> > > + * This is racy against clearing in select_idle_cpu(),
> > > + * and can lead to idle cpus miss the chance to be set to
> > > + * the idle cpumask, thus the idle cpus are temporarily
> > > + * out of reach in SIS domain scan. But it should be rare
> > > + * and we still have ILB to kick them working.
> > > + */
> > > + if (!cpumask_intersects(cpu_smt_mask(cpu), icpus))
> > > + cpumask_set_cpu(cpu, icpus);
> > Maybe I miss something, here we only set one CPU in the icpus, but
> > when we reach update_idle_cpu(), all SMT siblings of 'cpu' are idle,
> > is this intended for 'CORE granule update'?
>
> The __update_idle_core() is called by all the cpus that need to go idle
> to update has_idle_core if necessary, and update_idle_cpu() is called
> before that check.
>
I see.
Since __update_idle_core() has checked whether all SMT siblings of 'cpu'
are idle, can that information also be propagated to icpus?
thanks,
Chenyu
> Thanks,
> Abel
>
* Re: [PATCH v6 3/4] sched/fair: Introduce SIS_CORE
2022-10-21 4:34 ` Chen Yu
@ 2022-10-21 9:35 ` Abel Wu
2022-10-21 11:14 ` Chen Yu
0 siblings, 1 reply; 18+ messages in thread
From: Abel Wu @ 2022-10-21 9:35 UTC (permalink / raw)
To: Chen Yu
Cc: Peter Zijlstra, Ingo Molnar, Mel Gorman, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider, Josh Don, Tim Chen,
K Prateek Nayak, Gautham R . Shenoy, Aubrey Li, Qais Yousef,
Juri Lelli, Rik van Riel, Yicong Yang, Barry Song, linux-kernel
On 10/21/22 12:34 PM, Chen Yu wrote:
> On 2022-10-21 at 12:30:56 +0800, Abel Wu wrote:
>> Hi Chen, thanks for your reviewing!
>>
>> On 10/21/22 12:03 PM, Chen Yu wrote:
>>> On 2022-10-19 at 20:28:58 +0800, Abel Wu wrote:
>>> [cut]
>>>> A major concern is the accuracy of the idle cpumask. A cpu present
>>>> in the mask might not be idle any more, which is called the false
>>>> positive cpu. Such cpus will negate lots of benefit this feature
>>>> brings. The strategy against the false positives will be introduced
>>>> in next patch.
>>>>
>>> I was thinking that, if patch[3/4] needs [4/4] to fix the false positives,
>>> maybe SIS_CORE could be disabled by default in 3/4 but enabled
>>> in 4/4? So this might facilitate git bisect in case of any regression
>>> check?
>>
>> Agreed. Will fix in next version.
>>
>>> [cut]
>>>> + * To honor the rule of CORE granule update, set this cpu to the LLC idle
>>>> + * cpumask only if there is no cpu of this core showed up in the cpumask.
>>>> + */
>>>> +static void update_idle_cpu(int cpu)
>>>> +{
>>>> + struct sched_domain_shared *sds;
>>>> +
>>>> + if (!sched_feat(SIS_CORE))
>>>> + return;
>>>> +
>>>> + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
>>>> + if (sds) {
>>>> + struct cpumask *icpus = to_cpumask(sds->icpus);
>>>> +
>>>> + /*
>>>> + * This is racy against clearing in select_idle_cpu(),
>>>> + * and can lead to idle cpus miss the chance to be set to
>>>> + * the idle cpumask, thus the idle cpus are temporarily
>>>> + * out of reach in SIS domain scan. But it should be rare
>>>> + * and we still have ILB to kick them working.
>>>> + */
>>>> + if (!cpumask_intersects(cpu_smt_mask(cpu), icpus))
>>>> + cpumask_set_cpu(cpu, icpus);
>>> Maybe I miss something, here we only set one CPU in the icpus, but
>>> when we reach update_idle_cpu(), all SMT siblings of 'cpu' are idle,
>>> is this intended for 'CORE granule update'?
>>
>> The __update_idle_core() is called by all the cpus that need to go idle
>> to update has_idle_core if necessary, and update_idle_cpu() is called
>> before that check.
>>
> I see.
>
> Since __update_idle_core() has checked all SMT siblings of 'cpu' if
> they are idle, can that information also be updated to icpus?
I think this will simply fall back to the original per-cpu proposal and
lose the opportunity to spread tasks to different cores.
* Re: [PATCH v6 3/4] sched/fair: Introduce SIS_CORE
2022-10-21 9:35 ` Abel Wu
@ 2022-10-21 11:14 ` Chen Yu
0 siblings, 0 replies; 18+ messages in thread
From: Chen Yu @ 2022-10-21 11:14 UTC (permalink / raw)
To: Abel Wu
Cc: Peter Zijlstra, Ingo Molnar, Mel Gorman, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider, Josh Don, Tim Chen,
K Prateek Nayak, Gautham R . Shenoy, Aubrey Li, Qais Yousef,
Juri Lelli, Rik van Riel, Yicong Yang, Barry Song, linux-kernel
On 2022-10-21 at 17:35:06 +0800, Abel Wu wrote:
> On 10/21/22 12:34 PM, Chen Yu wrote:
> > On 2022-10-21 at 12:30:56 +0800, Abel Wu wrote:
> > > Hi Chen, thanks for your reviewing!
> > >
> > > On 10/21/22 12:03 PM, Chen Yu wrote:
> > > > On 2022-10-19 at 20:28:58 +0800, Abel Wu wrote:
> > > > [cut]
> > > > > A major concern is the accuracy of the idle cpumask. A cpu present
> > > > > in the mask might not be idle any more, which is called the false
> > > > > positive cpu. Such cpus will negate lots of benefit this feature
> > > > > brings. The strategy against the false positives will be introduced
> > > > > in next patch.
> > > > >
> > > > I was thinking that, if patch[3/4] needs [4/4] to fix the false positives,
> > > > maybe SIS_CORE could be disabled by default in 3/4 but enabled
> > > > in 4/4? So this might facilitate git bisect in case of any regression
> > > > check?
> > >
> > > Agreed. Will fix in next version.
> > >
> > > > [cut]
> > > > > + * To honor the rule of CORE granule update, set this cpu to the LLC idle
> > > > > + * cpumask only if there is no cpu of this core showed up in the cpumask.
> > > > > + */
> > > > > +static void update_idle_cpu(int cpu)
> > > > > +{
> > > > > + struct sched_domain_shared *sds;
> > > > > +
> > > > > + if (!sched_feat(SIS_CORE))
> > > > > + return;
> > > > > +
> > > > > + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > > > > + if (sds) {
> > > > > + struct cpumask *icpus = to_cpumask(sds->icpus);
> > > > > +
> > > > > + /*
> > > > > + * This is racy against clearing in select_idle_cpu(),
> > > > > + * and can lead to idle cpus miss the chance to be set to
> > > > > + * the idle cpumask, thus the idle cpus are temporarily
> > > > > + * out of reach in SIS domain scan. But it should be rare
> > > > > + * and we still have ILB to kick them working.
> > > > > + */
> > > > > + if (!cpumask_intersects(cpu_smt_mask(cpu), icpus))
> > > > > + cpumask_set_cpu(cpu, icpus);
> > > > Maybe I miss something, here we only set one CPU in the icpus, but
> > > > when we reach update_idle_cpu(), all SMT siblings of 'cpu' are idle,
> > > > is this intended for 'CORE granule update'?
> > >
> > > The __update_idle_core() is called by all the cpus that need to go idle
> > > to update has_idle_core if necessary, and update_idle_cpu() is called
> > > before that check.
> > >
> > I see.
> >
> > Since __update_idle_core() has checked all SMT siblings of 'cpu' if
> > they are idle, can that information also be updated to icpus?
>
> I think this will simply fall back to the original per-cpu proposal and
> lose the opportunity to spread tasks to different cores.
OK, make sense.
thanks,
Chenyu
* Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS
2022-10-19 12:28 [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS Abel Wu
` (3 preceding siblings ...)
2022-10-19 12:28 ` [PATCH v6 4/4] sched/fair: Deal with SIS scan failures Abel Wu
@ 2022-11-04 7:29 ` Abel Wu
2022-11-14 5:45 ` K Prateek Nayak
2023-02-07 3:42 ` K Prateek Nayak
6 siblings, 0 replies; 18+ messages in thread
From: Abel Wu @ 2022-11-04 7:29 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Mel Gorman, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider
Cc: Josh Don, Chen Yu, Tim Chen, K Prateek Nayak, Gautham R . Shenoy,
Aubrey Li, Qais Yousef, Juri Lelli, Rik van Riel, Yicong Yang,
Barry Song, linux-kernel
Ping :)
On 10/19/22 8:28 PM, Abel Wu wrote:
> This patchset tries to improve SIS scan efficiency by recording idle
> cpus in a cpumask for each LLC which will be used as a target cpuset
> in the domain scan. The cpus are recorded at CORE granule to avoid
> tasks being stacked on the same core.
>
> v5 -> v6:
> - Rename SIS_FILTER to SIS_CORE as it can only be activated when
> SMT is enabled and better describes the behavior of CORE granule
> update & load delivery.
> - Removed the part of limited scan for idle cores since it might be
> better to open another thread to discuss the strategies such as
> limited or scaled depth. But keep the part of full scan for idle
> cores when LLC is overloaded because SIS_CORE can greatly reduce
> the overhead of full scan in such case.
> - Removed the state of sd_is_busy which indicates an LLC is fully
> busy and we can safely skip the SIS domain scan. I would prefer
> leave this to SIS_UTIL.
> - The filter generation mechanism is replaced by in-place updates
> during domain scan to better deal with partial scan failures.
> - Collect Reviewed-bys from Tim Chen
>
> ...
>
* Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS
2022-10-19 12:28 [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS Abel Wu
` (4 preceding siblings ...)
2022-11-04 7:29 ` [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS Abel Wu
@ 2022-11-14 5:45 ` K Prateek Nayak
2022-11-15 8:31 ` Abel Wu
2023-02-07 3:42 ` K Prateek Nayak
6 siblings, 1 reply; 18+ messages in thread
From: K Prateek Nayak @ 2022-11-14 5:45 UTC (permalink / raw)
To: Abel Wu, Peter Zijlstra, Ingo Molnar, Mel Gorman,
Vincent Guittot, Dietmar Eggemann, Valentin Schneider
Cc: Josh Don, Chen Yu, Tim Chen, Gautham R . Shenoy, Aubrey Li,
Qais Yousef, Juri Lelli, Rik van Riel, Yicong Yang, Barry Song,
linux-kernel
Hello Abel,
Sorry for the delay. I've tested the patch on a dual socket Zen3 system
(2 x 64C/128T)
tl;dr
o I do not notice any regressions with the standard benchmarks.
o schbench sees a nice improvement in tail latency when the number
of workers is equal to the number of cores in the system in NPS1 and
NPS2 mode. (Marked with "^")
o Few data points show improvements in tbench in NPS1 and NPS2 mode.
(Marked with "^")
I'm still in the process of running larger workloads. If there is any
specific workload you would like me to run on the test system, please
do let me know. Below is the detailed report:
Following are the results from running standard benchmarks on a
dual socket Zen3 (2 x 64C/128T) machine configured in different
NPS modes.
NPS Modes are used to logically divide single socket into
multiple NUMA region.
Following is the NUMA configuration for each NPS mode on the system:
NPS1: Each socket is a NUMA node.
Total 2 NUMA nodes in the dual socket machine.
Node 0: 0-63, 128-191
Node 1: 64-127, 192-255
NPS2: Each socket is further logically divided into 2 NUMA regions.
Total 4 NUMA nodes exist over 2 socket.
Node 0: 0-31, 128-159
Node 1: 32-63, 160-191
Node 2: 64-95, 192-223
Node 3: 96-127, 224-255
NPS4: Each socket is logically divided into 4 NUMA regions.
Total 8 NUMA nodes exist over 2 socket.
Node 0: 0-15, 128-143
Node 1: 16-31, 144-159
Node 2: 32-47, 160-175
Node 3: 48-63, 176-191
Node 4: 64-79, 192-207
Node 5: 80-95, 208-223
Node 6: 96-111, 224-239
Node 7: 112-127, 240-255
Benchmark Results:
Kernel versions:
- tip: 5.19.0 tip sched/core
- sis_core: 5.19.0 tip sched/core + this series
When we started testing, the tip was at:
commit fdf756f71271 ("sched: Fix more TASK_state comparisons")
~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~
o NPS1
Test: tip sis_core
1-groups: 4.06 (0.00 pct) 4.26 (-4.92 pct) *
1-groups: 4.14 (0.00 pct) 4.09 (1.20 pct) [Verification Run]
2-groups: 4.76 (0.00 pct) 4.71 (1.05 pct)
4-groups: 5.22 (0.00 pct) 5.11 (2.10 pct)
8-groups: 5.35 (0.00 pct) 5.31 (0.74 pct)
16-groups: 7.21 (0.00 pct) 6.80 (5.68 pct)
o NPS2
Test: tip sis_core
1-groups: 4.09 (0.00 pct) 4.08 (0.24 pct)
2-groups: 4.70 (0.00 pct) 4.69 (0.21 pct)
4-groups: 5.05 (0.00 pct) 4.92 (2.57 pct)
8-groups: 5.35 (0.00 pct) 5.26 (1.68 pct)
16-groups: 6.37 (0.00 pct) 6.34 (0.47 pct)
o NPS4
Test: tip sis_core
1-groups: 4.07 (0.00 pct) 3.99 (1.96 pct)
2-groups: 4.65 (0.00 pct) 4.59 (1.29 pct)
4-groups: 5.13 (0.00 pct) 5.00 (2.53 pct)
8-groups: 5.47 (0.00 pct) 5.43 (0.73 pct)
16-groups: 6.82 (0.00 pct) 6.56 (3.81 pct)
~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~
o NPS1
#workers: tip sis_core
1: 33.00 (0.00 pct) 33.00 (0.00 pct)
2: 35.00 (0.00 pct) 35.00 (0.00 pct)
4: 39.00 (0.00 pct) 38.00 (2.56 pct)
8: 49.00 (0.00 pct) 48.00 (2.04 pct)
16: 63.00 (0.00 pct) 66.00 (-4.76 pct)
32: 109.00 (0.00 pct) 107.00 (1.83 pct)
64: 208.00 (0.00 pct) 216.00 (-3.84 pct)
128: 559.00 (0.00 pct) 469.00 (16.10 pct) ^
256: 45888.00 (0.00 pct) 47552.00 (-3.62 pct)
512: 80000.00 (0.00 pct) 79744.00 (0.32 pct)
o NPS2
#workers: tip sis_core
1: 30.00 (0.00 pct) 32.00 (-6.66 pct)
2: 37.00 (0.00 pct) 34.00 (8.10 pct)
4: 39.00 (0.00 pct) 36.00 (7.69 pct)
8: 51.00 (0.00 pct) 49.00 (3.92 pct)
16: 67.00 (0.00 pct) 66.00 (1.49 pct)
32: 117.00 (0.00 pct) 109.00 (6.83 pct)
64: 216.00 (0.00 pct) 213.00 (1.38 pct)
128: 529.00 (0.00 pct) 465.00 (12.09 pct) ^
256: 47040.00 (0.00 pct) 46528.00 (1.08 pct)
512: 84864.00 (0.00 pct) 83584.00 (1.50 pct)
o NPS4
#workers: tip sis_core
1: 23.00 (0.00 pct) 28.00 (-21.73 pct)
2: 28.00 (0.00 pct) 36.00 (-28.57 pct)
4: 41.00 (0.00 pct) 43.00 (-4.87 pct)
8: 60.00 (0.00 pct) 48.00 (20.00 pct)
16: 71.00 (0.00 pct) 69.00 (2.81 pct)
32: 117.00 (0.00 pct) 115.00 (1.70 pct)
64: 227.00 (0.00 pct) 228.00 (-0.44 pct)
128: 545.00 (0.00 pct) 545.00 (0.00 pct)
256: 45632.00 (0.00 pct) 47680.00 (-4.48 pct)
512: 81024.00 (0.00 pct) 76416.00 (5.68 pct)
Note: For lower worker counts, schbench can show run-to-run
variation depending on external factors. Regression
for lower worker count can be ignored. The results are
included to spot any large blow up in the tail latency
for larger worker count.
~~~~~~~~~~
~ tbench ~
~~~~~~~~~~
o NPS1
Clients: tip sis_core
1 578.37 (0.00 pct) 582.09 (0.64 pct)
2 1062.09 (0.00 pct) 1063.95 (0.17 pct)
4 1800.62 (0.00 pct) 1879.18 (4.36 pct)
8 3211.02 (0.00 pct) 3220.44 (0.29 pct)
16 4848.92 (0.00 pct) 4890.08 (0.84 pct)
32 9091.36 (0.00 pct) 9721.13 (6.92 pct) ^
64 15454.01 (0.00 pct) 15124.42 (-2.13 pct)
128 3511.33 (0.00 pct) 14314.79 (307.67 pct)
128 19910.99 (0.00pct) 19935.61 (0.12 pct) [Verification Run]
256 50019.32 (0.00 pct) 50708.24 (1.37 pct)
512 44317.68 (0.00 pct) 44787.48 (1.06 pct)
1024 41200.85 (0.00 pct) 42079.29 (2.13 pct)
o NPS2
Clients: tip sis_core
1 576.05 (0.00 pct) 579.18 (0.54 pct)
2 1037.68 (0.00 pct) 1070.49 (3.16 pct)
4 1818.13 (0.00 pct) 1860.22 (2.31 pct)
8 3004.16 (0.00 pct) 3087.09 (2.76 pct)
16 4520.11 (0.00 pct) 4789.53 (5.96 pct)
32 8624.23 (0.00 pct) 9439.50 (9.45 pct) ^
64 14886.75 (0.00 pct) 15004.96 (0.79 pct)
128 20602.00 (0.00 pct) 17730.31 (-13.93 pct) *
128 20602.00 (0.00 pct) 19585.20 (-4.93 pct) [Verification Run]
256 45566.83 (0.00 pct) 47922.70 (5.17 pct)
512 42717.49 (0.00 pct) 43809.68 (2.55 pct)
1024 40936.61 (0.00 pct) 40787.71 (-0.36 pct)
o NPS4
Clients: tip sis_core
1 576.36 (0.00 pct) 580.83 (0.77 pct)
2 1044.26 (0.00 pct) 1066.50 (2.12 pct)
4 1839.77 (0.00 pct) 1867.56 (1.51 pct)
8 3043.53 (0.00 pct) 3115.17 (2.35 pct)
16 5207.54 (0.00 pct) 4847.53 (-6.91 pct) *
16 4722.56 (0.00 pct) 4811.29 (1.87 pct) [Verification Run]
32 9263.86 (0.00 pct) 9478.68 (2.31 pct)
64 14959.66 (0.00 pct) 15267.39 (2.05 pct)
128 20698.65 (0.00 pct) 20432.19 (-1.28 pct)
256 46666.21 (0.00 pct) 46664.81 (0.00 pct)
512 41532.80 (0.00 pct) 44241.12 (6.52 pct)
1024 39459.49 (0.00 pct) 41043.22 (4.01 pct)
Note: On the tested kernel, with 128 clients, tbench can
run into a bottleneck during C2 exit. More details can be
found at:
https://lore.kernel.org/lkml/20220921063638.2489-1-kprateek.nayak@amd.com/
This issue has been fixed in v6.0 but was not part of the
tip kernel when I started testing. This data point has
been rerun with C2 disabled to get representative results.
~~~~~~~~~~
~ Stream ~
~~~~~~~~~~
o NPS1
-> 10 Runs:
Test: tip sis_core
Copy: 328419.14 (0.00 pct) 337857.83 (2.87 pct)
Scale: 206071.21 (0.00 pct) 212133.82 (2.94 pct)
Add: 235271.48 (0.00 pct) 243811.97 (3.63 pct)
Triad: 253175.80 (0.00 pct) 252333.43 (-0.33 pct)
-> 100 Runs:
Test: tip sis_core
Copy: 328209.61 (0.00 pct) 339817.27 (3.53 pct)
Scale: 216310.13 (0.00 pct) 218635.16 (1.07 pct)
Add: 244417.83 (0.00 pct) 245641.47 (0.50 pct)
Triad: 237508.83 (0.00 pct) 255387.28 (7.52 pct)
o NPS2
-> 10 Runs:
Test: tip sis_core
Copy: 336503.88 (0.00 pct) 339684.21 (0.94 pct)
Scale: 218035.23 (0.00 pct) 217601.11 (-0.19 pct)
Add: 257677.42 (0.00 pct) 258608.34 (0.36 pct)
Triad: 268872.37 (0.00 pct) 272548.09 (1.36 pct)
-> 100 Runs:
Test: tip sis_core
Copy: 332304.34 (0.00 pct) 341565.75 (2.78 pct)
Scale: 223421.60 (0.00 pct) 224267.40 (0.37 pct)
Add: 252363.56 (0.00 pct) 254926.98 (1.01 pct)
Triad: 266687.56 (0.00 pct) 270782.81 (1.53 pct)
o NPS4
-> 10 Runs:
Test: tip sis_core
Copy: 353515.62 (0.00 pct) 342060.85 (-3.24 pct)
Scale: 228854.37 (0.00 pct) 218262.41 (-4.62 pct)
Add: 254942.12 (0.00 pct) 241975.90 (-5.08 pct)
Triad: 270521.87 (0.00 pct) 257686.71 (-4.74 pct)
-> 100 Runs:
Test: tip sis_core
Copy: 374520.81 (0.00 pct) 369353.13 (-1.37 pct)
Scale: 246280.23 (0.00 pct) 253881.69 (3.08 pct)
Add: 262772.72 (0.00 pct) 266484.58 (1.41 pct)
Triad: 283740.92 (0.00 pct) 279981.18 (-1.32 pct)
On 10/19/2022 5:58 PM, Abel Wu wrote:
> This patchset tries to improve SIS scan efficiency by recording idle
> cpus in a cpumask for each LLC which will be used as a target cpuset
> in the domain scan. The cpus are recorded at CORE granule to avoid
> tasks being stacked on the same core.
>
> v5 -> v6:
> - Rename SIS_FILTER to SIS_CORE as it can only be activated when
> SMT is enabled and better describes the behavior of CORE granule
> update & load delivery.
> - Removed the part of limited scan for idle cores since it might be
> better to open another thread to discuss the strategies such as
> limited or scaled depth. But keep the part of full scan for idle
> cores when LLC is overloaded because SIS_CORE can greatly reduce
> the overhead of full scan in such case.
> - Removed the state of sd_is_busy which indicates an LLC is fully
> busy and we can safely skip the SIS domain scan. I would prefer
> leave this to SIS_UTIL.
> - The filter generation mechanism is replaced by in-place updates
> during domain scan to better deal with partial scan failures.
> - Collect Reviewed-bys from Tim Chen
>
> v4 -> v5:
> - Add limited scan for idle cores when overloaded, suggested by Mel
> - Split out several patches since they are irrelevant to this scope
> - Add quick check on ttwu_pending before core update
> - Wrap the filter into SIS_FILTER feature, suggested by Chen Yu
> - Move the main filter logic to the idle path, because the newidle
> balance can bail out early if rq->avg_idle is small enough and
> lose chances to update the filter.
>
> v3 -> v4:
> - Update filter in load_balance rather than in the tick
> - Now the filter contains unoccupied cpus rather than overloaded ones
> - Added mechanisms to deal with the false positive cases
>
> v2 -> v3:
> - Removed sched-idle balance feature and focus on SIS
> - Take non-CFS tasks into consideration
> - Several fixes/improvement suggested by Josh Don
>
> v1 -> v2:
> - Several optimizations on sched-idle balancing
> - Ignore asym topos in can_migrate_task
> - Add more benchmarks including SIS efficiency
> - Re-organize patch as suggested by Mel Gorman
>
> Abel Wu (4):
> sched/fair: Skip core update if task pending
> sched/fair: Ignore SIS_UTIL when has_idle_core
> sched/fair: Introduce SIS_CORE
> sched/fair: Deal with SIS scan failures
>
> include/linux/sched/topology.h | 15 ++++
> kernel/sched/fair.c | 122 +++++++++++++++++++++++++++++----
> kernel/sched/features.h | 7 ++
> kernel/sched/sched.h | 3 +
> kernel/sched/topology.c | 8 ++-
> 5 files changed, 141 insertions(+), 14 deletions(-)
>
I ran pgbench from mmtests but realised there is too much run-to-run
variation on the system. Planning on running MongoDB benchmark which
is more stable on the system and couple more workloads but the
initial results look good. I'll get back with results later this week
or by early next week. Meanwhile, if you need data for any specific
workload on the test system, please do let me know.
--
Thanks and Regards,
Prateek
* Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS
2022-11-14 5:45 ` K Prateek Nayak
@ 2022-11-15 8:31 ` Abel Wu
2022-11-15 11:28 ` K Prateek Nayak
0 siblings, 1 reply; 18+ messages in thread
From: Abel Wu @ 2022-11-15 8:31 UTC (permalink / raw)
To: K Prateek Nayak, Peter Zijlstra, Ingo Molnar, Mel Gorman,
Vincent Guittot, Dietmar Eggemann, Valentin Schneider
Cc: Josh Don, Chen Yu, Tim Chen, Gautham R . Shenoy, Aubrey Li,
Qais Yousef, Juri Lelli, Rik van Riel, Yicong Yang, Barry Song,
linux-kernel
Hi Prateek, thanks very much for your detailed testing!
On 11/14/22 1:45 PM, K Prateek Nayak wrote:
> Hello Abel,
>
> Sorry for the delay. I've tested the patch on a dual socket Zen3 system
> (2 x 64C/128T)
>
> tl;dr
>
> o I do not notice any regressions with the standard benchmarks.
> o schbench sees a nice improvement to the tail latency when the number
> of worker are equal to the number of cores in the system in NPS1 and
> NPS2 mode. (Marked with "^")
> o Few data points show improvements in tbench in NPS1 and NPS2 mode.
> (Marked with "^")
>
> I'm still in the process of running larger workloads. If there is any
> specific workload you would like me to run on the test system, please
> do let me know. Below is the detailed report:
Nothing particular comes to mind, and I think testing larger workloads
is great. Thanks!
>
> Following are the results from running standard benchmarks on a
> dual socket Zen3 (2 x 64C/128T) machine configured in different
> NPS modes.
>
> NPS Modes are used to logically divide single socket into
> multiple NUMA region.
> Following is the NUMA configuration for each NPS mode on the system:
>
> NPS1: Each socket is a NUMA node.
> Total 2 NUMA nodes in the dual socket machine.
>
> Node 0: 0-63, 128-191
> Node 1: 64-127, 192-255
>
> NPS2: Each socket is further logically divided into 2 NUMA regions.
> Total 4 NUMA nodes exist over 2 socket.
>
> Node 0: 0-31, 128-159
> Node 1: 32-63, 160-191
> Node 2: 64-95, 192-223
> Node 3: 96-127, 224-255
>
> NPS4: Each socket is logically divided into 4 NUMA regions.
> Total 8 NUMA nodes exist over 2 socket.
>
> Node 0: 0-15, 128-143
> Node 1: 16-31, 144-159
> Node 2: 32-47, 160-175
> Node 3: 48-63, 176-191
> Node 4: 64-79, 192-207
> Node 5: 80-95, 208-223
> Node 6: 96-111, 224-239
> Node 7: 112-127, 240-255
>
> Benchmark Results:
>
> Kernel versions:
> - tip: 5.19.0 tip sched/core
> - sis_core: 5.19.0 tip sched/core + this series
>
> When we started testing, the tip was at:
> commit fdf756f71271 ("sched: Fix more TASK_state comparisons")
>
> ~~~~~~~~~~~~~
> ~ hackbench ~
> ~~~~~~~~~~~~~
>
> o NPS1
>
> Test: tip sis_core
> 1-groups: 4.06 (0.00 pct) 4.26 (-4.92 pct) *
> 1-groups: 4.14 (0.00 pct) 4.09 (1.20 pct) [Verification Run]
> 2-groups: 4.76 (0.00 pct) 4.71 (1.05 pct)
> 4-groups: 5.22 (0.00 pct) 5.11 (2.10 pct)
> 8-groups: 5.35 (0.00 pct) 5.31 (0.74 pct)
> 16-groups: 7.21 (0.00 pct) 6.80 (5.68 pct)
>
> o NPS2
>
> Test: tip sis_core
> 1-groups: 4.09 (0.00 pct) 4.08 (0.24 pct)
> 2-groups: 4.70 (0.00 pct) 4.69 (0.21 pct)
> 4-groups: 5.05 (0.00 pct) 4.92 (2.57 pct)
> 8-groups: 5.35 (0.00 pct) 5.26 (1.68 pct)
> 16-groups: 6.37 (0.00 pct) 6.34 (0.47 pct)
>
> o NPS4
>
> Test: tip sis_core
> 1-groups: 4.07 (0.00 pct) 3.99 (1.96 pct)
> 2-groups: 4.65 (0.00 pct) 4.59 (1.29 pct)
> 4-groups: 5.13 (0.00 pct) 5.00 (2.53 pct)
> 8-groups: 5.47 (0.00 pct) 5.43 (0.73 pct)
> 16-groups: 6.82 (0.00 pct) 6.56 (3.81 pct)
Although each cpu will get ~2.5 tasks with 16 groups, which can
be considered overloaded, I tested on an AMD EPYC 7Y83 machine and
the total cpu usage was ~82% (with some older kernel version),
so there is still lots of idle time.
I guess cutting off at 16 groups is because it's loaded enough
compared to real workloads, so testing more groups might just
be a waste of time?
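For reference, the ~2.5-tasks-per-cpu figure above follows directly from
hackbench's group size, assuming the usual default of 20 sender plus 20
receiver tasks per group (40 tasks); a quick sketch:

```python
# Back-of-the-envelope load estimate for hackbench / perf bench sched
# messaging. Assumes the default group size of 20 senders + 20 receivers
# (40 tasks per group); adjust if the benchmark was run with other options.
def tasks_per_cpu(groups, cpus, tasks_per_group=40):
    return groups * tasks_per_group / cpus

# Dual socket Zen3: 2 x 64C/128T = 256 CPUs
print(tasks_per_cpu(16, 256))   # 2.5 tasks per cpu at 16 groups
print(tasks_per_cpu(512, 256))  # 80.0 tasks per cpu at 512 groups
```

So even 16 groups nominally oversubscribes every cpu, yet the observed
~82% utilization shows the run/sleep pattern still leaves idle time.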
Thanks & Best Regards,
Abel
>
> ~~~~~~~~~~~~
> ~ schbench ~
> ~~~~~~~~~~~~
>
> o NPS1
>
> #workers: tip sis_core
> 1: 33.00 (0.00 pct) 33.00 (0.00 pct)
> 2: 35.00 (0.00 pct) 35.00 (0.00 pct)
> 4: 39.00 (0.00 pct) 38.00 (2.56 pct)
> 8: 49.00 (0.00 pct) 48.00 (2.04 pct)
> 16: 63.00 (0.00 pct) 66.00 (-4.76 pct)
> 32: 109.00 (0.00 pct) 107.00 (1.83 pct)
> 64: 208.00 (0.00 pct) 216.00 (-3.84 pct)
> 128: 559.00 (0.00 pct) 469.00 (16.10 pct) ^
> 256: 45888.00 (0.00 pct) 47552.00 (-3.62 pct)
> 512: 80000.00 (0.00 pct) 79744.00 (0.32 pct)
>
> o NPS2
>
> #workers: tip sis_core
> 1: 30.00 (0.00 pct) 32.00 (-6.66 pct)
> 2: 37.00 (0.00 pct) 34.00 (8.10 pct)
> 4: 39.00 (0.00 pct) 36.00 (7.69 pct)
> 8: 51.00 (0.00 pct) 49.00 (3.92 pct)
> 16: 67.00 (0.00 pct) 66.00 (1.49 pct)
> 32: 117.00 (0.00 pct) 109.00 (6.83 pct)
> 64: 216.00 (0.00 pct) 213.00 (1.38 pct)
> 128: 529.00 (0.00 pct) 465.00 (12.09 pct) ^
> 256: 47040.00 (0.00 pct) 46528.00 (1.08 pct)
> 512: 84864.00 (0.00 pct) 83584.00 (1.50 pct)
>
> o NPS4
>
> #workers: tip sis_core
> 1: 23.00 (0.00 pct) 28.00 (-21.73 pct)
> 2: 28.00 (0.00 pct) 36.00 (-28.57 pct)
> 4: 41.00 (0.00 pct) 43.00 (-4.87 pct)
> 8: 60.00 (0.00 pct) 48.00 (20.00 pct)
> 16: 71.00 (0.00 pct) 69.00 (2.81 pct)
> 32: 117.00 (0.00 pct) 115.00 (1.70 pct)
> 64: 227.00 (0.00 pct) 228.00 (-0.44 pct)
> 128: 545.00 (0.00 pct) 545.00 (0.00 pct)
> 256: 45632.00 (0.00 pct) 47680.00 (-4.48 pct)
> 512: 81024.00 (0.00 pct) 76416.00 (5.68 pct)
>
> Note: For lower worker counts, schbench can show run-to-run
> variation depending on external factors. Regressions for
> lower worker counts can be ignored. The results are
> included to spot any large blow up in the tail latency
> at larger worker counts.
>
> ~~~~~~~~~~
> ~ tbench ~
> ~~~~~~~~~~
>
> o NPS1
>
> Clients: tip sis_core
> 1 578.37 (0.00 pct) 582.09 (0.64 pct)
> 2 1062.09 (0.00 pct) 1063.95 (0.17 pct)
> 4 1800.62 (0.00 pct) 1879.18 (4.36 pct)
> 8 3211.02 (0.00 pct) 3220.44 (0.29 pct)
> 16 4848.92 (0.00 pct) 4890.08 (0.84 pct)
> 32 9091.36 (0.00 pct) 9721.13 (6.92 pct) ^
> 64 15454.01 (0.00 pct) 15124.42 (-2.13 pct)
> 128 3511.33 (0.00 pct) 14314.79 (307.67 pct)
> 128 19910.99 (0.00pct) 19935.61 (0.12 pct) [Verification Run]
> 256 50019.32 (0.00 pct) 50708.24 (1.37 pct)
> 512 44317.68 (0.00 pct) 44787.48 (1.06 pct)
> 1024 41200.85 (0.00 pct) 42079.29 (2.13 pct)
>
> o NPS2
>
> Clients: tip sis_core
> 1 576.05 (0.00 pct) 579.18 (0.54 pct)
> 2 1037.68 (0.00 pct) 1070.49 (3.16 pct)
> 4 1818.13 (0.00 pct) 1860.22 (2.31 pct)
> 8 3004.16 (0.00 pct) 3087.09 (2.76 pct)
> 16 4520.11 (0.00 pct) 4789.53 (5.96 pct)
> 32 8624.23 (0.00 pct) 9439.50 (9.45 pct) ^
> 64 14886.75 (0.00 pct) 15004.96 (0.79 pct)
> 128 20602.00 (0.00 pct) 17730.31 (-13.93 pct) *
> 128 20602.00 (0.00 pct) 19585.20 (-4.93 pct) [Verification Run]
> 256 45566.83 (0.00 pct) 47922.70 (5.17 pct)
> 512 42717.49 (0.00 pct) 43809.68 (2.55 pct)
> 1024 40936.61 (0.00 pct) 40787.71 (-0.36 pct)
>
> o NPS4
>
> Clients: tip sis_core
> 1 576.36 (0.00 pct) 580.83 (0.77 pct)
> 2 1044.26 (0.00 pct) 1066.50 (2.12 pct)
> 4 1839.77 (0.00 pct) 1867.56 (1.51 pct)
> 8 3043.53 (0.00 pct) 3115.17 (2.35 pct)
> 16 5207.54 (0.00 pct) 4847.53 (-6.91 pct) *
> 16 4722.56 (0.00 pct) 4811.29 (1.87 pct) [Verification Run]
> 32 9263.86 (0.00 pct) 9478.68 (2.31 pct)
> 64 14959.66 (0.00 pct) 15267.39 (2.05 pct)
> 128 20698.65 (0.00 pct) 20432.19 (-1.28 pct)
> 256 46666.21 (0.00 pct) 46664.81 (0.00 pct)
> 512 41532.80 (0.00 pct) 44241.12 (6.52 pct)
> 1024 39459.49 (0.00 pct) 41043.22 (4.01 pct)
>
> Note: On the tested kernel, with 128 clients, tbench can
> run into a bottleneck during C2 exit. More details can be
> found at:
> https://lore.kernel.org/lkml/20220921063638.2489-1-kprateek.nayak@amd.com/
> This issue has been fixed in v6.0 but was not part of the
> tip kernel when I started testing. This data point has
> been rerun with C2 disabled to get representative results.
>
> ~~~~~~~~~~
> ~ Stream ~
> ~~~~~~~~~~
>
> o NPS1
>
> -> 10 Runs:
>
> Test: tip sis_core
> Copy: 328419.14 (0.00 pct) 337857.83 (2.87 pct)
> Scale: 206071.21 (0.00 pct) 212133.82 (2.94 pct)
> Add: 235271.48 (0.00 pct) 243811.97 (3.63 pct)
> Triad: 253175.80 (0.00 pct) 252333.43 (-0.33 pct)
>
> -> 100 Runs:
>
> Test: tip sis_core
> Copy: 328209.61 (0.00 pct) 339817.27 (3.53 pct)
> Scale: 216310.13 (0.00 pct) 218635.16 (1.07 pct)
> Add: 244417.83 (0.00 pct) 245641.47 (0.50 pct)
> Triad: 237508.83 (0.00 pct) 255387.28 (7.52 pct)
>
> o NPS2
>
> -> 10 Runs:
>
> Test: tip sis_core
> Copy: 336503.88 (0.00 pct) 339684.21 (0.94 pct)
> Scale: 218035.23 (0.00 pct) 217601.11 (-0.19 pct)
> Add: 257677.42 (0.00 pct) 258608.34 (0.36 pct)
> Triad: 268872.37 (0.00 pct) 272548.09 (1.36 pct)
>
> -> 100 Runs:
>
> Test: tip sis_core
> Copy: 332304.34 (0.00 pct) 341565.75 (2.78 pct)
> Scale: 223421.60 (0.00 pct) 224267.40 (0.37 pct)
> Add: 252363.56 (0.00 pct) 254926.98 (1.01 pct)
> Triad: 266687.56 (0.00 pct) 270782.81 (1.53 pct)
>
> o NPS4
>
> -> 10 Runs:
>
> Test: tip sis_core
> Copy: 353515.62 (0.00 pct) 342060.85 (-3.24 pct)
> Scale: 228854.37 (0.00 pct) 218262.41 (-4.62 pct)
> Add: 254942.12 (0.00 pct) 241975.90 (-5.08 pct)
> Triad: 270521.87 (0.00 pct) 257686.71 (-4.74 pct)
>
> -> 100 Runs:
>
> Test: tip sis_core
> Copy: 374520.81 (0.00 pct) 369353.13 (-1.37 pct)
> Scale: 246280.23 (0.00 pct) 253881.69 (3.08 pct)
> Add: 262772.72 (0.00 pct) 266484.58 (1.41 pct)
> Triad: 283740.92 (0.00 pct) 279981.18 (-1.32 pct)
>
> On 10/19/2022 5:58 PM, Abel Wu wrote:
>> This patchset tries to improve SIS scan efficiency by recording idle
>> cpus in a cpumask for each LLC, which will be used as a target cpuset
>> in the domain scan. The cpus are recorded at CORE granularity to avoid
>> tasks being stacked on the same core.
>>
>> v5 -> v6:
>> - Rename SIS_FILTER to SIS_CORE as it can only be activated when
>> SMT is enabled and better describes the behavior of CORE granule
>> update & load delivery.
>> - Removed the part about limited scan for idle cores, since it might
>> be better to open another thread to discuss strategies such as
>> limited or scaled depth. But kept the full scan for idle
>> cores when the LLC is overloaded, because SIS_CORE can greatly reduce
>> the overhead of a full scan in that case.
>> - Removed the sd_is_busy state, which indicates an LLC is fully
>> busy so we can safely skip the SIS domain scan. I would prefer
>> to leave this to SIS_UTIL.
>> - The filter generation mechanism is replaced by in-place updates
>> during domain scan to better deal with partial scan failures.
>> - Collect Reviewed-bys from Tim Chen
>>
>> v4 -> v5:
>> - Add limited scan for idle cores when overloaded, suggested by Mel
>> - Split out several patches since they are irrelevant to this scope
>> - Add quick check on ttwu_pending before core update
>> - Wrap the filter into SIS_FILTER feature, suggested by Chen Yu
>> - Move the main filter logic to the idle path, because the newidle
>> balance can bail out early if rq->avg_idle is small enough and
>> lose the chance to update the filter.
>>
>> v3 -> v4:
>> - Update filter in load_balance rather than in the tick
>> - Now the filter contains unoccupied cpus rather than overloaded ones
>> - Added mechanisms to deal with the false positive cases
>>
>> v2 -> v3:
>> - Removed sched-idle balance feature and focus on SIS
>> - Take non-CFS tasks into consideration
>> - Several fixes/improvement suggested by Josh Don
>>
>> v1 -> v2:
>> - Several optimizations on sched-idle balancing
>> - Ignore asym topos in can_migrate_task
>> - Add more benchmarks including SIS efficiency
>> - Re-organize patch as suggested by Mel Gorman
>>
>> Abel Wu (4):
>> sched/fair: Skip core update if task pending
>> sched/fair: Ignore SIS_UTIL when has_idle_core
>> sched/fair: Introduce SIS_CORE
>> sched/fair: Deal with SIS scan failures
>>
>> include/linux/sched/topology.h | 15 ++++
>> kernel/sched/fair.c | 122 +++++++++++++++++++++++++++++----
>> kernel/sched/features.h | 7 ++
>> kernel/sched/sched.h | 3 +
>> kernel/sched/topology.c | 8 ++-
>> 5 files changed, 141 insertions(+), 14 deletions(-)
>>
>
> I ran pgbench from mmtests but realised there is too much run-to-run
> variation on the system. I'm planning on running the MongoDB benchmark,
> which is more stable on the system, and a couple more workloads, but the
> initial results look good. I'll get back with results later this week
> or by early next week. Meanwhile, if you need data for any specific
> workload on the test system, please do let me know.
>
> --
> Thanks and Regards,
> Prateek
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS
2022-11-15 8:31 ` Abel Wu
@ 2022-11-15 11:28 ` K Prateek Nayak
2022-11-22 11:28 ` K Prateek Nayak
0 siblings, 1 reply; 18+ messages in thread
From: K Prateek Nayak @ 2022-11-15 11:28 UTC (permalink / raw)
To: Abel Wu, Peter Zijlstra, Ingo Molnar, Mel Gorman,
Vincent Guittot, Dietmar Eggemann, Valentin Schneider
Cc: Josh Don, Chen Yu, Tim Chen, Gautham R . Shenoy, Aubrey Li,
Qais Yousef, Juri Lelli, Rik van Riel, Yicong Yang, Barry Song,
linux-kernel
Hello Abel,
Thank you for taking a look at the report.
On 11/15/2022 2:01 PM, Abel Wu wrote:
> Hi Prateek, thanks very much for your detailed testing!
>
> On 11/14/22 1:45 PM, K Prateek Nayak wrote:
>> Hello Abel,
>>
>> [..snip..]
>
> Although each cpu will get ~2.5 tasks with 16 groups, which can
> be considered overloaded, I tested on an AMD EPYC 7Y83 machine and
> the total cpu usage was ~82% (with some older kernel version),
> so there is still lots of idle time.
>
> I guess cutting off at 16 groups is because it's loaded enough
> compared to real workloads, so testing more groups might just
> be a waste of time?
The machine has 16 LLCs so I capped the results at 16-groups.
Previously I had seen some run-to-run variance with larger group counts
so I limited the reports to 16-groups. I'll run hackbench with a
larger number of groups (32, 64, 128, 256) and get back to you with
the results, along with results for a couple of long running workloads.
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS
2022-11-15 11:28 ` K Prateek Nayak
@ 2022-11-22 11:28 ` K Prateek Nayak
2022-11-24 3:50 ` Abel Wu
0 siblings, 1 reply; 18+ messages in thread
From: K Prateek Nayak @ 2022-11-22 11:28 UTC (permalink / raw)
To: Abel Wu, Peter Zijlstra, Ingo Molnar, Mel Gorman,
Vincent Guittot, Dietmar Eggemann, Valentin Schneider
Cc: Josh Don, Chen Yu, Tim Chen, Gautham R . Shenoy, Aubrey Li,
Qais Yousef, Juri Lelli, Rik van Riel, Yicong Yang, Barry Song,
linux-kernel
Hello Abel,
Following are the results for hackbench with a larger number of
groups, ycsb-mongodb, Spec-JBB, and unixbench. Apart from
a regression in unixbench spawn in NPS2 and NPS4 mode and
unixbench syscall in NPS4 mode, everything looks good.
Detailed results are below:
~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~
o NPS1:
tip: 131696.33 (var: 2.03%)
sis_core: 129519.00 (var: 1.46%) (-1.65%)
o NPS2:
tip: 129895.33 (var: 2.34%)
sis_core: 130774.33 (var: 2.57%) (+0.67%)
o NPS4:
tip: 131165.00 (var: 1.06%)
sis_core: 133547.33 (var: 3.90%) (+1.81%)
~~~~~~~~~~~~~~~~~
~ Spec-JBB NPS1 ~
~~~~~~~~~~~~~~~~~
Max-jOPS and Critical-jOPS are same as the tip kernel.
~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~
-> unixbench-dhry2reg
o NPS1
kernel: tip sis_core
Min unixbench-dhry2reg-1 48876615.50 ( 0.00%) 48891544.00 ( 0.03%)
Min unixbench-dhry2reg-512 6260344658.90 ( 0.00%) 6282967594.10 ( 0.36%)
Hmean unixbench-dhry2reg-1 49299721.81 ( 0.00%) 49233828.70 ( -0.13%)
Hmean unixbench-dhry2reg-512 6267459427.19 ( 0.00%) 6288772961.79 * 0.34%*
CoeffVar unixbench-dhry2reg-1 0.90 ( 0.00%) 0.68 ( 24.66%)
CoeffVar unixbench-dhry2reg-512 0.10 ( 0.00%) 0.10 ( 7.54%)
o NPS2
kernel: tip sis_core
Min unixbench-dhry2reg-1 48828251.70 ( 0.00%) 48856709.20 ( 0.06%)
Min unixbench-dhry2reg-512 6244987739.10 ( 0.00%) 6271229549.10 ( 0.42%)
Hmean unixbench-dhry2reg-1 48869882.65 ( 0.00%) 49302481.81 ( 0.89%)
Hmean unixbench-dhry2reg-512 6261073948.84 ( 0.00%) 6272564898.35 ( 0.18%)
CoeffVar unixbench-dhry2reg-1 0.08 ( 0.00%) 0.87 (-945.28%)
CoeffVar unixbench-dhry2reg-512 0.23 ( 0.00%) 0.03 ( 85.94%)
o NPS4
kernel: tip sis_core
Min unixbench-dhry2reg-1 48523981.30 ( 0.00%) 49083957.50 ( 1.15%)
Min unixbench-dhry2reg-512 6253738837.10 ( 0.00%) 6271747119.10 ( 0.29%)
Hmean unixbench-dhry2reg-1 48781044.09 ( 0.00%) 49232218.87 * 0.92%*
Hmean unixbench-dhry2reg-512 6264428474.90 ( 0.00%) 6280484789.64 ( 0.26%)
CoeffVar unixbench-dhry2reg-1 0.46 ( 0.00%) 0.26 ( 42.63%)
CoeffVar unixbench-dhry2reg-512 0.17 ( 0.00%) 0.21 ( -26.72%)
-> unixbench-syscall
o NPS1
kernel: tip sis_core
Min unixbench-syscall-1 2975654.80 ( 0.00%) 2978489.40 ( -0.10%)
Min unixbench-syscall-512 7840226.50 ( 0.00%) 7822133.40 ( 0.23%)
Amean unixbench-syscall-1 2976326.47 ( 0.00%) 2980985.27 * -0.16%*
Amean unixbench-syscall-512 7850493.90 ( 0.00%) 7844527.50 ( 0.08%)
CoeffVar unixbench-syscall-1 0.03 ( 0.00%) 0.07 (-154.43%)
CoeffVar unixbench-syscall-512 0.13 ( 0.00%) 0.34 (-158.96%)
o NPS2
kernel: tip sis_core
Min unixbench-syscall-1 2969863.60 ( 0.00%) 2977936.50 ( -0.27%)
Min unixbench-syscall-512 8053157.60 ( 0.00%) 8072239.00 ( -0.24%)
Amean unixbench-syscall-1 2970462.30 ( 0.00%) 2981732.50 * -0.38%*
Amean unixbench-syscall-512 8061454.50 ( 0.00%) 8079287.73 * -0.22%*
CoeffVar unixbench-syscall-1 0.02 ( 0.00%) 0.11 (-527.26%)
CoeffVar unixbench-syscall-512 0.12 ( 0.00%) 0.08 ( 37.30%)
o NPS4
kernel: tip sis_core
Min unixbench-syscall-1 2971799.80 ( 0.00%) 2979335.60 ( -0.25%)
Min unixbench-syscall-512 7824196.90 ( 0.00%) 8155610.20 ( -4.24%)
Amean unixbench-syscall-1 2973045.43 ( 0.00%) 2982036.13 * -0.30%*
Amean unixbench-syscall-512 7826302.17 ( 0.00%) 8173026.57 * -4.43%* <-- Regression in syscall for larger worker count
CoeffVar unixbench-syscall-1 0.04 ( 0.00%) 0.09 (-139.63%)
CoeffVar unixbench-syscall-512 0.03 ( 0.00%) 0.20 (-701.13%)
-> unixbench-pipe
o NPS1
kernel: tip sis_core
Min unixbench-pipe-1 2894765.30 ( 0.00%) 2891505.30 ( -0.11%)
Min unixbench-pipe-512 329818573.50 ( 0.00%) 325610257.80 ( -1.28%)
Hmean unixbench-pipe-1 2898803.38 ( 0.00%) 2896940.25 ( -0.06%)
Hmean unixbench-pipe-512 330226401.69 ( 0.00%) 326311984.29 * -1.19%*
CoeffVar unixbench-pipe-1 0.14 ( 0.00%) 0.17 ( -21.99%)
CoeffVar unixbench-pipe-512 0.11 ( 0.00%) 0.20 ( -88.38%)
o NPS2
kernel: tip sis_core
Min unixbench-pipe-1 2895327.90 ( 0.00%) 2894798.20 ( -0.02%)
Min unixbench-pipe-512 328350065.60 ( 0.00%) 325681163.10 ( -0.81%)
Hmean unixbench-pipe-1 2899129.86 ( 0.00%) 2897067.80 ( -0.07%)
Hmean unixbench-pipe-512 329436096.80 ( 0.00%) 326023030.94 * -1.04%*
CoeffVar unixbench-pipe-1 0.12 ( 0.00%) 0.09 ( 21.96%)
CoeffVar unixbench-pipe-512 0.30 ( 0.00%) 0.12 ( 60.80%)
o NPS4
kernel: tip sis_core
Min unixbench-pipe-1 2901525.60 ( 0.00%) 2885730.80 ( -0.54%)
Min unixbench-pipe-512 330265873.90 ( 0.00%) 326730770.60 ( -1.07%)
Hmean unixbench-pipe-1 2906184.70 ( 0.00%) 2891616.18 * -0.50%*
Hmean unixbench-pipe-512 330854683.27 ( 0.00%) 327113296.63 * -1.13%*
CoeffVar unixbench-pipe-1 0.14 ( 0.00%) 0.19 ( -33.74%)
CoeffVar unixbench-pipe-512 0.16 ( 0.00%) 0.11 ( 31.75%)
-> unixbench-spawn
o NPS1
kernel: tip sis_core
Min unixbench-spawn-1 6536.50 ( 0.00%) 6000.30 ( -8.20%)
Min unixbench-spawn-512 72571.40 ( 0.00%) 70829.60 ( -2.40%)
Hmean unixbench-spawn-1 6811.16 ( 0.00%) 7016.11 ( 3.01%)
Hmean unixbench-spawn-512 72801.77 ( 0.00%) 71012.03 * -2.46%*
CoeffVar unixbench-spawn-1 3.69 ( 0.00%) 13.52 (-266.69%)
CoeffVar unixbench-spawn-512 0.27 ( 0.00%) 0.22 ( 18.25%)
o NPS2
kernel: tip sis_core
Min unixbench-spawn-1 7042.20 ( 0.00%) 7078.70 ( 0.52%)
Min unixbench-spawn-512 85571.60 ( 0.00%) 77362.60 ( -9.59%)
Hmean unixbench-spawn-1 7199.01 ( 0.00%) 7276.55 ( 1.08%)
Hmean unixbench-spawn-512 85717.77 ( 0.00%) 77923.73 * -9.09%* <-- Regression in spawn test for larger worker count
CoeffVar unixbench-spawn-1 3.50 ( 0.00%) 3.30 ( 5.70%)
CoeffVar unixbench-spawn-512 0.20 ( 0.00%) 0.82 (-304.88%)
o NPS4
kernel: tip sis_core
Min unixbench-spawn-1 7521.90 ( 0.00%) 8102.80 ( 7.72%)
Min unixbench-spawn-512 84245.70 ( 0.00%) 73074.50 ( -13.26%)
Hmean unixbench-spawn-1 7659.12 ( 0.00%) 8645.19 * 12.87%*
Hmean unixbench-spawn-512 84908.77 ( 0.00%) 73409.49 * -13.54%* <-- Regression in spawn test for larger worker count
CoeffVar unixbench-spawn-1 1.92 ( 0.00%) 5.78 (-200.56%)
CoeffVar unixbench-spawn-512 0.76 ( 0.00%) 0.41 ( 46.58%)
-> unixbench-execl
o NPS1
kernel: tip sis_core
Min unixbench-execl-1 5421.50 ( 0.00%) 5471.50 ( 0.92%)
Min unixbench-execl-512 11213.50 ( 0.00%) 11677.20 ( 4.14%)
Hmean unixbench-execl-1 5443.75 ( 0.00%) 5475.36 * 0.58%*
Hmean unixbench-execl-512 11311.94 ( 0.00%) 11804.52 * 4.35%*
CoeffVar unixbench-execl-1 0.38 ( 0.00%) 0.12 ( 69.22%)
CoeffVar unixbench-execl-512 1.03 ( 0.00%) 1.73 ( -68.91%)
o NPS2
kernel: tip sis_core
Min unixbench-execl-1 5089.10 ( 0.00%) 5405.40 ( 6.22%)
Min unixbench-execl-512 11772.70 ( 0.00%) 11917.20 ( 1.23%)
Hmean unixbench-execl-1 5321.65 ( 0.00%) 5421.41 ( 1.87%)
Hmean unixbench-execl-512 12201.73 ( 0.00%) 12327.95 ( 1.03%)
CoeffVar unixbench-execl-1 3.87 ( 0.00%) 0.28 ( 92.88%)
CoeffVar unixbench-execl-512 6.23 ( 0.00%) 5.78 ( 7.21%)
o NPS4
kernel: tip sis_core
Min unixbench-execl-1 5099.40 ( 0.00%) 5479.60 ( 7.46%)
Min unixbench-execl-512 11692.80 ( 0.00%) 12205.50 ( 4.38%)
Hmean unixbench-execl-1 5136.86 ( 0.00%) 5487.93 * 6.83%*
Hmean unixbench-execl-512 12053.71 ( 0.00%) 12712.96 ( 5.47%)
CoeffVar unixbench-execl-1 1.05 ( 0.00%) 0.14 ( 86.57%)
CoeffVar unixbench-execl-512 3.85 ( 0.00%) 5.86 ( -52.14%)
For the unixbench regressions, I do not see anything obvious jumping
out in the perf traces captured with IBS. top shows over 99% utilization,
which would ideally mean there are not many updates to the mask.
I'll take a closer look at the spawn test case and get back to you.
On 11/15/2022 4:58 PM, K Prateek Nayak wrote:
> Hello Abel,
>
> Thank you for taking a look at the report.
>
> On 11/15/2022 2:01 PM, Abel Wu wrote:
>> Hi Prateek, thanks very much for your detailed testing!
>>
>> On 11/14/22 1:45 PM, K Prateek Nayak wrote:
>>> [..snip..]
>>
>> Although each cpu will get ~2.5 tasks with 16 groups, which can
>> be considered overloaded, I tested on an AMD EPYC 7Y83 machine and
>> the total cpu usage was ~82% (with some older kernel version),
>> so there is still lots of idle time.
>>
>> I guess cutting off at 16 groups is because it's loaded enough
>> compared to real workloads, so testing more groups might just
>> be a waste of time?
>
> The machine has 16 LLCs so I capped the results at 16-groups.
> Previously I had seen some run-to-run variance with larger group counts
> so I limited the reports to 16-groups. I'll run hackbench with a
> larger number of groups (32, 64, 128, 256) and get back to you with
> the results, along with results for a couple of long running workloads.
~~~~~~~~~~~~~
~ Hackbench ~
~~~~~~~~~~~~~
$ perf bench sched messaging -p -l 50000 -g <groups>
o NPS1
kernel: tip sis_core
32-groups: 6.20 (0.00 pct) 5.86 (5.48 pct)
64-groups: 16.55 (0.00 pct) 15.21 (8.09 pct)
128-groups: 42.57 (0.00 pct) 34.63 (18.65 pct)
256-groups: 71.69 (0.00 pct) 67.11 (6.38 pct)
512-groups: 108.48 (0.00 pct) 110.23 (-1.61 pct)
o NPS2
kernel: tip sis_core
32-groups: 6.56 (0.00 pct) 5.60 (14.63 pct)
64-groups: 15.74 (0.00 pct) 14.45 (8.19 pct)
128-groups: 39.93 (0.00 pct) 35.33 (11.52 pct)
256-groups: 74.49 (0.00 pct) 69.65 (6.49 pct)
512-groups: 112.22 (0.00 pct) 113.75 (-1.36 pct)
o NPS4:
kernel: tip sis_core
32-groups: 9.48 (0.00 pct) 5.64 (40.50 pct)
64-groups: 15.38 (0.00 pct) 14.13 (8.12 pct)
128-groups: 39.93 (0.00 pct) 34.47 (13.67 pct)
256-groups: 75.31 (0.00 pct) 67.98 (9.73 pct)
512-groups: 115.37 (0.00 pct) 111.15 (3.65 pct)
Note: Hackbench with 32-groups show run to run variation
on tip but is more stable with sis_core. Hackbench for
64-groups and beyond is stable on both kernels.
Apart from the couple of regressions in Unixbench, everything looks good.
If you would like me to get any more data for any workload on the test
system, please do let me know.
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS
2022-11-22 11:28 ` K Prateek Nayak
@ 2022-11-24 3:50 ` Abel Wu
0 siblings, 0 replies; 18+ messages in thread
From: Abel Wu @ 2022-11-24 3:50 UTC (permalink / raw)
To: K Prateek Nayak, Peter Zijlstra, Ingo Molnar, Mel Gorman,
Vincent Guittot, Dietmar Eggemann, Valentin Schneider
Cc: Josh Don, Chen Yu, Tim Chen, Gautham R . Shenoy, Aubrey Li,
Qais Yousef, Juri Lelli, Rik van Riel, Yicong Yang, Barry Song,
linux-kernel
Hi Prateek, thanks again for your detailed testing!
On 11/22/22 7:28 PM, K Prateek Nayak wrote:
> Hello Abel,
>
> Following are the results for hackbench with a larger number of
> groups, ycsb-mongodb, Spec-JBB, and unixbench. Apart from
> a regression in unixbench spawn in NPS2 and NPS4 mode and
> unixbench syscall in NPS4 mode, everything looks good.
>
> ...
>
> -> unixbench-syscall
>
> o NPS4
>
> kernel: tip sis_core
> Min unixbench-syscall-1 2971799.80 ( 0.00%) 2979335.60 ( -0.25%)
> Min unixbench-syscall-512 7824196.90 ( 0.00%) 8155610.20 ( -4.24%)
> Amean unixbench-syscall-1 2973045.43 ( 0.00%) 2982036.13 * -0.30%*
> Amean unixbench-syscall-512 7826302.17 ( 0.00%) 8173026.57 * -4.43%* <-- Regression in syscall for larger worker count
> CoeffVar unixbench-syscall-1 0.04 ( 0.00%) 0.09 (-139.63%)
> CoeffVar unixbench-syscall-512 0.03 ( 0.00%) 0.20 (-701.13%)
>
>
> -> unixbench-spawn
>
> o NPS1
>
> kernel: tip sis_core
> Min unixbench-spawn-1 6536.50 ( 0.00%) 6000.30 ( -8.20%)
> Min unixbench-spawn-512 72571.40 ( 0.00%) 70829.60 ( -2.40%)
> Hmean unixbench-spawn-1 6811.16 ( 0.00%) 7016.11 ( 3.01%)
> Hmean unixbench-spawn-512 72801.77 ( 0.00%) 71012.03 * -2.46%*
> CoeffVar unixbench-spawn-1 3.69 ( 0.00%) 13.52 (-266.69%)
> CoeffVar unixbench-spawn-512 0.27 ( 0.00%) 0.22 ( 18.25%)
>
> o NPS2
>
> kernel: tip sis_core
> Min unixbench-spawn-1 7042.20 ( 0.00%) 7078.70 ( 0.52%)
> Min unixbench-spawn-512 85571.60 ( 0.00%) 77362.60 ( -9.59%)
> Hmean unixbench-spawn-1 7199.01 ( 0.00%) 7276.55 ( 1.08%)
> Hmean unixbench-spawn-512 85717.77 ( 0.00%) 77923.73 * -9.09%* <-- Regression in spawn test for larger worker count
> CoeffVar unixbench-spawn-1 3.50 ( 0.00%) 3.30 ( 5.70%)
> CoeffVar unixbench-spawn-512 0.20 ( 0.00%) 0.82 (-304.88%)
>
> o NPS4
>
> kernel: tip sis_core
> Min unixbench-spawn-1 7521.90 ( 0.00%) 8102.80 ( 7.72%)
> Min unixbench-spawn-512 84245.70 ( 0.00%) 73074.50 ( -13.26%)
> Hmean unixbench-spawn-1 7659.12 ( 0.00%) 8645.19 * 12.87%*
> Hmean unixbench-spawn-512 84908.77 ( 0.00%) 73409.49 * -13.54%* <-- Regression in spawn test for larger worker count
> CoeffVar unixbench-spawn-1 1.92 ( 0.00%) 5.78 (-200.56%)
> CoeffVar unixbench-spawn-512 0.76 ( 0.00%) 0.41 ( 46.58%)
>
> ...
>
> For the unixbench regressions, I do not see anything obvious jumping
> out in the perf traces captured with IBS. top shows over 99% utilization,
> which would ideally mean there are not many updates to the mask.
> I'll take a closer look at the spawn test case and get back to you.
These regressions seem to be common in fully parallel tests. I
guess it might be due to over-updating the idle cpumask when the LLC
is overloaded, which is not necessary if SIS_UTIL is enabled, but I
need to dig into it further. Maybe the rq avg_idle or nr_idle_scan
needs to be taken into consideration as well. Thanks for providing
this important information.
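To make the gating idea concrete, here is a toy user-space model of it;
the names (nr_idle_scan, update_idle_cpumask) only mirror the kernel's,
and the code is an illustration of skipping shared-mask writes when
SIS_UTIL considers the LLC overloaded, not the actual implementation:

```python
# Toy model: skip idle-cpumask updates when the LLC is overloaded.
# nr_idle_scan == 0 stands in for SIS_UTIL curtailing the domain scan,
# at which point keeping the shared mask hot is wasted cache traffic.
class LLCDomain:
    def __init__(self, nr_cpus):
        self.idle_cores = set()
        self.nr_idle_scan = nr_cpus  # shrinks as utilization grows
        self.updates = 0             # count of (costly) shared-mask writes

    def update_idle_cpumask(self, core, idle):
        if self.nr_idle_scan == 0:
            return  # overloaded: scan is curtailed anyway, skip the write
        self.updates += 1
        (self.idle_cores.add if idle else self.idle_cores.discard)(core)

llc = LLCDomain(nr_cpus=16)
llc.update_idle_cpumask(3, idle=True)   # recorded while load is moderate
llc.nr_idle_scan = 0                    # SIS_UTIL: domain overloaded
llc.update_idle_cpumask(3, idle=False)  # skipped: no cache-line bouncing
print(llc.updates)  # 1
```

The tradeoff the sketch makes visible is staleness: core 3 stays marked
idle after the skipped update, which is tolerable only because scanning
is curtailed in that state anyway.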
>
> ~~~~~~~~~~~~~
> ~ Hackbench ~
> ~~~~~~~~~~~~~
>
> $ perf bench sched messaging -p -l 50000 -g <groups>
>
> o NPS1
>
> kernel: tip sis_core
> 32-groups: 6.20 (0.00 pct) 5.86 (5.48 pct)
> 64-groups: 16.55 (0.00 pct) 15.21 (8.09 pct)
> 128-groups: 42.57 (0.00 pct) 34.63 (18.65 pct)
> 256-groups: 71.69 (0.00 pct) 67.11 (6.38 pct)
> 512-groups: 108.48 (0.00 pct) 110.23 (-1.61 pct)
>
> o NPS2
>
> kernel: tip sis_core
> 32-groups: 6.56 (0.00 pct) 5.60 (14.63 pct)
> 64-groups: 15.74 (0.00 pct) 14.45 (8.19 pct)
> 128-groups: 39.93 (0.00 pct) 35.33 (11.52 pct)
> 256-groups: 74.49 (0.00 pct) 69.65 (6.49 pct)
> 512-groups: 112.22 (0.00 pct) 113.75 (-1.36 pct)
>
> o NPS4:
>
> kernel: tip sis_core
> 32-groups: 9.48 (0.00 pct) 5.64 (40.50 pct)
> 64-groups: 15.38 (0.00 pct) 14.13 (8.12 pct)
> 128-groups: 39.93 (0.00 pct) 34.47 (13.67 pct)
> 256-groups: 75.31 (0.00 pct) 67.98 (9.73 pct)
> 512-groups: 115.37 (0.00 pct) 111.15 (3.65 pct)
>
> Note: Hackbench with 32-groups show run to run variation
> on tip but is more stable with sis_core. Hackbench for
> 64-groups and beyond is stable on both kernels.
>
The results are consistent with mine, except for 512-groups which I
didn't test. The 512-groups test may have the same problem
mentioned above.
Thanks & Regards,
Abel
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS
2022-10-19 12:28 [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS Abel Wu
` (5 preceding siblings ...)
2022-11-14 5:45 ` K Prateek Nayak
@ 2023-02-07 3:42 ` K Prateek Nayak
2023-02-16 13:18 ` Abel Wu
6 siblings, 1 reply; 18+ messages in thread
From: K Prateek Nayak @ 2023-02-07 3:42 UTC (permalink / raw)
To: Abel Wu, Peter Zijlstra, Ingo Molnar, Mel Gorman,
Vincent Guittot, Dietmar Eggemann, Valentin Schneider
Cc: Josh Don, Chen Yu, Tim Chen, Gautham R . Shenoy, Aubrey Li,
Qais Yousef, Juri Lelli, Rik van Riel, Yicong Yang, Barry Song,
linux-kernel
Hello Abel,
I've retested the patches on the updated tip and the results
are still promising.
tl;dr
o Hackbench sees improvements when the machine is overloaded.
o tbench shows improvements when the machine is overloaded.
o The unixbench regression seen previously seems to be unrelated
to the patch as the spawn test scores are vastly different
after a reboot/kexec for the same kernel.
o Other benchmarks show slight improvements or are comparable to
the numbers on tip.
Following are the results from running standard benchmarks on a
dual socket Zen3 (2 x 64C/128T) machine configured in different
NPS modes.
NPS modes are used to logically divide a single socket into
multiple NUMA regions.
Following is the NUMA configuration for each NPS mode on the system:
NPS1: Each socket is a NUMA node.
Total 2 NUMA nodes in the dual socket machine.
Node 0: 0-63, 128-191
Node 1: 64-127, 192-255
NPS2: Each socket is further logically divided into 2 NUMA regions.
Total 4 NUMA nodes exist over the 2 sockets.
Node 0: 0-31, 128-159
Node 1: 32-63, 160-191
Node 2: 64-95, 192-223
Node 3: 96-127, 224-255
NPS4: Each socket is logically divided into 4 NUMA regions.
Total 8 NUMA nodes exist over the 2 sockets.
Node 0: 0-15, 128-143
Node 1: 16-31, 144-159
Node 2: 32-47, 160-175
Node 3: 48-63, 176-191
Node 4: 64-79, 192-207
Node 5: 80-95, 208-223
Node 6: 96-111, 224-239
Node 7: 112-127, 240-255
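The per-node cpu lists above can be read straight from sysfs on any Linux
machine; this is generic kernel sysfs, independent of the patches under
test:

```shell
# Print each NUMA node's cpu list as reported by the kernel.
# Nodes that are absent are simply skipped.
echo "NUMA node -> cpulist"
for node in /sys/devices/system/node/node*; do
    [ -d "$node" ] || continue
    printf '%s: %s\n' "${node##*/}" "$(cat "$node/cpulist")"
done
```

`numactl --hardware` gives the same information (plus distances) where
numactl is installed.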
Following are the Kernel versions:
tip: 6.2.0-rc2 tip:sched/core at
commit: bbd0b031509b "sched/rseq: Fix concurrency ID handling of usermodehelper kthreads"
sis_core: tip + series
The patches applied cleanly on tip.
Benchmark Results:
~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~
NPS1
Test: tip sis_core
1-groups: 4.36 (0.00 pct) 4.17 (4.35 pct)
2-groups: 5.17 (0.00 pct) 5.03 (2.70 pct)
4-groups: 4.17 (0.00 pct) 4.14 (0.71 pct)
8-groups: 4.64 (0.00 pct) 4.63 (0.21 pct)
16-groups: 5.43 (0.00 pct) 5.32 (2.02 pct)
NPS2
Test: tip sis_core
1-groups: 4.43 (0.00 pct) 4.27 (3.61 pct)
2-groups: 4.61 (0.00 pct) 4.92 (-6.72 pct) *
2-groups: 4.52 (0.00 pct) 4.55 (-0.66 pct) [Verification Run]
4-groups: 4.25 (0.00 pct) 4.10 (3.52 pct)
8-groups: 4.91 (0.00 pct) 4.53 (7.73 pct)
16-groups: 5.84 (0.00 pct) 5.54 (5.13 pct)
NPS4
Test: tip sis_core
1-groups: 4.34 (0.00 pct) 4.23 (2.53 pct)
2-groups: 4.64 (0.00 pct) 4.84 (-4.31 pct)
4-groups: 4.20 (0.00 pct) 4.17 (0.71 pct)
8-groups: 5.21 (0.00 pct) 5.06 (2.87 pct)
16-groups: 6.24 (0.00 pct) 5.60 (10.25 pct)
~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~
NPS1
#workers: tip sis_core
1: 36.00 (0.00 pct) 23.00 (36.11 pct)
2: 37.00 (0.00 pct) 37.00 (0.00 pct)
4: 37.00 (0.00 pct) 38.00 (-2.70 pct)
8: 47.00 (0.00 pct) 52.00 (-10.63 pct)
16: 64.00 (0.00 pct) 65.00 (-1.56 pct)
32: 109.00 (0.00 pct) 111.00 (-1.83 pct)
64: 222.00 (0.00 pct) 215.00 (3.15 pct)
128: 515.00 (0.00 pct) 486.00 (5.63 pct)
256: 39744.00 (0.00 pct) 47808.00 (-20.28 pct) * (Machine Overloaded ~ 2 tasks per rq)
256: 43242.00 (0.00 pct) 42293.00 (2.19 pct) [Verification Run]
512: 81280.00 (0.00 pct) 76416.00 (5.98 pct)
NPS2
#workers: tip sis_core
1: 27.00 (0.00 pct) 27.00 (0.00 pct)
2: 31.00 (0.00 pct) 30.00 (3.22 pct)
4: 38.00 (0.00 pct) 37.00 (2.63 pct)
8: 50.00 (0.00 pct) 46.00 (8.00 pct)
16: 66.00 (0.00 pct) 68.00 (-3.03 pct)
32: 116.00 (0.00 pct) 113.00 (2.58 pct)
64: 210.00 (0.00 pct) 228.00 (-8.57 pct) *
64: 206.00 (0.00 pct) 219.00 (-6.31 pct) [Verification Run]
128: 523.00 (0.00 pct) 559.00 (-6.88 pct) *
128: 474.00 (0.00 pct) 497.00 (-4.85 pct) [Verification Run]
256: 44864.00 (0.00 pct) 47040.00 (-4.85 pct)
512: 78464.00 (0.00 pct) 81280.00 (-3.58 pct)
NPS4
#workers: tip sis_core
1: 32.00 (0.00 pct) 27.00 (15.62 pct)
2: 32.00 (0.00 pct) 35.00 (-9.37 pct)
4: 34.00 (0.00 pct) 41.00 (-20.58 pct)
8: 58.00 (0.00 pct) 58.00 (0.00 pct)
16: 67.00 (0.00 pct) 69.00 (-2.98 pct)
32: 118.00 (0.00 pct) 112.00 (5.08 pct)
64: 224.00 (0.00 pct) 209.00 (6.69 pct)
128: 533.00 (0.00 pct) 519.00 (2.62 pct)
256: 43456.00 (0.00 pct) 45248.00 (-4.12 pct)
512: 78976.00 (0.00 pct) 76160.00 (3.56 pct)
~~~~~~~~~~
~ tbench ~
~~~~~~~~~~
NPS1
Clients: tip sis_core
1 539.96 (0.00 pct) 538.19 (-0.32 pct)
2 1068.21 (0.00 pct) 1063.04 (-0.48 pct)
4 1994.76 (0.00 pct) 1990.47 (-0.21 pct)
8 3602.30 (0.00 pct) 3496.07 (-2.94 pct)
16 6075.49 (0.00 pct) 6061.74 (-0.22 pct)
32 11641.07 (0.00 pct) 11904.58 (2.26 pct)
64 21529.16 (0.00 pct) 22124.81 (2.76 pct)
128 30852.92 (0.00 pct) 31258.56 (1.31 pct)
256 51901.20 (0.00 pct) 53249.69 (2.59 pct)
512 46797.40 (0.00 pct) 54477.79 (16.41 pct)
1024 46057.28 (0.00 pct) 53676.58 (16.54 pct)
NPS2
Clients: tip sis_core
1 536.11 (0.00 pct) 541.18 (0.94 pct)
2 1044.58 (0.00 pct) 1064.16 (1.87 pct)
4 2043.92 (0.00 pct) 2017.84 (-1.27 pct)
8 3572.50 (0.00 pct) 3494.83 (-2.17 pct)
16 6040.97 (0.00 pct) 5530.10 (-8.45 pct) *
16 5814.03 (0.00 pct) 6012.33 (-8.45 pct) [Verification Run]
32 10794.10 (0.00 pct) 10841.68 (0.44 pct)
64 20905.89 (0.00 pct) 21438.82 (2.54 pct)
128 30885.39 (0.00 pct) 30064.78 (-2.65 pct)
256 48901.25 (0.00 pct) 51395.08 (5.09 pct)
512 49673.91 (0.00 pct) 51725.89 (4.13 pct)
1024 47626.34 (0.00 pct) 52662.01 (10.57 pct)
NPS4
Clients: tip sis_core
1 544.91 (0.00 pct) 544.66 (-0.04 pct)
2 1046.49 (0.00 pct) 1072.42 (2.47 pct)
4 2007.11 (0.00 pct) 1970.05 (-1.84 pct)
8 3590.66 (0.00 pct) 3670.45 (2.22 pct)
16 5956.60 (0.00 pct) 6045.07 (1.48 pct)
32 10431.73 (0.00 pct) 10439.40 (0.07 pct)
64 21563.37 (0.00 pct) 19344.05 (-10.29 pct) *
64 19387.71 (0.00 pct) 19050.47 (-1.73 pct) [Verification Run]
128 30352.16 (0.00 pct) 26998.85 (-11.04 pct) *
128 29110.99 (0.00 pct) 29690.37 (1.99 pct) [Verification Run]
256 49504.51 (0.00 pct) 50921.66 (2.86 pct)
512 44916.61 (0.00 pct) 52176.11 (16.16 pct)
1024 49986.21 (0.00 pct) 51639.91 (3.30 pct)
~~~~~~~~~~
~ stream ~
~~~~~~~~~~
NPS1
10 Runs:
Test: tip sis_core
Copy: 339390.30 (0.00 pct) 324656.88 (-4.34 pct)
Scale: 212472.78 (0.00 pct) 210641.39 (-0.86 pct)
Add: 247598.48 (0.00 pct) 241669.10 (-2.39 pct)
Triad: 261852.07 (0.00 pct) 252088.55 (-3.72 pct)
100 Runs:
Test: tip sis_core
Copy: 335938.02 (0.00 pct) 331491.32 (-1.32 pct)
Scale: 212597.92 (0.00 pct) 218705.84 (2.87 pct)
Add: 248294.62 (0.00 pct) 243830.42 (-1.79 pct)
Triad: 258400.88 (0.00 pct) 248178.42 (-3.95 pct)
NPS2
10 Runs:
Test: tip sis_core
Copy: 334500.32 (0.00 pct) 335317.70 (0.24 pct)
Scale: 216804.76 (0.00 pct) 217862.71 (0.48 pct)
Add: 250787.33 (0.00 pct) 258839.00 (3.21 pct)
Triad: 259451.40 (0.00 pct) 264847.88 (2.07 pct)
100 Runs:
Test: tip sis_core
Copy: 326385.13 (0.00 pct) 338030.70 (3.56 pct)
Scale: 216440.37 (0.00 pct) 230053.24 (6.28 pct)
Add: 255062.22 (0.00 pct) 259197.23 (1.62 pct)
Triad: 265442.03 (0.00 pct) 271365.65 (2.23 pct)
NPS4
10 Runs:
Test: tip sis_core
Copy: 363927.86 (0.00 pct) 361014.15 (-0.80 pct)
Scale: 238190.49 (0.00 pct) 242176.02 (1.67 pct)
Add: 262806.49 (0.00 pct) 266348.50 (1.34 pct)
Triad: 276492.33 (0.00 pct) 276769.10 (0.10 pct)
100 Runs:
Test: tip sis_core
Copy: 365041.37 (0.00 pct) 349299.35 (-4.31 pct)
Scale: 239295.27 (0.00 pct) 229944.85 (-3.90 pct)
Add: 264085.21 (0.00 pct) 252651.56 (-4.32 pct)
Triad: 279664.56 (0.00 pct) 274254.22 (-1.93 pct)
~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~
o NPS1
tip: 131328.67 (var: 2.97%)
sis_core: 131702.33 (var: 3.61%) (0.28%)
o NPS2:
tip: 132482.33 (var: 2.06%)
sis_core: 132338.33 (var: 0.97%) (-0.11%)
o NPS4:
tip: 134130.00 (var: 4.12%)
sis_core: 133224.33 (var: 4.13%) (-0.67%)
~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~
o NPS1
Test Metric Parallelism tip sis_core
unixbench-dhry2reg Hmean unixbench-dhry2reg-1 48770555.20 ( 0.00%) 49025161.73 ( 0.52%)
unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6268185467.60 ( 0.00%) 6266351964.20 ( -0.03%)
unixbench-syscall Amean unixbench-syscall-1 2685321.17 ( 0.00%) 2694468.30 * -0.34%*
unixbench-syscall Amean unixbench-syscall-512 7291476.20 ( 0.00%) 7295087.67 ( -0.05%)
unixbench-pipe Hmean unixbench-pipe-1 2480858.53 ( 0.00%) 2536923.44 * 2.26%*
unixbench-pipe Hmean unixbench-pipe-512 300739256.62 ( 0.00%) 303470605.93 * 0.91%*
unixbench-spawn Hmean unixbench-spawn-1 4358.14 ( 0.00%) 4104.88 ( -5.81%) * (Known to be unstable)
unixbench-spawn Hmean unixbench-spawn-1 4711.00 ( 0.00%) 4006.20 ( -14.96%) [Verification Run]
unixbench-spawn Hmean unixbench-spawn-512 76497.32 ( 0.00%) 75555.94 * -1.23%*
unixbench-execl Hmean unixbench-execl-1 4147.12 ( 0.00%) 4157.33 ( 0.25%)
unixbench-execl Hmean unixbench-execl-512 12435.26 ( 0.00%) 11992.43 ( -3.56%)
o NPS2
Test Metric Parallelism tip sis_core
unixbench-dhry2reg Hmean unixbench-dhry2reg-1 48872335.50 ( 0.00%) 48902553.70 ( 0.06%)
unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6264134378.20 ( 0.00%) 6260631689.40 ( -0.06%)
unixbench-syscall Amean unixbench-syscall-1 2683903.13 ( 0.00%) 2694829.17 * -0.41%*
unixbench-syscall Amean unixbench-syscall-512 7746773.60 ( 0.00%) 7493782.67 * 3.27%*
unixbench-pipe Hmean unixbench-pipe-1 2476724.23 ( 0.00%) 2537127.96 * 2.44%*
unixbench-pipe Hmean unixbench-pipe-512 300277350.41 ( 0.00%) 302979776.19 * 0.90%*
unixbench-spawn Hmean unixbench-spawn-1 5026.50 ( 0.00%) 4680.63 ( -6.88%) *
unixbench-spawn Hmean unixbench-spawn-1 5421.70 ( 0.00%) 5311.50 ( -2.03%) [Verification Run]
unixbench-spawn Hmean unixbench-spawn-512 80549.70 ( 0.00%) 78888.60 ( -2.06%)
unixbench-execl Hmean unixbench-execl-1 4151.70 ( 0.00%) 3913.76 * -5.73%* *
unixbench-execl Hmean unixbench-execl-1 4304.30 ( 0.00%) 4303.20 ( -0.02%) [Verification run]
unixbench-execl Hmean unixbench-execl-512 13605.15 ( 0.00%) 13129.23 ( -3.50%)
o NPS4
Test Metric Parallelism tip sis_core
unixbench-dhry2reg Hmean unixbench-dhry2reg-1 48506771.20 ( 0.00%) 48894866.70 ( 0.80%)
unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6280954362.50 ( 0.00%) 6282759876.40 ( 0.03%)
unixbench-syscall Amean unixbench-syscall-1 2687259.30 ( 0.00%) 2695379.93 * -0.30%*
unixbench-syscall Amean unixbench-syscall-512 7350275.67 ( 0.00%) 7366923.73 ( -0.23%)
unixbench-pipe Hmean unixbench-pipe-1 2478893.01 ( 0.00%) 2540015.88 * 2.47%*
unixbench-pipe Hmean unixbench-pipe-512 301830155.61 ( 0.00%) 304305539.27 * 0.82%*
unixbench-spawn Hmean unixbench-spawn-1 5208.55 ( 0.00%) 5273.11 ( 1.24%)
unixbench-spawn Hmean unixbench-spawn-512 80745.79 ( 0.00%) 81940.71 * 1.48%*
unixbench-execl Hmean unixbench-execl-1 4072.72 ( 0.00%) 4126.13 * 1.31%*
unixbench-execl Hmean unixbench-execl-512 13746.56 ( 0.00%) 12848.77 ( -6.53%) *
unixbench-execl Hmean unixbench-execl-512 13898.30 ( 0.00%) 13959.70 ( 0.44%) [Verification Run]
On 10/19/2022 5:58 PM, Abel Wu wrote:
> This patchset tries to improve SIS scan efficiency by recording idle
> cpus in a cpumask for each LLC which will be used as a target cpuset
> in the domain scan. The cpus are recorded at CORE granularity to
> avoid tasks being stacked on the same core.
>
> v5 -> v6:
> - Rename SIS_FILTER to SIS_CORE as it can only be activated when
> SMT is enabled and better describes the behavior of CORE granule
> update & load delivery.
> - Removed the part of limited scan for idle cores since it might be
> better to open another thread to discuss the strategies such as
> limited or scaled depth. But keep the part of full scan for idle
> cores when LLC is overloaded because SIS_CORE can greatly reduce
> the overhead of full scan in such case.
> - Removed the state of sd_is_busy which indicates an LLC is fully
> busy and we can safely skip the SIS domain scan. I would prefer
> leave this to SIS_UTIL.
> - The filter generation mechanism is replaced by in-place updates
> during domain scan to better deal with partial scan failures.
> - Collect Reviewed-bys from Tim Chen
>
> v4 -> v5:
> - Add limited scan for idle cores when overloaded, suggested by Mel
> - Split out several patches since they are irrelevant to this scope
> - Add quick check on ttwu_pending before core update
> - Wrap the filter into SIS_FILTER feature, suggested by Chen Yu
> - Move the main filter logic to the idle path, because the newidle
> balance can bail out early if rq->avg_idle is small enough and
> lose chances to update the filter.
>
> v3 -> v4:
> - Update filter in load_balance rather than in the tick
> - Now the filter contains unoccupied cpus rather than overloaded ones
> - Added mechanisms to deal with the false positive cases
>
> v2 -> v3:
> - Removed sched-idle balance feature and focus on SIS
> - Take non-CFS tasks into consideration
> - Several fixes/improvement suggested by Josh Don
>
> v1 -> v2:
> - Several optimizations on sched-idle balancing
> - Ignore asym topos in can_migrate_task
> - Add more benchmarks including SIS efficiency
> - Re-organize patch as suggested by Mel Gorman
>
> Abel Wu (4):
> sched/fair: Skip core update if task pending
> sched/fair: Ignore SIS_UTIL when has_idle_core
> sched/fair: Introduce SIS_CORE
> sched/fair: Deal with SIS scan failures
>
> include/linux/sched/topology.h | 15 ++++
> kernel/sched/fair.c | 122 +++++++++++++++++++++++++++++----
> kernel/sched/features.h | 7 ++
> kernel/sched/sched.h | 3 +
> kernel/sched/topology.c | 8 ++-
> 5 files changed, 141 insertions(+), 14 deletions(-)
>
Testing with a couple of larger workloads like SpecJBB is still underway.
I'll update the thread with the results once they are done. The idea
is promising. I'll also try to run schbench / hackbench pinned in a
manner such that all wakeups happen on an external LLC to spot any
impact of rapid changes to the idle cpu mask of an external LLC.
Please let me know if you would like me to test or get data for any
particular benchmark from my test setup.
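The pinned run described above can be sketched with taskset. The cpu list
and group count below are assumptions for illustration, not the actual
test configuration, and the command is printed rather than executed:

```shell
# Confine all hackbench tasks to the cpus of a single LLC so that any
# wakeup targeting another LLC exercises that LLC's idle cpumask from
# the outside.  Cpu list and group count are hypothetical placeholders.
LLC0_CPUS="0-7,128-135"		# assumed: one CCX worth of cpus
NGROUPS=8
CMD="taskset -c $LLC0_CPUS perf bench sched messaging -p -l 50000 -g $NGROUPS"
echo "would run: $CMD"
```

The LLC boundaries for a given machine can be confirmed via
/sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list before picking
the pin set.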
--
Thanks and Regards,
Prateek
* Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS
2023-02-07 3:42 ` K Prateek Nayak
@ 2023-02-16 13:18 ` Abel Wu
0 siblings, 0 replies; 18+ messages in thread
From: Abel Wu @ 2023-02-16 13:18 UTC (permalink / raw)
To: K Prateek Nayak, Peter Zijlstra, Ingo Molnar, Mel Gorman,
Vincent Guittot, Dietmar Eggemann, Valentin Schneider
Cc: Josh Don, Chen Yu, Tim Chen, Gautham R . Shenoy, Aubrey Li,
Qais Yousef, Juri Lelli, Rik van Riel, Yicong Yang, Barry Song,
linux-kernel
Hi Prateek, thanks very much for your thorough testing!
On 2/7/23 11:42 AM, K Prateek Nayak wrote:
> Hello Abel,
>
> I've retested the patches with on the updated tip and the results
> are still promising.
>
> tl;dr
>
> o Hackbench sees improvements when the machine is overloaded.
> o tbench shows improvements when the machine is overloaded.
> o The unixbench regression seen previously seems to be unrelated
> to the patch as the spawn test scores are vastly different
> after a reboot/kexec for the same kernel.
> o Other benchmarks show slight improvements or are comparable to
> the numbers on tip.
Cheers! Yet I still see some minor regressions in the report
above. As we discussed last time, reducing unnecessary updates
to the idle cpumask when the LLC is overloaded should help.
Thanks & Best regards,
Abel
Thread overview: 18+ messages
2022-10-19 12:28 [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS Abel Wu
2022-10-19 12:28 ` [PATCH v6 1/4] sched/fair: Skip core update if task pending Abel Wu
2022-10-19 12:28 ` [PATCH v6 2/4] sched/fair: Ignore SIS_UTIL when has_idle_core Abel Wu
2022-10-19 12:28 ` [PATCH v6 3/4] sched/fair: Introduce SIS_CORE Abel Wu
2022-10-21 4:03 ` Chen Yu
2022-10-21 4:30 ` Abel Wu
2022-10-21 4:34 ` Chen Yu
2022-10-21 9:35 ` Abel Wu
2022-10-21 11:14 ` Chen Yu
2022-10-19 12:28 ` [PATCH v6 4/4] sched/fair: Deal with SIS scan failures Abel Wu
2022-11-04 7:29 ` [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS Abel Wu
2022-11-14 5:45 ` K Prateek Nayak
2022-11-15 8:31 ` Abel Wu
2022-11-15 11:28 ` K Prateek Nayak
2022-11-22 11:28 ` K Prateek Nayak
2022-11-24 3:50 ` Abel Wu
2023-02-07 3:42 ` K Prateek Nayak
2023-02-16 13:18 ` Abel Wu