* [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS
@ 2022-10-19 12:28 Abel Wu
  2022-10-19 12:28 ` [PATCH v6 1/4] sched/fair: Skip core update if task pending Abel Wu
                   ` (6 more replies)
  0 siblings, 7 replies; 18+ messages in thread
From: Abel Wu @ 2022-10-19 12:28 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Mel Gorman, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider
  Cc: Josh Don, Chen Yu, Tim Chen, K Prateek Nayak, Gautham R . Shenoy,
	Aubrey Li, Qais Yousef, Juri Lelli, Rik van Riel, Yicong Yang,
	Barry Song, linux-kernel, Abel Wu

This patchset tries to improve SIS scan efficiency by recording the
idle cpus of each LLC in a cpumask, which is then used as the target
cpuset of the domain scan. The cpus are recorded at CORE granule to
avoid tasks being stacked on the same core.
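The CORE-granule recording rule can be sketched as follows (a
hypothetical userspace model with an 8-cpu SMT2 LLC where cpu^1 is
the sibling; the in-kernel version operates on the per-LLC cpumask
with the cpumask helpers):

```c
#include <assert.h>

/* Both siblings of @cpu's core, modeling cpu_smt_mask() for SMT2. */
static unsigned long smt_mask(int cpu)
{
	return 3UL << (cpu & ~1);
}

/*
 * Record @cpu in the LLC idle mask only if no cpu of its core is
 * already present, so at most one cpu per core shows up in the mask.
 */
static void record_idle_cpu(unsigned long *icpus, int cpu)
{
	if (!(*icpus & smt_mask(cpu)))
		*icpus |= 1UL << cpu;
}
```

Because wakeup placement prefers cpus in this mask, each busy LLC
ends up offering at most one candidate per core, which is what keeps
tasks from stacking on the same core.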

v5 -> v6:
 - Rename SIS_FILTER to SIS_CORE since it can only be activated when
   SMT is enabled, and the new name better describes the behavior of
   CORE-granule update & load delivery.
 - Removed the limited scan for idle cores, since strategies such as
   limited or scaled depth are better discussed in a separate thread.
   The full scan for idle cores when the LLC is overloaded is kept,
   because SIS_CORE can greatly reduce the overhead of a full scan in
   that case.
 - Removed the sd_is_busy state, which indicated that an LLC is fully
   busy so the SIS domain scan can be safely skipped. I would prefer
   to leave this to SIS_UTIL.
 - The filter generation mechanism is replaced by in-place updates
   during domain scan to better deal with partial scan failures.
 - Collect Reviewed-bys from Tim Chen

v4 -> v5:
 - Add limited scan for idle cores when overloaded, suggested by Mel
 - Split out several patches since they are irrelevant to this scope
 - Add quick check on ttwu_pending before core update
 - Wrap the filter into SIS_FILTER feature, suggested by Chen Yu
 - Move the main filter logic to the idle path, because the newidle
   balance can bail out early if rq->avg_idle is small enough and
   lose chances to update the filter.

v3 -> v4:
 - Update filter in load_balance rather than in the tick
 - Now the filter contains unoccupied cpus rather than overloaded ones
 - Added mechanisms to deal with the false positive cases

v2 -> v3:
 - Removed sched-idle balance feature and focus on SIS
 - Take non-CFS tasks into consideration
 - Several fixes/improvement suggested by Josh Don

v1 -> v2:
 - Several optimizations on sched-idle balancing
 - Ignore asym topos in can_migrate_task
 - Add more benchmarks including SIS efficiency
 - Re-organize patch as suggested by Mel Gorman

Abel Wu (4):
  sched/fair: Skip core update if task pending
  sched/fair: Ignore SIS_UTIL when has_idle_core
  sched/fair: Introduce SIS_CORE
  sched/fair: Deal with SIS scan failures

 include/linux/sched/topology.h |  15 ++++
 kernel/sched/fair.c            | 122 +++++++++++++++++++++++++++++----
 kernel/sched/features.h        |   7 ++
 kernel/sched/sched.h           |   3 +
 kernel/sched/topology.c        |   8 ++-
 5 files changed, 141 insertions(+), 14 deletions(-)

-- 
2.37.3



* [PATCH v6 1/4] sched/fair: Skip core update if task pending
  2022-10-19 12:28 [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS Abel Wu
@ 2022-10-19 12:28 ` Abel Wu
  2022-10-19 12:28 ` [PATCH v6 2/4] sched/fair: Ignore SIS_UTIL when has_idle_core Abel Wu
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Abel Wu @ 2022-10-19 12:28 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Mel Gorman, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider
  Cc: Josh Don, Chen Yu, Tim Chen, K Prateek Nayak, Gautham R . Shenoy,
	Aubrey Li, Qais Yousef, Juri Lelli, Rik van Riel, Yicong Yang,
	Barry Song, linux-kernel, Abel Wu

The function __update_idle_core() assumes this cpu is idle, so it
only checks its siblings to decide whether the resident core is
idle, and updates the has_idle_cores hint if necessary. The problem
is that this cpu might no longer be idle at that moment, which makes
the hint misleading.

It's not practical to make this check everywhere in the idle path,
but checking just before the core update makes the has_idle_cores
hint more reliable at negligible cost.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5ffec4370602..e7f82fa92c5b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6294,6 +6294,9 @@ void __update_idle_core(struct rq *rq)
 	int core = cpu_of(rq);
 	int cpu;
 
+	if (rq->ttwu_pending)
+		return;
+
 	rcu_read_lock();
 	if (test_idle_cores(core))
 		goto unlock;
-- 
2.37.3



* [PATCH v6 2/4] sched/fair: Ignore SIS_UTIL when has_idle_core
  2022-10-19 12:28 [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS Abel Wu
  2022-10-19 12:28 ` [PATCH v6 1/4] sched/fair: Skip core update if task pending Abel Wu
@ 2022-10-19 12:28 ` Abel Wu
  2022-10-19 12:28 ` [PATCH v6 3/4] sched/fair: Introduce SIS_CORE Abel Wu
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Abel Wu @ 2022-10-19 12:28 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Mel Gorman, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider
  Cc: Josh Don, Chen Yu, Tim Chen, K Prateek Nayak, Gautham R . Shenoy,
	Aubrey Li, Qais Yousef, Juri Lelli, Rik van Riel, Yicong Yang,
	Barry Song, linux-kernel, Abel Wu

When SIS_UTIL is enabled, the SIS domain scan will be skipped if the
LLC is overloaded, even if the has_idle_core hint is true. Since idle
load balancing is triggered at tick boundaries, the idle cores can
stay cold for the whole tick period while some of the other cpus
might be overloaded.

Give it a chance to scan for idle cores if the hint implies a worthy
effort.
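The effect on the scan depth can be sketched as (a hypothetical
helper name for illustration; in select_idle_cpu() the SIS_UTIL
limit actually comes from sd_share->nr_idle_scan):

```c
#include <assert.h>
#include <limits.h>

/*
 * Before this patch, an overloaded LLC (depth limited down to 0 by
 * SIS_UTIL) aborted the scan even with has_idle_core set; after it,
 * the hint forces a full scan so a whole idle core can be found.
 */
static int sis_scan_depth(int nr_idle_scan, int has_idle_core)
{
	if (has_idle_core)
		return INT_MAX;		/* full scan to hunt for an idle core */
	return nr_idle_scan;		/* SIS_UTIL-limited partial scan */
}
```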

Benchmark
=========

All of the benchmarks are done inside a normal cpu cgroup in a clean
environment with cpu turbo disabled, and test machines are:

A) A dual-socket Intel Xeon(R) Platinum 8260 machine with SNC
disabled, so there are 2 NUMA nodes, each of which has 24C/48T and
shares a single LLC.

B) A dual-socket AMD EPYC 7Y83 64-Core machine with NPS1 enabled, so
there are 2 NUMA nodes, each of which has 64C/128T. Each NUMA node
contains several LLCs of 16 cpus each.

Based on tip sched/core fb04563d1cae (v5.19.0).

Results
=======

hackbench-process-pipes
 (A)			 vanilla		patched
Amean     1        0.2767 (   0.00%)      0.2540 (   8.19%)
Amean     4        0.6080 (   0.00%)      0.6220 (  -2.30%)
Amean     7        0.7923 (   0.00%)      0.8020 (  -1.22%)
Amean     12       1.3917 (   0.00%)      1.1823 (  15.04%)
Amean     21       3.6747 (   0.00%)      2.7717 (  24.57%)
Amean     30       6.7070 (   0.00%)      5.1200 *  23.66%*
Amean     48       9.3537 (   0.00%)      8.5890 *   8.18%*
Amean     79      11.6627 (   0.00%)     11.2580 (   3.47%)
Amean     110     13.4473 (   0.00%)     13.1283 (   2.37%)
Amean     141     16.4747 (   0.00%)     15.5967 *   5.33%*
Amean     172     19.0000 (   0.00%)     18.1153 *   4.66%*
Amean     203     21.4200 (   0.00%)     21.1340 (   1.34%)
Amean     234     24.2250 (   0.00%)     23.8227 (   1.66%)
Amean     265     27.2400 (   0.00%)     26.8293 (   1.51%)
Amean     296     30.6937 (   0.00%)     29.5800 *   3.63%*
 (B)
Amean     1        0.3543 (   0.00%)      0.3650 (  -3.01%)
Amean     4        0.4623 (   0.00%)      0.4837 (  -4.61%)
Amean     7        0.5117 (   0.00%)      0.4997 (   2.35%)
Amean     12       0.5707 (   0.00%)      0.5863 (  -2.75%)
Amean     21       0.9717 (   0.00%)      0.8930 *   8.10%*
Amean     30       1.4423 (   0.00%)      1.2530 (  13.13%)
Amean     48       2.3520 (   0.00%)      1.9743 *  16.06%*
Amean     79       5.7193 (   0.00%)      3.4933 *  38.92%*
Amean     110      6.9893 (   0.00%)      5.5963 *  19.93%*
Amean     141      9.1103 (   0.00%)      7.6550 (  15.97%)
Amean     172     10.2490 (   0.00%)      8.8323 *  13.82%*
Amean     203     11.3727 (   0.00%)     10.8683 (   4.43%)
Amean     234     12.7627 (   0.00%)     11.8683 (   7.01%)
Amean     265     13.8947 (   0.00%)     13.4717 (   3.04%)
Amean     296     14.1093 (   0.00%)     13.8130 (   2.10%)

The results can be approximately divided into 3 sections:
 - busy, e.g. <12 groups on A and <21 groups on B
 - overloaded, e.g. 12~48 groups on A and 21~172 groups on B
 - saturated, the rest

For the busy part the result is neutral with slight wins or losses.
This is probably because idle cpus are still easy to find, so the
effort paid for locating an idle core brings limited benefit, which
is easily negated by the cost of a full scan.

For the overloaded but not yet saturated part, a great improvement
can be seen: the cpu resources are better exploited by more actively
putting idle cores to work. But once all cpus are totally saturated,
scanning for idle cores doesn't help much.

One concern with the full scan is that its cost grows with larger
LLCs, but the test results still look positive. One possible reason
is the low SIS success rate (<2%), so the effort paid does trade for
efficiency.

tbench4 Throughput
 (A)			 vanilla		patched
Hmean     1        275.61 (   0.00%)      280.53 *   1.78%*
Hmean     2        541.28 (   0.00%)      561.94 *   3.82%*
Hmean     4       1102.62 (   0.00%)     1109.14 *   0.59%*
Hmean     8       2149.58 (   0.00%)     2229.39 *   3.71%*
Hmean     16      4305.40 (   0.00%)     4383.06 *   1.80%*
Hmean     32      7088.36 (   0.00%)     7124.14 *   0.50%*
Hmean     64      8609.16 (   0.00%)     8815.41 *   2.40%*
Hmean     128    19304.92 (   0.00%)    19519.35 *   1.11%*
Hmean     256    19147.04 (   0.00%)    19392.24 *   1.28%*
Hmean     384    18970.86 (   0.00%)    19201.07 *   1.21%*
 (B)
Hmean     1         519.62 (   0.00%)      515.98 *  -0.70%*
Hmean     2        1042.92 (   0.00%)     1031.54 *  -1.09%*
Hmean     4        1959.10 (   0.00%)     1953.44 *  -0.29%*
Hmean     8        3842.82 (   0.00%)     3622.52 *  -5.73%*
Hmean     16       6768.50 (   0.00%)     6545.82 *  -3.29%*
Hmean     32      12589.50 (   0.00%)    13697.73 *   8.80%*
Hmean     64      24797.23 (   0.00%)    25589.59 *   3.20%*
Hmean     128     38036.66 (   0.00%)    35667.64 *  -6.23%*
Hmean     256     65069.93 (   0.00%)    65215.85 *   0.22%*
Hmean     512     61147.99 (   0.00%)    66035.57 *   7.99%*
Hmean     1024    48542.73 (   0.00%)    53391.64 *   9.99%*

The tbench4 test has a ~40% success rate on the used target, prev or
recent cpus, and a ~45% total success rate. The core scan is also
not very frequent, so the benefit this patch brings is limited,
though some gains can still be seen.

netperf

netperf has an almost 100% success rate on the used target, prev or
recent cpus, so the domain scan is generally not performed and is
not affected by this patch.

Conclusion
==========

A full scan for idle cores is generally good for making better use
of the cpu resources.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e7f82fa92c5b..7b668e16812e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6436,7 +6436,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		time = cpu_clock(this);
 	}
 
-	if (sched_feat(SIS_UTIL)) {
+	if (sched_feat(SIS_UTIL) && !has_idle_core) {
 		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
 		if (sd_share) {
 			/* because !--nr is the condition to stop scan */
-- 
2.37.3



* [PATCH v6 3/4] sched/fair: Introduce SIS_CORE
  2022-10-19 12:28 [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS Abel Wu
  2022-10-19 12:28 ` [PATCH v6 1/4] sched/fair: Skip core update if task pending Abel Wu
  2022-10-19 12:28 ` [PATCH v6 2/4] sched/fair: Ignore SIS_UTIL when has_idle_core Abel Wu
@ 2022-10-19 12:28 ` Abel Wu
  2022-10-21  4:03   ` Chen Yu
  2022-10-19 12:28 ` [PATCH v6 4/4] sched/fair: Deal with SIS scan failures Abel Wu
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Abel Wu @ 2022-10-19 12:28 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Mel Gorman, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider
  Cc: Josh Don, Chen Yu, Tim Chen, K Prateek Nayak, Gautham R . Shenoy,
	Aubrey Li, Qais Yousef, Juri Lelli, Rik van Riel, Yicong Yang,
	Barry Song, linux-kernel, Abel Wu

The wakeup fastpath for fair tasks, select_idle_sibling() aka SIS,
plays an important role in maximizing the usage of cpu resources and
can greatly affect overall performance of the system.

SIS tries to find an idle cpu inside the target LLC on which to
place the woken-up task. Cache-hot cpus are preferred, but if none
of them is idle, a domain scan is fired to check the other cpus of
that LLC. The scan currently proceeds in linear fashion, which works
well under light pressure since plenty of idle cpus are available.

But things change. LLCs are getting bigger in modern and future
machines, and players like cloud service providers are continuously
trying to use all kinds of resources more efficiently to reduce TCO.
In either case, locating an idle cpu is no longer as easy as before,
so the linear scan doesn't fit well in such cases.

Features like SIS_{UTIL,PROP} already exist to deal with the
scalability issue by limiting the scan depth, but it would be better
if we could also improve how the scan proceeds. This is exactly what
SIS_CORE is born for.

When SIS_CORE is enabled, a cpumask containing the idle cpus of each
LLC is maintained and stored in the LLC shared domain. The idle cpus
are recorded at CORE granule, so in theory only one idle cpu of a
core can be set in the mask. The ideas behind this:

 - Recording idle cpus narrows down the SIS scan, so we can avoid
   touching runqueues that must be busy; as we all know, runqueues
   are among the hottest data in the system. And because all the
   possibly idle cpus are in the mask, the has_idle_core hint still
   works.

 - The rule of CORE-granule update helps spread load out to
   different cores, making better use of cpu capacity.

A major concern is the accuracy of the idle cpumask. A cpu present
in the mask might not be idle any more; such a cpu is called a false
positive. False positives can negate much of the benefit this
feature brings. The strategy against them is introduced in the next
patch.

Another concern is the overhead of accessing the LLC-shared cpumask,
which can be more severe in large LLCs. But perf stat on the cache
miss rate during hackbench doesn't show an obvious difference.

This patch records idle cpus at CORE granule for each LLC when they
go idle, with the cpumask stored in the LLC shared domain. The false
positive cpus are cleared when the SIS domain scan fails.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 include/linux/sched/topology.h | 15 ++++++++++
 kernel/sched/fair.c            | 51 +++++++++++++++++++++++++++++++---
 kernel/sched/features.h        |  7 +++++
 kernel/sched/topology.c        |  8 +++++-
 4 files changed, 76 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 816df6cc444e..ac2162f33ada 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -82,6 +82,16 @@ struct sched_domain_shared {
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
 	int		nr_idle_scan;
+
+	/*
+	 * Used by sched feature SIS_CORE to record idle cpus at core
+	 * granule to improve efficiency of SIS domain scan.
+	 *
+	 * NOTE: this field is variable length. (Allocated dynamically
+	 * by attaching extra space to the end of the structure,
+	 * depending on how many CPUs the kernel has booted up with)
+	 */
+	unsigned long	icpus[];
 };
 
 struct sched_domain {
@@ -167,6 +177,11 @@ static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
 	return to_cpumask(sd->span);
 }
 
+static inline struct cpumask *sched_domain_icpus(struct sched_domain *sd)
+{
+	return to_cpumask(sd->shared->icpus);
+}
+
 extern void partition_sched_domains_locked(int ndoms_new,
 					   cpumask_var_t doms_new[],
 					   struct sched_domain_attr *dattr_new);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7b668e16812e..3aa699e9d4af 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6282,6 +6282,33 @@ static inline bool test_idle_cores(int cpu)
 	return false;
 }
 
+/*
+ * To honor the rule of CORE granule update, set this cpu to the LLC idle
+ * cpumask only if there is no cpu of this core showed up in the cpumask.
+ */
+static void update_idle_cpu(int cpu)
+{
+	struct sched_domain_shared *sds;
+
+	if (!sched_feat(SIS_CORE))
+		return;
+
+	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (sds) {
+		struct cpumask *icpus = to_cpumask(sds->icpus);
+
+		/*
+		 * This is racy against clearing in select_idle_cpu(),
+		 * and can lead to idle cpus miss the chance to be set to
+		 * the idle cpumask, thus the idle cpus are temporarily
+		 * out of reach in SIS domain scan. But it should be rare
+		 * and we still have ILB to kick them working.
+		 */
+		if (!cpumask_intersects(cpu_smt_mask(cpu), icpus))
+			cpumask_set_cpu(cpu, icpus);
+	}
+}
+
 /*
  * Scans the local SMT mask to see if the entire core is idle, and records this
  * information in sd_llc_shared->has_idle_cores.
@@ -6298,6 +6325,7 @@ void __update_idle_core(struct rq *rq)
 		return;
 
 	rcu_read_lock();
+	update_idle_cpu(core);
 	if (test_idle_cores(core))
 		goto unlock;
 
@@ -6343,7 +6371,13 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
 	if (idle)
 		return core;
 
-	cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
+	/*
+	 * It is unlikely that more than one cpu of a core show up
+	 * in the @cpus if SIS_CORE enabled.
+	 */
+	if (!sched_feat(SIS_CORE))
+		cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
+
 	return -1;
 }
 
@@ -6394,7 +6428,7 @@ static inline int select_idle_smt(struct task_struct *p, int target)
  */
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
 {
-	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
+	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask), *icpus = NULL;
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
 	struct sched_domain_shared *sd_share;
 	struct rq *this_rq = this_rq();
@@ -6402,8 +6436,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	struct sched_domain *this_sd = NULL;
 	u64 time = 0;
 
-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
-
 	if (sched_feat(SIS_PROP) && !has_idle_core) {
 		u64 avg_cost, avg_idle, span_avg;
 		unsigned long now = jiffies;
@@ -6447,6 +6479,11 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		}
 	}
 
+	if (sched_feat(SIS_CORE) && sched_smt_active())
+		icpus = sched_domain_icpus(sd);
+
+	cpumask_and(cpus, icpus ? icpus : sched_domain_span(sd), p->cpus_ptr);
+
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
@@ -6465,6 +6502,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	if (has_idle_core)
 		set_idle_cores(target, false);
 
+	if (icpus && idle_cpu == -1) {
+		/* Reset the idle cpu mask if a full scan fails */
+		if (nr > 0)
+			cpumask_clear(icpus);
+	}
+
 	if (sched_feat(SIS_PROP) && this_sd && !has_idle_core) {
 		time = cpu_clock(this) - time;
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..bf3cae94caa6 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -63,6 +63,13 @@ SCHED_FEAT(TTWU_QUEUE, true)
 SCHED_FEAT(SIS_PROP, false)
 SCHED_FEAT(SIS_UTIL, true)
 
+/*
+ * Record idle cpus at core granule for each LLC to improve efficiency of
+ * SIS domain scan. Combine with the above features of limiting scan depth
+ * to better deal with the scalability issue.
+ */
+SCHED_FEAT(SIS_CORE, true)
+
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
  * in a single rq->lock section. Default disabled because the
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..a2bb0091c10d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1641,6 +1641,12 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
 		atomic_inc(&sd->shared->ref);
 		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+
+		/*
+		 * This will temporarily break the rule of CORE granule,
+		 * but will be fixed after SIS scan failures.
+		 */
+		cpumask_copy(sched_domain_icpus(sd), sd_span);
 	}
 
 	sd->private = sdd;
@@ -2106,7 +2112,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 
 			*per_cpu_ptr(sdd->sd, j) = sd;
 
-			sds = kzalloc_node(sizeof(struct sched_domain_shared),
+			sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
 					GFP_KERNEL, cpu_to_node(j));
 			if (!sds)
 				return -ENOMEM;
-- 
2.37.3



* [PATCH v6 4/4] sched/fair: Deal with SIS scan failures
  2022-10-19 12:28 [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS Abel Wu
                   ` (2 preceding siblings ...)
  2022-10-19 12:28 ` [PATCH v6 3/4] sched/fair: Introduce SIS_CORE Abel Wu
@ 2022-10-19 12:28 ` Abel Wu
  2022-11-04  7:29 ` [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS Abel Wu
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Abel Wu @ 2022-10-19 12:28 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Mel Gorman, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider
  Cc: Josh Don, Chen Yu, Tim Chen, K Prateek Nayak, Gautham R . Shenoy,
	Aubrey Li, Qais Yousef, Juri Lelli, Rik van Riel, Yicong Yang,
	Barry Song, linux-kernel, Abel Wu

When SIS_CORE is active, idle cpus are recorded in a cpumask at CORE
granule, and the SIS domain scan tries to locate an idle cpu in this
cpumask, so the quality of the cpumask matters. There are generally
two types of cpus that need to be dealt with:

 - False positives: cpus that are present in the idle cpumask but
   are actually busy. This can be caused by tasks being enqueued on
   these cpus without the cpus being removed from the cpumask
   immediately. To solve this, the cpumask needs to be shrunk when
   poor quality of the mask is observed.

 - True negatives: cpus that are idle but do not show up in the
   idle cpumask. This is due to the rule of CORE-granule update:
   these idle cpus are ignored when their SMT siblings are already
   present in the idle cpumask.

Given the nature of these two types of cpus and some heuristics, the
strategies against SIS scan failures are as follows:

 - It can be predicted that if a scan fails, the next scan from the
   same cpu will probably fail too, so mark these cpus fail-prone.
   A scan from a fail-prone cpu is the time to shrink the cpumask.

 - All the true negative cpus are SMT siblings of the false positive
   cpus, and naturally serve as their fallbacks. So a fail-prone
   scan should also check the SMT siblings to see if any true
   negative cpu exists.

The number of false positive cpus removed during one scan is not
explicitly constrained, but it is implicitly constrained by the load
of the LLC: if the LLC is under heavy pressure, both the weight of
the idle cpumask and the scan depth are reduced, and so is the
number of cpus removed.

To sum up, this patch marks cpus fail-prone if scans from them
failed last time, so that the next scan from them also checks their
SMT siblings for available true negative cpus. A false positive cpu
is removed during a fail-prone scan if its core is fully busy.
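The fail-prone bookkeeping can be sketched as (a simplified
userspace model with hypothetical names; the kernel keeps the flag
in rq->sis_scan_sibling and consumes it atomically with cmpxchg() so
at most one scan pays the sibling check):

```c
#include <assert.h>

#define NR_CPUS_MODEL 8

/* Per-cpu flag: the last domain scan from this cpu failed. */
static int fail_prone[NR_CPUS_MODEL];

/* A failed domain scan marks the waking cpu fail-prone. */
static void sis_scan_failed(int cpu)
{
	fail_prone[cpu] = 1;
}

/*
 * Test-and-clear: the next scan from a fail-prone cpu also checks
 * SMT siblings of recorded cpus, hunting for true negatives and
 * shrinking the mask where a core turns out to be fully busy.
 */
static int sis_should_scan_sibling(int cpu)
{
	int old = fail_prone[cpu];

	fail_prone[cpu] = 0;
	return old;
}
```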

Benchmark
=========

All of the benchmarks are done inside a normal cpu cgroup in a clean
environment with cpu turbo disabled, and test machines are:

A) A dual-socket Intel Xeon(R) Platinum 8260 machine with SNC
disabled, so there are 2 NUMA nodes, each of which has 24C/48T and
shares a single LLC.

B) A dual-socket AMD EPYC 7Y83 64-Core machine with NPS1 enabled, so
there are 2 NUMA nodes, each of which has 64C/128T. Each NUMA node
contains several LLCs of 16 cpus each.

The 'baseline' is tip sched/core fb04563d1cae (v5.19.0) with the
first 2 patches of this series applied, while 'sis_core' includes
the whole patchset.

Results
=======

hackbench-process-pipes
 (A)			baseline		sis_core
Amean     1        0.2540 (   0.00%)      0.2533 (   0.26%)
Amean     4        0.6220 (   0.00%)      0.5993 (   3.64%)
Amean     7        0.8020 (   0.00%)      0.7663 *   4.45%*
Amean     12       1.1823 (   0.00%)      1.1037 *   6.65%*
Amean     21       2.7717 (   0.00%)      2.2203 *  19.89%*
Amean     30       5.1200 (   0.00%)      3.7133 *  27.47%*
Amean     48       8.5890 (   0.00%)      7.0863 *  17.50%*
Amean     79      11.2580 (   0.00%)     10.3717 *   7.87%*
Amean     110     13.1283 (   0.00%)     12.4133 *   5.45%*
Amean     141     15.5967 (   0.00%)     14.4883 *   7.11%*
Amean     172     18.1153 (   0.00%)     17.2557 *   4.75%*
Amean     203     21.1340 (   0.00%)     20.2807 (   4.04%)
Amean     234     23.8227 (   0.00%)     22.8510 (   4.08%)
Amean     265     26.8293 (   0.00%)     25.7367 *   4.07%*
Amean     296     29.5800 (   0.00%)     28.7847 (   2.69%)
 (B)
Amean     1        0.3650 (   0.00%)      0.3510 (   3.84%)
Amean     4        0.4837 (   0.00%)      0.4753 (   1.72%)
Amean     7        0.4997 (   0.00%)      0.5073 (  -1.53%)
Amean     12       0.5863 (   0.00%)      0.5807 (   0.97%)
Amean     21       0.8930 (   0.00%)      0.8953 (  -0.26%)
Amean     30       1.2530 (   0.00%)      1.2633 (  -0.82%)
Amean     48       1.9743 (   0.00%)      1.9023 (   3.65%)
Amean     79       3.4933 (   0.00%)      3.2820 (   6.05%)
Amean     110      5.5963 (   0.00%)      5.3923 (   3.65%)
Amean     141      7.6550 (   0.00%)      6.8633 (  10.34%)
Amean     172      8.8323 (   0.00%)      8.2973 *   6.06%*
Amean     203     10.8683 (   0.00%)      9.5170 *  12.43%*
Amean     234     11.8683 (   0.00%)     10.6217 (  10.50%)
Amean     265     13.4717 (   0.00%)     11.9357 *  11.40%*
Amean     296     13.8130 (   0.00%)     12.7430 *   7.75%*

The results on machine A can be approximately divided into 3
sections:
 - busy, e.g. <21 groups
 - overloaded, e.g. 21~48 groups
 - saturated, the rest

The two cases of 296 groups on B and 110 groups on A have the same
number of tasks per cpu, as do 30 groups on B and 12 groups on A. So
the sections on A also apply to B, except that B only has the first
two. This implies that the feature behaves consistently on LLCs of
different sizes.

For the busy part the result is neutral with slight wins or losses.
It's not hard to find an idle cpu in such a case, so SIS_CORE
doesn't outperform the linear scan: the cpumask is maintained at a
cost that negates the slight benefit.

Once load increases, SIS_CORE helps improve throughput quite a lot
by squeezing out the hidden capacity of the cpus. Even under extreme
load pressure, when cpu capacity is almost fully utilized, there is
still some capacity left to exploit.

Although such heavy load is unlikely in the real world, long-running
workloads like training jobs can also keep the cpus busy and can
benefit a lot from this feature.

netperf
 (A-udp)		    baseline		   sis_core
Hmean     send-64         214.34 (   0.00%)      210.79 *  -1.65%*
Hmean     send-128        427.90 (   0.00%)      417.96 *  -2.32%*
Hmean     send-256        839.65 (   0.00%)      823.78 *  -1.89%*
Hmean     send-1024      3207.45 (   0.00%)     3167.96 *  -1.23%*
Hmean     send-2048      6097.24 (   0.00%)     6089.01 (  -0.13%)
Hmean     send-3312      9350.83 (   0.00%)     9299.09 (  -0.55%)
Hmean     send-4096     11368.25 (   0.00%)    11186.44 *  -1.60%*
Hmean     send-8192     18273.21 (   0.00%)    18103.81 (  -0.93%)
Hmean     send-16384    28207.81 (   0.00%)    28259.82 (   0.18%)
 (B-udp)
Hmean     send-64         249.97 (   0.00%)      256.99 *   2.81%*
Hmean     send-128        500.68 (   0.00%)      514.44 *   2.75%*
Hmean     send-256        991.59 (   0.00%)     1017.38 *   2.60%*
Hmean     send-1024      3913.02 (   0.00%)     3982.68 *   1.78%*
Hmean     send-2048      7627.99 (   0.00%)     7590.30 (  -0.49%)
Hmean     send-3312     11907.07 (   0.00%)    12114.03 *   1.74%*
Hmean     send-4096     14300.09 (   0.00%)    14753.34 *   3.17%*
Hmean     send-8192     24576.21 (   0.00%)    25431.42 *   3.48%*
Hmean     send-16384    42105.89 (   0.00%)    41813.30 (  -0.69%)
 (A-tcp)
Hmean     64        1191.91 (   0.00%)     1220.47 *   2.40%*
Hmean     128       2318.60 (   0.00%)     2354.56 (   1.55%)
Hmean     256       4267.41 (   0.00%)     4226.72 (  -0.95%)
Hmean     1024     13190.66 (   0.00%)    13065.91 (  -0.95%)
Hmean     2048     20466.22 (   0.00%)    20704.66 (   1.17%)
Hmean     3312     24363.57 (   0.00%)    24613.99 *   1.03%*
Hmean     4096     26144.44 (   0.00%)    26204.24 (   0.23%)
Hmean     8192     30387.77 (   0.00%)    30703.65 *   1.04%*
Hmean     16384    34942.71 (   0.00%)    34205.44 *  -2.11%*
 (B-tcp)
Hmean     64        1971.18 (   0.00%)     2120.61 *   7.58%*
Hmean     128       3752.96 (   0.00%)     3995.68 *   6.47%*
Hmean     256       6861.58 (   0.00%)     7342.93 *   7.02%*
Hmean     1024     21966.06 (   0.00%)    23725.30 *   8.01%*
Hmean     2048     33949.66 (   0.00%)    35620.67 *   4.92%*
Hmean     3312     40681.75 (   0.00%)    41543.26 *   2.12%*
Hmean     4096     44309.70 (   0.00%)    45390.03 *   2.44%*
Hmean     8192     50909.35 (   0.00%)    52157.16 *   2.45%*
Hmean     16384    57198.37 (   0.00%)    57686.96 (   0.85%)

netperf has an almost 100% success rate on the used target, prev or
recent cpus, so the domain scan is generally not performed. This
test measures the overhead of maintaining the idle cpumask, and the
result is neutral, suggesting the overhead is acceptable.

tbench4 Throughput
 (A)			baseline		sis_core
Hmean     1        280.53 (   0.00%)      289.44 *   3.17%*
Hmean     2        561.94 (   0.00%)      571.46 *   1.69%*
Hmean     4       1109.14 (   0.00%)     1129.88 *   1.87%*
Hmean     8       2229.39 (   0.00%)     2266.52 *   1.67%*
Hmean     16      4383.06 (   0.00%)     4473.48 *   2.06%*
Hmean     32      7124.14 (   0.00%)     7223.83 *   1.40%*
Hmean     64      8815.41 (   0.00%)     8770.21 *  -0.51%*
Hmean     128    19519.35 (   0.00%)    20337.24 *   4.19%*
Hmean     256    19392.24 (   0.00%)    20052.98 *   3.41%*
Hmean     384    19201.07 (   0.00%)    19563.63 *   1.89%*
 (B)
Hmean     1         515.98 (   0.00%)      499.91 *  -3.12%*
Hmean     2        1031.54 (   0.00%)     1044.38 *   1.24%*
Hmean     4        1953.44 (   0.00%)     1959.30 *   0.30%*
Hmean     8        3622.52 (   0.00%)     3773.08 *   4.16%*
Hmean     16       6545.82 (   0.00%)     6814.46 *   4.10%*
Hmean     32      13697.73 (   0.00%)    13078.74 *  -4.52%*
Hmean     64      25589.59 (   0.00%)    24576.52 *  -3.96%*
Hmean     128     35667.64 (   0.00%)    37590.20 *   5.39%*
Hmean     256     65215.85 (   0.00%)    64921.74 *  -0.45%*
Hmean     512     66035.57 (   0.00%)    63812.48 *  -3.37%*
Hmean     1024    53391.64 (   0.00%)    62356.50 *  16.79%*

Like netperf, the tbench4 benchmark also has a high success rate,
~39%, on the cache-hot cpus, and the overall SIS success rate is
~45%. This benchmark runs a fast-idling workload that makes the SIS
idle cpumask quite unstable, which is the worst case for this
feature. Even so, the results still show a small but real improvement.

Conclusion
==========

The overhead of maintaining the idle cpumasks seems acceptable, and
this cost buys a considerable throughput improvement once the LLC
becomes busier and fewer idle cpus are available.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 kernel/sched/fair.c  | 74 +++++++++++++++++++++++++++++++++++++-------
 kernel/sched/sched.h |  3 ++
 2 files changed, 65 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3aa699e9d4af..d06d59ac2f05 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6309,6 +6309,16 @@ static void update_idle_cpu(int cpu)
 	}
 }
 
+static inline bool should_scan_sibling(int cpu)
+{
+	return cmpxchg(&cpu_rq(cpu)->sis_scan_sibling, 1, 0);
+}
+
+static inline void set_scan_sibling(int cpu)
+{
+	WRITE_ONCE(cpu_rq(cpu)->sis_scan_sibling, 1);
+}
+
 /*
  * Scans the local SMT mask to see if the entire core is idle, and records this
  * information in sd_llc_shared->has_idle_cores.
@@ -6384,17 +6394,20 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
 /*
  * Scan the local SMT mask for idle CPUs.
  */
-static int select_idle_smt(struct task_struct *p, int target)
+static int select_idle_smt(struct task_struct *p, int core, struct cpumask *cpus, int exclude)
 {
 	int cpu;
 
-	for_each_cpu_and(cpu, cpu_smt_mask(target), p->cpus_ptr) {
-		if (cpu == target)
+	for_each_cpu_and(cpu, cpu_smt_mask(core), p->cpus_ptr) {
+		if (exclude && cpu == core)
 			continue;
 		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
 			return cpu;
 	}
 
+	if (cpus)
+		cpumask_clear_cpu(core, cpus);
+
 	return -1;
 }
 
@@ -6409,12 +6422,21 @@ static inline bool test_idle_cores(int cpu)
 	return false;
 }
 
+static inline bool should_scan_sibling(int cpu)
+{
+	return false;
+}
+
+static inline void set_scan_sibling(int cpu)
+{
+}
+
 static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
 {
 	return __select_idle_cpu(core, p);
 }
 
-static inline int select_idle_smt(struct task_struct *p, int target)
+static inline int select_idle_smt(struct task_struct *p, int core, struct cpumask *cpus, int exclude)
 {
 	return -1;
 }
@@ -6434,6 +6456,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	struct rq *this_rq = this_rq();
 	int this = smp_processor_id();
 	struct sched_domain *this_sd = NULL;
+	bool scan_sibling = false;
 	u64 time = 0;
 
 	if (sched_feat(SIS_PROP) && !has_idle_core) {
@@ -6479,20 +6502,31 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		}
 	}
 
-	if (sched_feat(SIS_CORE) && sched_smt_active())
+	if (sched_feat(SIS_CORE) && sched_smt_active()) {
+		/*
+		 * Due to the nature of idle core scanning, has_idle_core
+		 * hint should also consume the scan_sibling flag even
+		 * though it doesn't use the flag when scanning.
+		 */
+		scan_sibling = should_scan_sibling(target);
 		icpus = sched_domain_icpus(sd);
+	}
 
 	cpumask_and(cpus, icpus ? icpus : sched_domain_span(sd), p->cpus_ptr);
 
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
+		if (!--nr)
+			break;
+
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
 			if ((unsigned int)i < nr_cpumask_bits)
 				return i;
-
+		} else if (scan_sibling) {
+			idle_cpu = select_idle_smt(p, cpu, icpus, 0);
+			if ((unsigned int)idle_cpu < nr_cpumask_bits)
+				break;
 		} else {
-			if (!--nr)
-				return -1;
 			idle_cpu = __select_idle_cpu(cpu, p);
 			if ((unsigned int)idle_cpu < nr_cpumask_bits)
 				break;
@@ -6503,9 +6537,25 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		set_idle_cores(target, false);
 
 	if (icpus && idle_cpu == -1) {
-		/* Reset the idle cpu mask if a full scan fails */
-		if (nr > 0)
-			cpumask_clear(icpus);
+		if (nr > 0 && (has_idle_core || scan_sibling)) {
+			/*
+			 * Reset the idle cpu mask if a full scan fails,
+			 * but ignore the !has_idle_core case which should
+			 * have already been fixed during scan.
+			 */
+			if (has_idle_core)
+				cpumask_clear(icpus);
+		} else {
+			/*
+			 * As for partial scan failures, it will probably
+			 * fail again next time scanning from the same cpu.
+			 * Due to the SIS_CORE rule of CORE granule update,
+			 * some idle cpus can be missed in the mask. So it
+			 * would be reasonable to scan SMT siblings as well
+			 * if the scan is fail-prone.
+			 */
+			set_scan_sibling(target);
+		}
 	}
 
 	if (sched_feat(SIS_PROP) && this_sd && !has_idle_core) {
@@ -6657,7 +6707,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 		has_idle_core = test_idle_cores(target);
 
 		if (!has_idle_core && cpus_share_cache(prev, target)) {
-			i = select_idle_smt(p, prev);
+			i = select_idle_smt(p, prev, NULL, 1);
 			if ((unsigned int)i < nr_cpumask_bits)
 				return i;
 		}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1fc198be1ffd..c7f8ed5021e6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -971,6 +971,9 @@ struct rq {
 
 #ifdef CONFIG_SMP
 	unsigned int		ttwu_pending;
+#ifdef CONFIG_SCHED_SMT
+	int			sis_scan_sibling;
+#endif
 #endif
 	u64			nr_switches;
 
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v6 3/4] sched/fair: Introduce SIS_CORE
  2022-10-19 12:28 ` [PATCH v6 3/4] sched/fair: Introduce SIS_CORE Abel Wu
@ 2022-10-21  4:03   ` Chen Yu
  2022-10-21  4:30     ` Abel Wu
  0 siblings, 1 reply; 18+ messages in thread
From: Chen Yu @ 2022-10-21  4:03 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, Ingo Molnar, Mel Gorman, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Josh Don, Tim Chen,
	K Prateek Nayak, Gautham R . Shenoy, Aubrey Li, Qais Yousef,
	Juri Lelli, Rik van Riel, Yicong Yang, Barry Song, linux-kernel

On 2022-10-19 at 20:28:58 +0800, Abel Wu wrote:
[cut]
> A major concern is the accuracy of the idle cpumask. A cpu present
> in the mask might not be idle any more, which is called the false
> positive cpu. Such cpus will negate lots of benefit this feature
> brings. The strategy against the false positives will be introduced
> in next patch.
> 
I was thinking that, since patch [3/4] needs [4/4] to fix the false
positives, maybe SIS_CORE could be disabled by default in 3/4 and only
enabled in 4/4? That might facilitate git bisect in case any
regression needs to be checked.
[cut]
> + * To honor the rule of CORE granule update, set this cpu to the LLC idle
> + * cpumask only if there is no cpu of this core showed up in the cpumask.
> + */
> +static void update_idle_cpu(int cpu)
> +{
> +	struct sched_domain_shared *sds;
> +
> +	if (!sched_feat(SIS_CORE))
> +		return;
> +
> +	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> +	if (sds) {
> +		struct cpumask *icpus = to_cpumask(sds->icpus);
> +
> +		/*
> +		 * This is racy against clearing in select_idle_cpu(),
> +		 * and can lead to idle cpus miss the chance to be set to
> +		 * the idle cpumask, thus the idle cpus are temporarily
> +		 * out of reach in SIS domain scan. But it should be rare
> +		 * and we still have ILB to kick them working.
> +		 */
> +		if (!cpumask_intersects(cpu_smt_mask(cpu), icpus))
> +			cpumask_set_cpu(cpu, icpus);
Maybe I am missing something: here we only set one CPU in icpus, but
when we reach update_idle_cpu(), all SMT siblings of 'cpu' are idle.
Is this intended by the 'CORE granule update' rule?


thanks,
Chenyu

* Re: [PATCH v6 3/4] sched/fair: Introduce SIS_CORE
  2022-10-21  4:03   ` Chen Yu
@ 2022-10-21  4:30     ` Abel Wu
  2022-10-21  4:34       ` Chen Yu
  0 siblings, 1 reply; 18+ messages in thread
From: Abel Wu @ 2022-10-21  4:30 UTC (permalink / raw)
  To: Chen Yu
  Cc: Peter Zijlstra, Ingo Molnar, Mel Gorman, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Josh Don, Tim Chen,
	K Prateek Nayak, Gautham R . Shenoy, Aubrey Li, Qais Yousef,
	Juri Lelli, Rik van Riel, Yicong Yang, Barry Song, linux-kernel

Hi Chen, thanks for your review!

On 10/21/22 12:03 PM, Chen Yu wrote:
> On 2022-10-19 at 20:28:58 +0800, Abel Wu wrote:
> [cut]
>> A major concern is the accuracy of the idle cpumask. A cpu present
>> in the mask might not be idle any more, which is called the false
>> positive cpu. Such cpus will negate lots of benefit this feature
>> brings. The strategy against the false positives will be introduced
>> in next patch.
>>
> I was thinking that, if patch[3/4] needs [4/4] to fix the false positives,
> maybe SIS_CORE could be disabled by default in 3/4 but enabled
> in 4/4? So this might facilicate git bisect in case of any regression
> check?

Agreed. Will fix in next version.

> [cut]
>> + * To honor the rule of CORE granule update, set this cpu to the LLC idle
>> + * cpumask only if there is no cpu of this core showed up in the cpumask.
>> + */
>> +static void update_idle_cpu(int cpu)
>> +{
>> +	struct sched_domain_shared *sds;
>> +
>> +	if (!sched_feat(SIS_CORE))
>> +		return;
>> +
>> +	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
>> +	if (sds) {
>> +		struct cpumask *icpus = to_cpumask(sds->icpus);
>> +
>> +		/*
>> +		 * This is racy against clearing in select_idle_cpu(),
>> +		 * and can lead to idle cpus miss the chance to be set to
>> +		 * the idle cpumask, thus the idle cpus are temporarily
>> +		 * out of reach in SIS domain scan. But it should be rare
>> +		 * and we still have ILB to kick them working.
>> +		 */
>> +		if (!cpumask_intersects(cpu_smt_mask(cpu), icpus))
>> +			cpumask_set_cpu(cpu, icpus);
> Maybe I miss something, here we only set one CPU in the icpus, but
> when we reach update_idle_cpu(), all SMT siblings of 'cpu' are idle,
> is this intended for 'CORE granule update'?

__update_idle_core() is called by every cpu that is about to go idle
to update has_idle_core if necessary, and update_idle_cpu() is called
before that check.

Thanks,
Abel


* Re: [PATCH v6 3/4] sched/fair: Introduce SIS_CORE
  2022-10-21  4:30     ` Abel Wu
@ 2022-10-21  4:34       ` Chen Yu
  2022-10-21  9:35         ` Abel Wu
  0 siblings, 1 reply; 18+ messages in thread
From: Chen Yu @ 2022-10-21  4:34 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, Ingo Molnar, Mel Gorman, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Josh Don, Tim Chen,
	K Prateek Nayak, Gautham R . Shenoy, Aubrey Li, Qais Yousef,
	Juri Lelli, Rik van Riel, Yicong Yang, Barry Song, linux-kernel

On 2022-10-21 at 12:30:56 +0800, Abel Wu wrote:
> Hi Chen, thanks for your reviewing!
> 
> On 10/21/22 12:03 PM, Chen Yu wrote:
> > On 2022-10-19 at 20:28:58 +0800, Abel Wu wrote:
> > [cut]
> > > A major concern is the accuracy of the idle cpumask. A cpu present
> > > in the mask might not be idle any more, which is called the false
> > > positive cpu. Such cpus will negate lots of benefit this feature
> > > brings. The strategy against the false positives will be introduced
> > > in next patch.
> > > 
> > I was thinking that, if patch[3/4] needs [4/4] to fix the false positives,
> > maybe SIS_CORE could be disabled by default in 3/4 but enabled
> > in 4/4? So this might facilicate git bisect in case of any regression
> > check?
> 
> Agreed. Will fix in next version.
> 
> > [cut]
> > > + * To honor the rule of CORE granule update, set this cpu to the LLC idle
> > > + * cpumask only if there is no cpu of this core showed up in the cpumask.
> > > + */
> > > +static void update_idle_cpu(int cpu)
> > > +{
> > > +	struct sched_domain_shared *sds;
> > > +
> > > +	if (!sched_feat(SIS_CORE))
> > > +		return;
> > > +
> > > +	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > > +	if (sds) {
> > > +		struct cpumask *icpus = to_cpumask(sds->icpus);
> > > +
> > > +		/*
> > > +		 * This is racy against clearing in select_idle_cpu(),
> > > +		 * and can lead to idle cpus miss the chance to be set to
> > > +		 * the idle cpumask, thus the idle cpus are temporarily
> > > +		 * out of reach in SIS domain scan. But it should be rare
> > > +		 * and we still have ILB to kick them working.
> > > +		 */
> > > +		if (!cpumask_intersects(cpu_smt_mask(cpu), icpus))
> > > +			cpumask_set_cpu(cpu, icpus);
> > Maybe I miss something, here we only set one CPU in the icpus, but
> > when we reach update_idle_cpu(), all SMT siblings of 'cpu' are idle,
> > is this intended for 'CORE granule update'?
> 
> The __update_idle_core() is called by all the cpus that need to go idle
> to update has_idle_core if necessary, and update_idle_cpu() is called
> before that check.
>
I see.

Since __update_idle_core() has already checked whether all SMT
siblings of 'cpu' are idle, can that information also be propagated
to icpus?

thanks,
Chenyu 
> Thanks,
> Abel
> 

* Re: [PATCH v6 3/4] sched/fair: Introduce SIS_CORE
  2022-10-21  4:34       ` Chen Yu
@ 2022-10-21  9:35         ` Abel Wu
  2022-10-21 11:14           ` Chen Yu
  0 siblings, 1 reply; 18+ messages in thread
From: Abel Wu @ 2022-10-21  9:35 UTC (permalink / raw)
  To: Chen Yu
  Cc: Peter Zijlstra, Ingo Molnar, Mel Gorman, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Josh Don, Tim Chen,
	K Prateek Nayak, Gautham R . Shenoy, Aubrey Li, Qais Yousef,
	Juri Lelli, Rik van Riel, Yicong Yang, Barry Song, linux-kernel

On 10/21/22 12:34 PM, Chen Yu wrote:
> On 2022-10-21 at 12:30:56 +0800, Abel Wu wrote:
>> Hi Chen, thanks for your reviewing!
>>
>> On 10/21/22 12:03 PM, Chen Yu wrote:
>>> On 2022-10-19 at 20:28:58 +0800, Abel Wu wrote:
>>> [cut]
>>>> A major concern is the accuracy of the idle cpumask. A cpu present
>>>> in the mask might not be idle any more, which is called the false
>>>> positive cpu. Such cpus will negate lots of benefit this feature
>>>> brings. The strategy against the false positives will be introduced
>>>> in next patch.
>>>>
>>> I was thinking that, if patch[3/4] needs [4/4] to fix the false positives,
>>> maybe SIS_CORE could be disabled by default in 3/4 but enabled
>>> in 4/4? So this might facilicate git bisect in case of any regression
>>> check?
>>
>> Agreed. Will fix in next version.
>>
>>> [cut]
>>>> + * To honor the rule of CORE granule update, set this cpu to the LLC idle
>>>> + * cpumask only if there is no cpu of this core showed up in the cpumask.
>>>> + */
>>>> +static void update_idle_cpu(int cpu)
>>>> +{
>>>> +	struct sched_domain_shared *sds;
>>>> +
>>>> +	if (!sched_feat(SIS_CORE))
>>>> +		return;
>>>> +
>>>> +	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
>>>> +	if (sds) {
>>>> +		struct cpumask *icpus = to_cpumask(sds->icpus);
>>>> +
>>>> +		/*
>>>> +		 * This is racy against clearing in select_idle_cpu(),
>>>> +		 * and can lead to idle cpus miss the chance to be set to
>>>> +		 * the idle cpumask, thus the idle cpus are temporarily
>>>> +		 * out of reach in SIS domain scan. But it should be rare
>>>> +		 * and we still have ILB to kick them working.
>>>> +		 */
>>>> +		if (!cpumask_intersects(cpu_smt_mask(cpu), icpus))
>>>> +			cpumask_set_cpu(cpu, icpus);
>>> Maybe I miss something, here we only set one CPU in the icpus, but
>>> when we reach update_idle_cpu(), all SMT siblings of 'cpu' are idle,
>>> is this intended for 'CORE granule update'?
>>
>> The __update_idle_core() is called by all the cpus that need to go idle
>> to update has_idle_core if necessary, and update_idle_cpu() is called
>> before that check.
>>
> I see.
> 
> Since __update_idle_core() has checked all SMT siblings of 'cpu' if
> they are idle, can that information also be updated to icpus?

I think this would simply fall back to the original per-cpu proposal
and lose the opportunity to spread tasks across different cores.

* Re: [PATCH v6 3/4] sched/fair: Introduce SIS_CORE
  2022-10-21  9:35         ` Abel Wu
@ 2022-10-21 11:14           ` Chen Yu
  0 siblings, 0 replies; 18+ messages in thread
From: Chen Yu @ 2022-10-21 11:14 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, Ingo Molnar, Mel Gorman, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Josh Don, Tim Chen,
	K Prateek Nayak, Gautham R . Shenoy, Aubrey Li, Qais Yousef,
	Juri Lelli, Rik van Riel, Yicong Yang, Barry Song, linux-kernel

On 2022-10-21 at 17:35:06 +0800, Abel Wu wrote:
> On 10/21/22 12:34 PM, Chen Yu wrote:
> > On 2022-10-21 at 12:30:56 +0800, Abel Wu wrote:
> > > Hi Chen, thanks for your reviewing!
> > > 
> > > On 10/21/22 12:03 PM, Chen Yu wrote:
> > > > On 2022-10-19 at 20:28:58 +0800, Abel Wu wrote:
> > > > [cut]
> > > > > A major concern is the accuracy of the idle cpumask. A cpu present
> > > > > in the mask might not be idle any more, which is called the false
> > > > > positive cpu. Such cpus will negate lots of benefit this feature
> > > > > brings. The strategy against the false positives will be introduced
> > > > > in next patch.
> > > > > 
> > > > I was thinking that, if patch[3/4] needs [4/4] to fix the false positives,
> > > > maybe SIS_CORE could be disabled by default in 3/4 but enabled
> > > > in 4/4? So this might facilicate git bisect in case of any regression
> > > > check?
> > > 
> > > Agreed. Will fix in next version.
> > > 
> > > > [cut]
> > > > > + * To honor the rule of CORE granule update, set this cpu to the LLC idle
> > > > > + * cpumask only if there is no cpu of this core showed up in the cpumask.
> > > > > + */
> > > > > +static void update_idle_cpu(int cpu)
> > > > > +{
> > > > > +	struct sched_domain_shared *sds;
> > > > > +
> > > > > +	if (!sched_feat(SIS_CORE))
> > > > > +		return;
> > > > > +
> > > > > +	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > > > > +	if (sds) {
> > > > > +		struct cpumask *icpus = to_cpumask(sds->icpus);
> > > > > +
> > > > > +		/*
> > > > > +		 * This is racy against clearing in select_idle_cpu(),
> > > > > +		 * and can lead to idle cpus miss the chance to be set to
> > > > > +		 * the idle cpumask, thus the idle cpus are temporarily
> > > > > +		 * out of reach in SIS domain scan. But it should be rare
> > > > > +		 * and we still have ILB to kick them working.
> > > > > +		 */
> > > > > +		if (!cpumask_intersects(cpu_smt_mask(cpu), icpus))
> > > > > +			cpumask_set_cpu(cpu, icpus);
> > > > Maybe I miss something, here we only set one CPU in the icpus, but
> > > > when we reach update_idle_cpu(), all SMT siblings of 'cpu' are idle,
> > > > is this intended for 'CORE granule update'?
> > > 
> > > The __update_idle_core() is called by all the cpus that need to go idle
> > > to update has_idle_core if necessary, and update_idle_cpu() is called
> > > before that check.
> > > 
> > I see.
> > 
> > Since __update_idle_core() has checked all SMT siblings of 'cpu' if
> > they are idle, can that information also be updated to icpus?
> 
> I think this will simply fallback to the original per-cpu proposal and
> lose the opportunity to spread tasks to different cores.
OK, makes sense.

thanks,
Chenyu

* Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS
  2022-10-19 12:28 [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS Abel Wu
                   ` (3 preceding siblings ...)
  2022-10-19 12:28 ` [PATCH v6 4/4] sched/fair: Deal with SIS scan failures Abel Wu
@ 2022-11-04  7:29 ` Abel Wu
  2022-11-14  5:45 ` K Prateek Nayak
  2023-02-07  3:42 ` K Prateek Nayak
  6 siblings, 0 replies; 18+ messages in thread
From: Abel Wu @ 2022-11-04  7:29 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Mel Gorman, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider
  Cc: Josh Don, Chen Yu, Tim Chen, K Prateek Nayak, Gautham R . Shenoy,
	Aubrey Li, Qais Yousef, Juri Lelli, Rik van Riel, Yicong Yang,
	Barry Song, linux-kernel

Ping :)

On 10/19/22 8:28 PM, Abel Wu wrote:
> This patchset tries to improve SIS scan efficiency by recording idle
> cpus in a cpumask for each LLC which will be used as a target cpuset
> in the domain scan. The cpus are recorded at CORE granule to avoid
> tasks being stack on same core.
> 
> v5 -> v6:
>   - Rename SIS_FILTER to SIS_CORE as it can only be activated when
>     SMT is enabled and better describes the behavior of CORE granule
>     update & load delivery.
>   - Removed the part of limited scan for idle cores since it might be
>     better to open another thread to discuss the strategies such as
>     limited or scaled depth. But keep the part of full scan for idle
>     cores when LLC is overloaded because SIS_CORE can greatly reduce
>     the overhead of full scan in such case.
>   - Removed the state of sd_is_busy which indicates an LLC is fully
>     busy and we can safely skip the SIS domain scan. I would prefer
>     leave this to SIS_UTIL.
>   - The filter generation mechanism is replaced by in-place updates
>     during domain scan to better deal with partial scan failures.
>   - Collect Reviewed-bys from Tim Chen
> 
> ...
> 

* Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS
  2022-10-19 12:28 [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS Abel Wu
                   ` (4 preceding siblings ...)
  2022-11-04  7:29 ` [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS Abel Wu
@ 2022-11-14  5:45 ` K Prateek Nayak
  2022-11-15  8:31   ` Abel Wu
  2023-02-07  3:42 ` K Prateek Nayak
  6 siblings, 1 reply; 18+ messages in thread
From: K Prateek Nayak @ 2022-11-14  5:45 UTC (permalink / raw)
  To: Abel Wu, Peter Zijlstra, Ingo Molnar, Mel Gorman,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider
  Cc: Josh Don, Chen Yu, Tim Chen, Gautham R . Shenoy, Aubrey Li,
	Qais Yousef, Juri Lelli, Rik van Riel, Yicong Yang, Barry Song,
	linux-kernel

Hello Abel,

Sorry for the delay. I've tested the patch on a dual-socket Zen3
system (2 x 64C/128T).

tl;dr

o I do not notice any regressions with the standard benchmarks.
o schbench sees a nice improvement in tail latency when the number
  of workers is equal to the number of cores in the system in NPS1
  and NPS2 mode. (Marked with "^")
o Few data points show improvements in tbench in NPS1 and NPS2 mode.
  (Marked with "^")

I'm still in the process of running larger workloads. If there is any
specific workload you would like me to run on the test system, please
do let me know. Below is the detailed report:

Following are the results from running standard benchmarks on a
dual socket Zen3 (2 x 64C/128T) machine configured in different
NPS modes.

NPS modes are used to logically divide a single socket into
multiple NUMA regions.
Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
    Total 2 NUMA nodes in the dual socket machine.

    Node 0: 0-63,   128-191
    Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
    Total 4 NUMA nodes exist over 2 sockets.
   
    Node 0: 0-31,   128-159
    Node 1: 32-63,  160-191
    Node 2: 64-95,  192-223
    Node 3: 96-127, 224-255

NPS4: Each socket is logically divided into 4 NUMA regions.
    Total 8 NUMA nodes exist over 2 sockets.
   
    Node 0: 0-15,    128-143
    Node 1: 16-31,   144-159
    Node 2: 32-47,   160-175
    Node 3: 48-63,   176-191
    Node 4: 64-79,   192-207
    Node 5: 80-95,   208-223
    Node 6: 96-111,  224-239
    Node 7: 112-127, 240-255

Benchmark Results:

Kernel versions:
- tip:          5.19.0 tip sched/core
- sis_core: 	5.19.0 tip sched/core + this series

When we started testing, the tip was at:
commit fdf756f71271 ("sched: Fix more TASK_state comparisons")

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

o NPS1

Test:			tip			sis_core
 1-groups:	   4.06 (0.00 pct)	   4.26 (-4.92 pct)	*
 1-groups:	   4.14 (0.00 pct)	   4.09 (1.20 pct)	[Verification Run]
 2-groups:	   4.76 (0.00 pct)	   4.71 (1.05 pct)
 4-groups:	   5.22 (0.00 pct)	   5.11 (2.10 pct)
 8-groups:	   5.35 (0.00 pct)	   5.31 (0.74 pct)
16-groups:	   7.21 (0.00 pct)	   6.80 (5.68 pct)

o NPS2

Test:			tip			sis_core
 1-groups:	   4.09 (0.00 pct)	   4.08 (0.24 pct)
 2-groups:	   4.70 (0.00 pct)	   4.69 (0.21 pct)
 4-groups:	   5.05 (0.00 pct)	   4.92 (2.57 pct)
 8-groups:	   5.35 (0.00 pct)	   5.26 (1.68 pct)
16-groups:	   6.37 (0.00 pct)	   6.34 (0.47 pct)

o NPS4

Test:			tip			sis_core
 1-groups:	   4.07 (0.00 pct)	   3.99 (1.96 pct)
 2-groups:	   4.65 (0.00 pct)	   4.59 (1.29 pct)
 4-groups:	   5.13 (0.00 pct)	   5.00 (2.53 pct)
 8-groups:	   5.47 (0.00 pct)	   5.43 (0.73 pct)
16-groups:	   6.82 (0.00 pct)	   6.56 (3.81 pct)

~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~

o NPS1

#workers:	tip			sis_core
  1:	  33.00 (0.00 pct)	  33.00 (0.00 pct)
  2:	  35.00 (0.00 pct)	  35.00 (0.00 pct)
  4:	  39.00 (0.00 pct)	  38.00 (2.56 pct)
  8:	  49.00 (0.00 pct)	  48.00 (2.04 pct)
 16:	  63.00 (0.00 pct)	  66.00 (-4.76 pct)
 32:	 109.00 (0.00 pct)	 107.00 (1.83 pct)
 64:	 208.00 (0.00 pct)	 216.00 (-3.84 pct)
128:	 559.00 (0.00 pct)	 469.00 (16.10 pct)     ^
256:	 45888.00 (0.00 pct)	 47552.00 (-3.62 pct)
512:	 80000.00 (0.00 pct)	 79744.00 (0.32 pct)

o NPS2

#workers:	tip			sis_core
  1:	  30.00 (0.00 pct)	  32.00 (-6.66 pct)
  2:	  37.00 (0.00 pct)	  34.00 (8.10 pct)
  4:	  39.00 (0.00 pct)	  36.00 (7.69 pct)
  8:	  51.00 (0.00 pct)	  49.00 (3.92 pct)
 16:	  67.00 (0.00 pct)	  66.00 (1.49 pct)
 32:	 117.00 (0.00 pct)	 109.00 (6.83 pct)
 64:	 216.00 (0.00 pct)	 213.00 (1.38 pct)
128:	 529.00 (0.00 pct)	 465.00 (12.09 pct)     ^
256:	 47040.00 (0.00 pct)	 46528.00 (1.08 pct)
512:	 84864.00 (0.00 pct)	 83584.00 (1.50 pct)

o NPS4

#workers:	tip			sis_core
  1:	  23.00 (0.00 pct)	  28.00 (-21.73 pct)
  2:	  28.00 (0.00 pct)	  36.00 (-28.57 pct)
  4:	  41.00 (0.00 pct)	  43.00 (-4.87 pct)
  8:	  60.00 (0.00 pct)	  48.00 (20.00 pct)
 16:	  71.00 (0.00 pct)	  69.00 (2.81 pct)
 32:	 117.00 (0.00 pct)	 115.00 (1.70 pct)
 64:	 227.00 (0.00 pct)	 228.00 (-0.44 pct)
128:	 545.00 (0.00 pct)	 545.00 (0.00 pct)
256:	 45632.00 (0.00 pct)	 47680.00 (-4.48 pct)
512:	 81024.00 (0.00 pct)	 76416.00 (5.68 pct)

Note: For lower worker counts, schbench can show run-to-run
variation depending on external factors, so regressions at the
lower worker counts can be ignored. The results are included to
spot any large blow-up in the tail latency at the larger worker
counts.

~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

o NPS1

Clients:	tip			sis_core
    1	 578.37 (0.00 pct)	 582.09 (0.64 pct)
    2	 1062.09 (0.00 pct)	 1063.95 (0.17 pct)
    4	 1800.62 (0.00 pct)	 1879.18 (4.36 pct)
    8	 3211.02 (0.00 pct)	 3220.44 (0.29 pct)
   16	 4848.92 (0.00 pct)	 4890.08 (0.84 pct)
   32	 9091.36 (0.00 pct)	 9721.13 (6.92 pct)     ^
   64	 15454.01 (0.00 pct)	 15124.42 (-2.13 pct)
  128	 3511.33 (0.00 pct)	 14314.79 (307.67 pct)
  128    19910.99 (0.00pct)      19935.61 (0.12 pct)   [Verification Run]
  256	 50019.32 (0.00 pct)	 50708.24 (1.37 pct)
  512	 44317.68 (0.00 pct)	 44787.48 (1.06 pct)
 1024	 41200.85 (0.00 pct)	 42079.29 (2.13 pct)

o NPS2

Clients:	tip			sis_core
    1	 576.05 (0.00 pct)	 579.18 (0.54 pct)
    2	 1037.68 (0.00 pct)	 1070.49 (3.16 pct)
    4	 1818.13 (0.00 pct)	 1860.22 (2.31 pct)
    8	 3004.16 (0.00 pct)	 3087.09 (2.76 pct)
   16	 4520.11 (0.00 pct)	 4789.53 (5.96 pct)
   32	 8624.23 (0.00 pct)	 9439.50 (9.45 pct)     ^
   64	 14886.75 (0.00 pct)	 15004.96 (0.79 pct)
  128	 20602.00 (0.00 pct)	 17730.31 (-13.93 pct) *
  128    20602.00 (0.00 pct)     19585.20 (-4.93 pct)   [Verification Run]
  256	 45566.83 (0.00 pct)	 47922.70 (5.17 pct)
  512	 42717.49 (0.00 pct)	 43809.68 (2.55 pct)
 1024	 40936.61 (0.00 pct)	 40787.71 (-0.36 pct)

o NPS4

Clients:	tip			sis_core
    1	 576.36 (0.00 pct)	 580.83 (0.77 pct)
    2	 1044.26 (0.00 pct)	 1066.50 (2.12 pct)
    4	 1839.77 (0.00 pct)	 1867.56 (1.51 pct)
    8	 3043.53 (0.00 pct)	 3115.17 (2.35 pct)
   16	 5207.54 (0.00 pct)	 4847.53 (-6.91 pct)	*
   16	 4722.56 (0.00 pct)	 4811.29 (1.87 pct)	[Verification Run]
   32	 9263.86 (0.00 pct)	 9478.68 (2.31 pct)
   64	 14959.66 (0.00 pct)	 15267.39 (2.05 pct)
  128	 20698.65 (0.00 pct)	 20432.19 (-1.28 pct)
  256	 46666.21 (0.00 pct)	 46664.81 (0.00 pct)
  512	 41532.80 (0.00 pct)	 44241.12 (6.52 pct)
 1024	 39459.49 (0.00 pct)	 41043.22 (4.01 pct)

Note: On the tested kernel, with 128 clients, tbench can
run into a bottleneck during C2 exit. More details can be
found at:
https://lore.kernel.org/lkml/20220921063638.2489-1-kprateek.nayak@amd.com/
This issue has been fixed in v6.0 but was not part of the
tip kernel when I started testing. This data point has
been rerun with C2 disabled to get representative results.

~~~~~~~~~~
~ Stream ~
~~~~~~~~~~

o NPS1

-> 10 Runs:

Test:		tip			sis_core
 Copy:	 328419.14 (0.00 pct)	 337857.83 (2.87 pct)
Scale:	 206071.21 (0.00 pct)	 212133.82 (2.94 pct)
  Add:	 235271.48 (0.00 pct)	 243811.97 (3.63 pct)
Triad:	 253175.80 (0.00 pct)	 252333.43 (-0.33 pct)

-> 100 Runs:

Test:		tip			sis_core
 Copy:	 328209.61 (0.00 pct)	 339817.27 (3.53 pct)
Scale:	 216310.13 (0.00 pct)	 218635.16 (1.07 pct)
  Add:	 244417.83 (0.00 pct)	 245641.47 (0.50 pct)
Triad:	 237508.83 (0.00 pct)	 255387.28 (7.52 pct)

o NPS2

-> 10 Runs:

Test:		tip			sis_core
 Copy:	 336503.88 (0.00 pct)	 339684.21 (0.94 pct)
Scale:	 218035.23 (0.00 pct)	 217601.11 (-0.19 pct)
  Add:	 257677.42 (0.00 pct)	 258608.34 (0.36 pct)
Triad:	 268872.37 (0.00 pct)	 272548.09 (1.36 pct)

-> 100 Runs:

Test:		tip			sis_core
 Copy:	 332304.34 (0.00 pct)	 341565.75 (2.78 pct)
Scale:	 223421.60 (0.00 pct)	 224267.40 (0.37 pct)
  Add:	 252363.56 (0.00 pct)	 254926.98 (1.01 pct)
Triad:	 266687.56 (0.00 pct)	 270782.81 (1.53 pct)

o NPS4

-> 10 Runs:

Test:		tip			sis_core
 Copy:	 353515.62 (0.00 pct)	 342060.85 (-3.24 pct)
Scale:	 228854.37 (0.00 pct)	 218262.41 (-4.62 pct)
  Add:	 254942.12 (0.00 pct)	 241975.90 (-5.08 pct)
Triad:	 270521.87 (0.00 pct)	 257686.71 (-4.74 pct)

-> 100 Runs:

Test:		tip			sis_core
 Copy:	 374520.81 (0.00 pct)	 369353.13 (-1.37 pct)
Scale:	 246280.23 (0.00 pct)	 253881.69 (3.08 pct)
  Add:	 262772.72 (0.00 pct)	 266484.58 (1.41 pct)
Triad:	 283740.92 (0.00 pct)	 279981.18 (-1.32 pct)

On 10/19/2022 5:58 PM, Abel Wu wrote:
> This patchset tries to improve SIS scan efficiency by recording idle
> cpus in a cpumask for each LLC which will be used as a target cpuset
> in the domain scan. The cpus are recorded at CORE granule to avoid
> tasks being stack on same core.
> 
> v5 -> v6:
>  - Rename SIS_FILTER to SIS_CORE as it can only be activated when
>    SMT is enabled and better describes the behavior of CORE granule
>    update & load delivery.
>  - Removed the part of limited scan for idle cores since it might be
>    better to open another thread to discuss the strategies such as
>    limited or scaled depth. But keep the part of full scan for idle
>    cores when LLC is overloaded because SIS_CORE can greatly reduce
>    the overhead of full scan in such case.
>  - Removed the state of sd_is_busy which indicates an LLC is fully
>    busy and we can safely skip the SIS domain scan. I would prefer
>    leave this to SIS_UTIL.
>  - The filter generation mechanism is replaced by in-place updates
>    during domain scan to better deal with partial scan failures.
>  - Collect Reviewed-bys from Tim Chen
> 
> v4 -> v5:
>  - Add limited scan for idle cores when overloaded, suggested by Mel
>  - Split out several patches since they are irrelevant to this scope
>  - Add quick check on ttwu_pending before core update
>  - Wrap the filter into SIS_FILTER feature, suggested by Chen Yu
>  - Move the main filter logic to the idle path, because the newidle
>    balance can bail out early if rq->avg_idle is small enough and
>    lose chances to update the filter.
> 
> v3 -> v4:
>  - Update filter in load_balance rather than in the tick
>  - Now the filter contains unoccupied cpus rather than overloaded ones
>  - Added mechanisms to deal with the false positive cases
> 
> v2 -> v3:
>  - Removed sched-idle balance feature and focus on SIS
>  - Take non-CFS tasks into consideration
>  - Several fixes/improvement suggested by Josh Don
> 
> v1 -> v2:
>  - Several optimizations on sched-idle balancing
>  - Ignore asym topos in can_migrate_task
>  - Add more benchmarks including SIS efficiency
>  - Re-organize patch as suggested by Mel Gorman
> 
> Abel Wu (4):
>   sched/fair: Skip core update if task pending
>   sched/fair: Ignore SIS_UTIL when has_idle_core
>   sched/fair: Introduce SIS_CORE
>   sched/fair: Deal with SIS scan failures
> 
>  include/linux/sched/topology.h |  15 ++++
>  kernel/sched/fair.c            | 122 +++++++++++++++++++++++++++++----
>  kernel/sched/features.h        |   7 ++
>  kernel/sched/sched.h           |   3 +
>  kernel/sched/topology.c        |   8 ++-
>  5 files changed, 141 insertions(+), 14 deletions(-)
> 

I ran pgbench from mmtests but realised there is too much run-to-run
variation on the system. I'm planning on running the MongoDB benchmark,
which is more stable on this system, and a couple more workloads, but
the initial results look good. I'll get back with results later this
week or early next week. Meanwhile, if you need data for any specific
workload on the test system, please do let me know.

--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS
  2022-11-14  5:45 ` K Prateek Nayak
@ 2022-11-15  8:31   ` Abel Wu
  2022-11-15 11:28     ` K Prateek Nayak
  0 siblings, 1 reply; 18+ messages in thread
From: Abel Wu @ 2022-11-15  8:31 UTC (permalink / raw)
  To: K Prateek Nayak, Peter Zijlstra, Ingo Molnar, Mel Gorman,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider
  Cc: Josh Don, Chen Yu, Tim Chen, Gautham R . Shenoy, Aubrey Li,
	Qais Yousef, Juri Lelli, Rik van Riel, Yicong Yang, Barry Song,
	linux-kernel

Hi Prateek, thanks very much for your detailed testing!

On 11/14/22 1:45 PM, K Prateek Nayak wrote:
> Hello Abel,
> 
> Sorry for the delay. I've tested the patch on a dual socket Zen3 system
> (2 x 64C/128T)
> 
> tl;dr
> 
> o I do not notice any regressions with the standard benchmarks.
> o schbench sees a nice improvement to the tail latency when the number
>    of workers is equal to the number of cores in the system in NPS1 and
>    NPS2 mode. (Marked with "^")
> o A few data points show improvements in tbench in NPS1 and NPS2 mode.
>    (Marked with "^")
> 
> I'm still in the process of running larger workloads. If there is any
> specific workload you would like me to run on the test system, please
> do let me know. Below is the detailed report:

Nothing particular comes to mind, and I think testing larger workloads
is great. Thanks!

> 
> Following are the results from running standard benchmarks on a
> dual socket Zen3 (2 x 64C/128T) machine configured in different
> NPS modes.
> 
> NPS Modes are used to logically divide a single socket into
> multiple NUMA regions.
> Following is the NUMA configuration for each NPS mode on the system:
> 
> NPS1: Each socket is a NUMA node.
>      Total 2 NUMA nodes in the dual socket machine.
> 
>      Node 0: 0-63,   128-191
>      Node 1: 64-127, 192-255
> 
> NPS2: Each socket is further logically divided into 2 NUMA regions.
>      Total 4 NUMA nodes exist over 2 sockets.
>     
>      Node 0: 0-31,   128-159
>      Node 1: 32-63,  160-191
>      Node 2: 64-95,  192-223
>      Node 3: 96-127, 224-255
> 
> NPS4: Each socket is logically divided into 4 NUMA regions.
>      Total 8 NUMA nodes exist over 2 sockets.
>     
>      Node 0: 0-15,    128-143
>      Node 1: 16-31,   144-159
>      Node 2: 32-47,   160-175
>      Node 3: 48-63,   176-191
>      Node 4: 64-79,   192-207
>      Node 5: 80-95,   208-223
>      Node 6: 96-111,  224-239
>      Node 7: 112-127, 240-255
> 
> Benchmark Results:
> 
> Kernel versions:
> - tip:          5.19.0 tip sched/core
> - sis_core: 	5.19.0 tip sched/core + this series
> 
> When we started testing, the tip was at:
> commit fdf756f71271 ("sched: Fix more TASK_state comparisons")
> 
> ~~~~~~~~~~~~~
> ~ hackbench ~
> ~~~~~~~~~~~~~
> 
> o NPS1
> 
> Test:			tip			sis_core
>   1-groups:	   4.06 (0.00 pct)	   4.26 (-4.92 pct)	*
>   1-groups:	   4.14 (0.00 pct)	   4.09 (1.20 pct)	[Verification Run]
>   2-groups:	   4.76 (0.00 pct)	   4.71 (1.05 pct)
>   4-groups:	   5.22 (0.00 pct)	   5.11 (2.10 pct)
>   8-groups:	   5.35 (0.00 pct)	   5.31 (0.74 pct)
> 16-groups:	   7.21 (0.00 pct)	   6.80 (5.68 pct)
> 
> o NPS2
> 
> Test:			tip			sis_core
>   1-groups:	   4.09 (0.00 pct)	   4.08 (0.24 pct)
>   2-groups:	   4.70 (0.00 pct)	   4.69 (0.21 pct)
>   4-groups:	   5.05 (0.00 pct)	   4.92 (2.57 pct)
>   8-groups:	   5.35 (0.00 pct)	   5.26 (1.68 pct)
> 16-groups:	   6.37 (0.00 pct)	   6.34 (0.47 pct)
> 
> o NPS4
> 
> Test:			tip			sis_core
>   1-groups:	   4.07 (0.00 pct)	   3.99 (1.96 pct)
>   2-groups:	   4.65 (0.00 pct)	   4.59 (1.29 pct)
>   4-groups:	   5.13 (0.00 pct)	   5.00 (2.53 pct)
>   8-groups:	   5.47 (0.00 pct)	   5.43 (0.73 pct)
> 16-groups:	   6.82 (0.00 pct)	   6.56 (3.81 pct)

Although each cpu will get ~2.5 tasks with 16 groups, which can
be considered overloaded, I tested on an AMD EPYC 7Y83 machine and
the total cpu usage was ~82% (with some older kernel version),
so there is still lots of idle time.

I guess cutting off at 16 groups is because it is already loaded
enough compared to real workloads, so testing more groups might
just be a waste of time?
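
The ~2.5 figure comes from hackbench's fan-out; as a rough sketch
(assuming the default of 20 sender/receiver pairs per group and the
256-cpu machine under test):

```python
# Rough estimate of the hackbench load discussed above. Assumes
# hackbench's default fan-out of 20 sender/receiver pairs per group
# (i.e. 40 tasks per group) on the 2 x 64C/128T = 256-cpu machine.
TASKS_PER_GROUP = 20 * 2  # 20 senders + 20 receivers
NR_CPUS = 256             # dual socket Zen3, 2 x 64C/128T

groups = 16
total_tasks = groups * TASKS_PER_GROUP
tasks_per_cpu = total_tasks / NR_CPUS

print(total_tasks, tasks_per_cpu)  # 640 tasks, 2.5 tasks per cpu
```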

Thanks & Best Regards,
	Abel

> 
> ~~~~~~~~~~~~
> ~ schbench ~
> ~~~~~~~~~~~~
> 
> o NPS1
> 
> #workers:	tip			sis_core
>    1:	  33.00 (0.00 pct)	  33.00 (0.00 pct)
>    2:	  35.00 (0.00 pct)	  35.00 (0.00 pct)
>    4:	  39.00 (0.00 pct)	  38.00 (2.56 pct)
>    8:	  49.00 (0.00 pct)	  48.00 (2.04 pct)
>   16:	  63.00 (0.00 pct)	  66.00 (-4.76 pct)
>   32:	 109.00 (0.00 pct)	 107.00 (1.83 pct)
>   64:	 208.00 (0.00 pct)	 216.00 (-3.84 pct)
> 128:	 559.00 (0.00 pct)	 469.00 (16.10 pct)     ^
> 256:	 45888.00 (0.00 pct)	 47552.00 (-3.62 pct)
> 512:	 80000.00 (0.00 pct)	 79744.00 (0.32 pct)
> 
> o NPS2
> 
> #workers:	tip			sis_core
>    1:	  30.00 (0.00 pct)	  32.00 (-6.66 pct)
>    2:	  37.00 (0.00 pct)	  34.00 (8.10 pct)
>    4:	  39.00 (0.00 pct)	  36.00 (7.69 pct)
>    8:	  51.00 (0.00 pct)	  49.00 (3.92 pct)
>   16:	  67.00 (0.00 pct)	  66.00 (1.49 pct)
>   32:	 117.00 (0.00 pct)	 109.00 (6.83 pct)
>   64:	 216.00 (0.00 pct)	 213.00 (1.38 pct)
> 128:	 529.00 (0.00 pct)	 465.00 (12.09 pct)     ^
> 256:	 47040.00 (0.00 pct)	 46528.00 (1.08 pct)
> 512:	 84864.00 (0.00 pct)	 83584.00 (1.50 pct)
> 
> o NPS4
> 
> #workers:	tip			sis_core
>    1:	  23.00 (0.00 pct)	  28.00 (-21.73 pct)
>    2:	  28.00 (0.00 pct)	  36.00 (-28.57 pct)
>    4:	  41.00 (0.00 pct)	  43.00 (-4.87 pct)
>    8:	  60.00 (0.00 pct)	  48.00 (20.00 pct)
>   16:	  71.00 (0.00 pct)	  69.00 (2.81 pct)
>   32:	 117.00 (0.00 pct)	 115.00 (1.70 pct)
>   64:	 227.00 (0.00 pct)	 228.00 (-0.44 pct)
> 128:	 545.00 (0.00 pct)	 545.00 (0.00 pct)
> 256:	 45632.00 (0.00 pct)	 47680.00 (-4.48 pct)
> 512:	 81024.00 (0.00 pct)	 76416.00 (5.68 pct)
> 
> Note: For lower worker counts, schbench can show run-to-run
> variation depending on external factors, so regressions at
> lower worker counts can be ignored. The results are included
> to spot any large blow-up in the tail latency at larger
> worker counts.
> 
> ~~~~~~~~~~
> ~ tbench ~
> ~~~~~~~~~~
> 
> o NPS1
> 
> Clients:	tip			sis_core
>      1	 578.37 (0.00 pct)	 582.09 (0.64 pct)
>      2	 1062.09 (0.00 pct)	 1063.95 (0.17 pct)
>      4	 1800.62 (0.00 pct)	 1879.18 (4.36 pct)
>      8	 3211.02 (0.00 pct)	 3220.44 (0.29 pct)
>     16	 4848.92 (0.00 pct)	 4890.08 (0.84 pct)
>     32	 9091.36 (0.00 pct)	 9721.13 (6.92 pct)     ^
>     64	 15454.01 (0.00 pct)	 15124.42 (-2.13 pct)
>    128	 3511.33 (0.00 pct)	 14314.79 (307.67 pct)
>    128    19910.99 (0.00pct)      19935.61 (0.12 pct)   [Verification Run]
>    256	 50019.32 (0.00 pct)	 50708.24 (1.37 pct)
>    512	 44317.68 (0.00 pct)	 44787.48 (1.06 pct)
>   1024	 41200.85 (0.00 pct)	 42079.29 (2.13 pct)
> 
> o NPS2
> 
> Clients:	tip			sis_core
>      1	 576.05 (0.00 pct)	 579.18 (0.54 pct)
>      2	 1037.68 (0.00 pct)	 1070.49 (3.16 pct)
>      4	 1818.13 (0.00 pct)	 1860.22 (2.31 pct)
>      8	 3004.16 (0.00 pct)	 3087.09 (2.76 pct)
>     16	 4520.11 (0.00 pct)	 4789.53 (5.96 pct)
>     32	 8624.23 (0.00 pct)	 9439.50 (9.45 pct)     ^
>     64	 14886.75 (0.00 pct)	 15004.96 (0.79 pct)
>    128	 20602.00 (0.00 pct)	 17730.31 (-13.93 pct) *
>    128    20602.00 (0.00 pct)     19585.20 (-4.93 pct)   [Verification Run]
>    256	 45566.83 (0.00 pct)	 47922.70 (5.17 pct)
>    512	 42717.49 (0.00 pct)	 43809.68 (2.55 pct)
>   1024	 40936.61 (0.00 pct)	 40787.71 (-0.36 pct)
> 
> o NPS4
> 
> Clients:	tip			sis_core
>      1	 576.36 (0.00 pct)	 580.83 (0.77 pct)
>      2	 1044.26 (0.00 pct)	 1066.50 (2.12 pct)
>      4	 1839.77 (0.00 pct)	 1867.56 (1.51 pct)
>      8	 3043.53 (0.00 pct)	 3115.17 (2.35 pct)
>     16	 5207.54 (0.00 pct)	 4847.53 (-6.91 pct)	*
>     16	 4722.56 (0.00 pct)	 4811.29 (1.87 pct)	[Verification Run]
>     32	 9263.86 (0.00 pct)	 9478.68 (2.31 pct)
>     64	 14959.66 (0.00 pct)	 15267.39 (2.05 pct)
>    128	 20698.65 (0.00 pct)	 20432.19 (-1.28 pct)
>    256	 46666.21 (0.00 pct)	 46664.81 (0.00 pct)
>    512	 41532.80 (0.00 pct)	 44241.12 (6.52 pct)
>   1024	 39459.49 (0.00 pct)	 41043.22 (4.01 pct)
> 
> Note: On the tested kernel, with 128 clients, tbench can
> run into a bottleneck during C2 exit. More details can be
> found at:
> https://lore.kernel.org/lkml/20220921063638.2489-1-kprateek.nayak@amd.com/
> This issue has been fixed in v6.0 but was not part of the
> tip kernel when I started testing. This data point has
> been rerun with C2 disabled to get representative results.
> 
> ~~~~~~~~~~
> ~ Stream ~
> ~~~~~~~~~~
> 
> o NPS1
> 
> -> 10 Runs:
> 
> Test:		tip			sis_core
>   Copy:	 328419.14 (0.00 pct)	 337857.83 (2.87 pct)
> Scale:	 206071.21 (0.00 pct)	 212133.82 (2.94 pct)
>    Add:	 235271.48 (0.00 pct)	 243811.97 (3.63 pct)
> Triad:	 253175.80 (0.00 pct)	 252333.43 (-0.33 pct)
> 
> -> 100 Runs:
> 
> Test:		tip			sis_core
>   Copy:	 328209.61 (0.00 pct)	 339817.27 (3.53 pct)
> Scale:	 216310.13 (0.00 pct)	 218635.16 (1.07 pct)
>    Add:	 244417.83 (0.00 pct)	 245641.47 (0.50 pct)
> Triad:	 237508.83 (0.00 pct)	 255387.28 (7.52 pct)
> 
> o NPS2
> 
> -> 10 Runs:
> 
> Test:		tip			sis_core
>   Copy:	 336503.88 (0.00 pct)	 339684.21 (0.94 pct)
> Scale:	 218035.23 (0.00 pct)	 217601.11 (-0.19 pct)
>    Add:	 257677.42 (0.00 pct)	 258608.34 (0.36 pct)
> Triad:	 268872.37 (0.00 pct)	 272548.09 (1.36 pct)
> 
> -> 100 Runs:
> 
> Test:		tip			sis_core
>   Copy:	 332304.34 (0.00 pct)	 341565.75 (2.78 pct)
> Scale:	 223421.60 (0.00 pct)	 224267.40 (0.37 pct)
>    Add:	 252363.56 (0.00 pct)	 254926.98 (1.01 pct)
> Triad:	 266687.56 (0.00 pct)	 270782.81 (1.53 pct)
> 
> o NPS4
> 
> -> 10 Runs:
> 
> Test:		tip			sis_core
>   Copy:	 353515.62 (0.00 pct)	 342060.85 (-3.24 pct)
> Scale:	 228854.37 (0.00 pct)	 218262.41 (-4.62 pct)
>    Add:	 254942.12 (0.00 pct)	 241975.90 (-5.08 pct)
> Triad:	 270521.87 (0.00 pct)	 257686.71 (-4.74 pct)
> 
> -> 100 Runs:
> 
> Test:		tip			sis_core
>   Copy:	 374520.81 (0.00 pct)	 369353.13 (-1.37 pct)
> Scale:	 246280.23 (0.00 pct)	 253881.69 (3.08 pct)
>    Add:	 262772.72 (0.00 pct)	 266484.58 (1.41 pct)
> Triad:	 283740.92 (0.00 pct)	 279981.18 (-1.32 pct)
> 
> On 10/19/2022 5:58 PM, Abel Wu wrote:
>> This patchset tries to improve SIS scan efficiency by recording idle
>> cpus in a cpumask for each LLC which will be used as a target cpuset
>> in the domain scan. The cpus are recorded at CORE granule to avoid
>> tasks being stacked on the same core.
>>
>> v5 -> v6:
>>   - Rename SIS_FILTER to SIS_CORE as it can only be activated when
>>     SMT is enabled and better describes the behavior of CORE granule
>>     update & load delivery.
>>   - Removed the part of limited scan for idle cores since it might be
>>     better to open another thread to discuss the strategies such as
>>     limited or scaled depth. But keep the part of full scan for idle
>>     cores when LLC is overloaded because SIS_CORE can greatly reduce
>>     the overhead of full scan in such case.
>>   - Removed the state of sd_is_busy which indicates an LLC is fully
>>     busy and we can safely skip the SIS domain scan. I would prefer
>>     to leave this to SIS_UTIL.
>>   - The filter generation mechanism is replaced by in-place updates
>>     during domain scan to better deal with partial scan failures.
>>   - Collect Reviewed-bys from Tim Chen
>>
>> v4 -> v5:
>>   - Add limited scan for idle cores when overloaded, suggested by Mel
>>   - Split out several patches since they are irrelevant to this scope
>>   - Add quick check on ttwu_pending before core update
>>   - Wrap the filter into SIS_FILTER feature, suggested by Chen Yu
>>   - Move the main filter logic to the idle path, because the newidle
>>     balance can bail out early if rq->avg_idle is small enough and
>>     lose chances to update the filter.
>>
>> v3 -> v4:
>>   - Update filter in load_balance rather than in the tick
>>   - Now the filter contains unoccupied cpus rather than overloaded ones
>>   - Added mechanisms to deal with the false positive cases
>>
>> v2 -> v3:
>>   - Removed sched-idle balance feature and focus on SIS
>>   - Take non-CFS tasks into consideration
>>   - Several fixes/improvement suggested by Josh Don
>>
>> v1 -> v2:
>>   - Several optimizations on sched-idle balancing
>>   - Ignore asym topos in can_migrate_task
>>   - Add more benchmarks including SIS efficiency
>>   - Re-organize patch as suggested by Mel Gorman
>>
>> Abel Wu (4):
>>    sched/fair: Skip core update if task pending
>>    sched/fair: Ignore SIS_UTIL when has_idle_core
>>    sched/fair: Introduce SIS_CORE
>>    sched/fair: Deal with SIS scan failures
>>
>>   include/linux/sched/topology.h |  15 ++++
>>   kernel/sched/fair.c            | 122 +++++++++++++++++++++++++++++----
>>   kernel/sched/features.h        |   7 ++
>>   kernel/sched/sched.h           |   3 +
>>   kernel/sched/topology.c        |   8 ++-
>>   5 files changed, 141 insertions(+), 14 deletions(-)
>>
> 
> I ran pgbench from mmtests but realised there is too much run-to-run
> variation on the system. I'm planning on running the MongoDB benchmark,
> which is more stable on this system, and a couple more workloads, but
> the initial results look good. I'll get back with results later this
> week or early next week. Meanwhile, if you need data for any specific
> workload on the test system, please do let me know.
> 
> --
> Thanks and Regards,
> Prateek

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS
  2022-11-15  8:31   ` Abel Wu
@ 2022-11-15 11:28     ` K Prateek Nayak
  2022-11-22 11:28       ` K Prateek Nayak
  0 siblings, 1 reply; 18+ messages in thread
From: K Prateek Nayak @ 2022-11-15 11:28 UTC (permalink / raw)
  To: Abel Wu, Peter Zijlstra, Ingo Molnar, Mel Gorman,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider
  Cc: Josh Don, Chen Yu, Tim Chen, Gautham R . Shenoy, Aubrey Li,
	Qais Yousef, Juri Lelli, Rik van Riel, Yicong Yang, Barry Song,
	linux-kernel

Hello Abel,

Thank you for taking a look at the report.

On 11/15/2022 2:01 PM, Abel Wu wrote:
> Hi Prateek, thanks very much for your detailed testing!
> 
> On 11/14/22 1:45 PM, K Prateek Nayak wrote:
>> Hello Abel,
>>
>> Sorry for the delay. I've tested the patch on a dual socket Zen3 system
>> (2 x 64C/128T)
>>
>> tl;dr
>>
>> o I do not notice any regressions with the standard benchmarks.
>> o schbench sees a nice improvement to the tail latency when the number
>>    of workers is equal to the number of cores in the system in NPS1 and
>>    NPS2 mode. (Marked with "^")
>> o A few data points show improvements in tbench in NPS1 and NPS2 mode.
>>    (Marked with "^")
>>
>> I'm still in the process of running larger workloads. If there is any
>> specific workload you would like me to run on the test system, please
>> do let me know. Below is the detailed report:
> 
> Nothing particular comes to mind, and I think testing larger workloads
> is great. Thanks!
>
>>
>> Following are the results from running standard benchmarks on a
>> dual socket Zen3 (2 x 64C/128T) machine configured in different
>> NPS modes.
>>
>> NPS Modes are used to logically divide a single socket into
>> multiple NUMA regions.
>> Following is the NUMA configuration for each NPS mode on the system:
>>
>> NPS1: Each socket is a NUMA node.
>>      Total 2 NUMA nodes in the dual socket machine.
>>
>>      Node 0: 0-63,   128-191
>>      Node 1: 64-127, 192-255
>>
>> NPS2: Each socket is further logically divided into 2 NUMA regions.
>>      Total 4 NUMA nodes exist over 2 sockets.
>>          Node 0: 0-31,   128-159
>>      Node 1: 32-63,  160-191
>>      Node 2: 64-95,  192-223
>>      Node 3: 96-127, 224-255
>>
>> NPS4: Each socket is logically divided into 4 NUMA regions.
>>      Total 8 NUMA nodes exist over 2 sockets.
>>          Node 0: 0-15,    128-143
>>      Node 1: 16-31,   144-159
>>      Node 2: 32-47,   160-175
>>      Node 3: 48-63,   176-191
>>      Node 4: 64-79,   192-207
>>      Node 5: 80-95,   208-223
>>      Node 6: 96-111,  224-239
>>      Node 7: 112-127, 240-255
>>
>> Benchmark Results:
>>
>> Kernel versions:
>> - tip:          5.19.0 tip sched/core
>> - sis_core:     5.19.0 tip sched/core + this series
>>
>> When we started testing, the tip was at:
>> commit fdf756f71271 ("sched: Fix more TASK_state comparisons")
>>
>> ~~~~~~~~~~~~~
>> ~ hackbench ~
>> ~~~~~~~~~~~~~
>>
>> o NPS1
>>
>> Test:            tip            sis_core
>>   1-groups:       4.06 (0.00 pct)       4.26 (-4.92 pct)    *
>>   1-groups:       4.14 (0.00 pct)       4.09 (1.20 pct)    [Verification Run]
>>   2-groups:       4.76 (0.00 pct)       4.71 (1.05 pct)
>>   4-groups:       5.22 (0.00 pct)       5.11 (2.10 pct)
>>   8-groups:       5.35 (0.00 pct)       5.31 (0.74 pct)
>> 16-groups:       7.21 (0.00 pct)       6.80 (5.68 pct)
>>
>> o NPS2
>>
>> Test:            tip            sis_core
>>   1-groups:       4.09 (0.00 pct)       4.08 (0.24 pct)
>>   2-groups:       4.70 (0.00 pct)       4.69 (0.21 pct)
>>   4-groups:       5.05 (0.00 pct)       4.92 (2.57 pct)
>>   8-groups:       5.35 (0.00 pct)       5.26 (1.68 pct)
>> 16-groups:       6.37 (0.00 pct)       6.34 (0.47 pct)
>>
>> o NPS4
>>
>> Test:            tip            sis_core
>>   1-groups:       4.07 (0.00 pct)       3.99 (1.96 pct)
>>   2-groups:       4.65 (0.00 pct)       4.59 (1.29 pct)
>>   4-groups:       5.13 (0.00 pct)       5.00 (2.53 pct)
>>   8-groups:       5.47 (0.00 pct)       5.43 (0.73 pct)
>> 16-groups:       6.82 (0.00 pct)       6.56 (3.81 pct)
> 
> Although each cpu will get 2.5 tasks when 16-groups, which can
> be considered overloaded, I tested in AMD EPYC 7Y83 machine and
> the total cpu usage was ~82% (with some older kernel version),
> so there is still lots of idle time.
> 
> I guess cutting off at 16-groups is because it's enough loaded
> compared to the real workloads, so testing more groups might just
> be a waste of time?

The machine has 16 LLCs, and I had previously seen some run-to-run
variance with larger group counts, so I capped the reports at 16 groups.
I'll run hackbench with a larger number of groups (32, 64, 128, 256)
and get back to you with those results, along with results for a couple
of long-running workloads.
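
For reference, the load those larger group counts imply can be sketched
as follows (hypothetical numbers only, assuming hackbench's default
fan-out of 20 sender/receiver pairs, i.e. 40 tasks per group, on the
256-cpu machine; the commented invocation is illustrative):

```shell
#!/bin/sh
# Sketch of the load implied by the planned group counts
# (assumes 40 tasks per group from hackbench's default fan-out
# of 20 sender/receiver pairs, on the 256-cpu test machine).
for g in 32 64 128 256; do
    echo "groups=$g tasks=$((g * 40)) tasks_per_cpu=$((g * 40 / 256))"
done
# e.g. run each configuration with: hackbench -g <groups> -l <loops>
```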

> 
> Thanks & Best Regards,
>     Abel
> 
> [..snip..]
>


--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS
  2022-11-15 11:28     ` K Prateek Nayak
@ 2022-11-22 11:28       ` K Prateek Nayak
  2022-11-24  3:50         ` Abel Wu
  0 siblings, 1 reply; 18+ messages in thread
From: K Prateek Nayak @ 2022-11-22 11:28 UTC (permalink / raw)
  To: Abel Wu, Peter Zijlstra, Ingo Molnar, Mel Gorman,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider
  Cc: Josh Don, Chen Yu, Tim Chen, Gautham R . Shenoy, Aubrey Li,
	Qais Yousef, Juri Lelli, Rik van Riel, Yicong Yang, Barry Song,
	linux-kernel

Hello Abel,

Following are the results for hackbench with a larger number of
groups, ycsb-mongodb, Spec-JBB, and unixbench. Apart from
a regression in unixbench spawn in NPS2 and NPS4 modes and in
unixbench syscall in NPS4 mode, everything looks good.

Detailed results are below:

~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~

o NPS1:

tip:            131696.33 (var: 2.03%)
sis_core:       129519.00 (var: 1.46%)  (-1.65%)

o NPS2:

tip:            129895.33 (var: 2.34%)
sis_core:       130774.33 (var: 2.57%)  (+0.67%)

o NPS4:

tip:            131165.00 (var: 1.06%)
sis_core:       133547.33 (var: 3.90%)  (+1.81%)

~~~~~~~~~~~~~~~~~
~ Spec-JBB NPS1 ~
~~~~~~~~~~~~~~~~~

Max-jOPS and Critical-jOPS are same as the tip kernel.

~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~

-> unixbench-dhry2reg

o NPS1

kernel:                                        tip                          sis_core
Min       unixbench-dhry2reg-1            48876615.50 (   0.00%)         48891544.00 (   0.03%)
Min       unixbench-dhry2reg-512        6260344658.90 (   0.00%)       6282967594.10 (   0.36%)
Hmean     unixbench-dhry2reg-1            49299721.81 (   0.00%)         49233828.70 (  -0.13%)
Hmean     unixbench-dhry2reg-512        6267459427.19 (   0.00%)       6288772961.79 *   0.34%*
CoeffVar  unixbench-dhry2reg-1                   0.90 (   0.00%)                0.68 (  24.66%)
CoeffVar  unixbench-dhry2reg-512                 0.10 (   0.00%)                0.10 (   7.54%)

o NPS2

kernel:                                        tip                          sis_core
Min       unixbench-dhry2reg-1            48828251.70 (   0.00%)         48856709.20 (   0.06%)
Min       unixbench-dhry2reg-512        6244987739.10 (   0.00%)       6271229549.10 (   0.42%)
Hmean     unixbench-dhry2reg-1            48869882.65 (   0.00%)         49302481.81 (   0.89%)
Hmean     unixbench-dhry2reg-512        6261073948.84 (   0.00%)       6272564898.35 (   0.18%)
CoeffVar  unixbench-dhry2reg-1                   0.08 (   0.00%)                0.87 (-945.28%)
CoeffVar  unixbench-dhry2reg-512                 0.23 (   0.00%)                0.03 (  85.94%)

o NPS4

kernel:                                        tip                          sis_core
Min       unixbench-dhry2reg-1            48523981.30 (   0.00%)         49083957.50 (   1.15%)
Min       unixbench-dhry2reg-512        6253738837.10 (   0.00%)       6271747119.10 (   0.29%)
Hmean     unixbench-dhry2reg-1            48781044.09 (   0.00%)         49232218.87 *   0.92%*
Hmean     unixbench-dhry2reg-512        6264428474.90 (   0.00%)       6280484789.64 (   0.26%)
CoeffVar  unixbench-dhry2reg-1                   0.46 (   0.00%)                0.26 (  42.63%)
CoeffVar  unixbench-dhry2reg-512                 0.17 (   0.00%)                0.21 ( -26.72%)

-> unixbench-syscall

o NPS1

kernel:                             tip                  sis_core
Min       unixbench-syscall-1    2975654.80 (   0.00%)  2978489.40 (  -0.10%)
Min       unixbench-syscall-512  7840226.50 (   0.00%)  7822133.40 (   0.23%)
Amean     unixbench-syscall-1    2976326.47 (   0.00%)  2980985.27 *  -0.16%*
Amean     unixbench-syscall-512  7850493.90 (   0.00%)  7844527.50 (   0.08%)
CoeffVar  unixbench-syscall-1          0.03 (   0.00%)        0.07 (-154.43%)
CoeffVar  unixbench-syscall-512        0.13 (   0.00%)        0.34 (-158.96%)

o NPS2

kernel:                             tip                  sis_core
Min       unixbench-syscall-1    2969863.60 (   0.00%)  2977936.50 (  -0.27%)
Min       unixbench-syscall-512  8053157.60 (   0.00%)  8072239.00 (  -0.24%)
Amean     unixbench-syscall-1    2970462.30 (   0.00%)  2981732.50 *  -0.38%*
Amean     unixbench-syscall-512  8061454.50 (   0.00%)  8079287.73 *  -0.22%*
CoeffVar  unixbench-syscall-1          0.02 (   0.00%)        0.11 (-527.26%)
CoeffVar  unixbench-syscall-512        0.12 (   0.00%)        0.08 (  37.30%)

o NPS4

kernel:                             tip                  sis_core
Min       unixbench-syscall-1    2971799.80 (   0.00%)  2979335.60 (  -0.25%)
Min       unixbench-syscall-512  7824196.90 (   0.00%)  8155610.20 (  -4.24%)
Amean     unixbench-syscall-1    2973045.43 (   0.00%)  2982036.13 *  -0.30%*
Amean     unixbench-syscall-512  7826302.17 (   0.00%)  8173026.57 *  -4.43%*   <-- Regression in syscall for larger worker count
CoeffVar  unixbench-syscall-1          0.04 (   0.00%)        0.09 (-139.63%)
CoeffVar  unixbench-syscall-512        0.03 (   0.00%)        0.20 (-701.13%)


-> unixbench-pipe

o NPS1

kernel:                               tip                  sis_core
Min       unixbench-pipe-1        2894765.30 (   0.00%)   2891505.30 (  -0.11%)
Min       unixbench-pipe-512    329818573.50 (   0.00%) 325610257.80 (  -1.28%)
Hmean     unixbench-pipe-1        2898803.38 (   0.00%)   2896940.25 (  -0.06%)
Hmean     unixbench-pipe-512    330226401.69 (   0.00%) 326311984.29 *  -1.19%*
CoeffVar  unixbench-pipe-1              0.14 (   0.00%)         0.17 ( -21.99%)
CoeffVar  unixbench-pipe-512            0.11 (   0.00%)         0.20 ( -88.38%)

o NPS2

kernel:                               tip                   sis_core
Min       unixbench-pipe-1        2895327.90 (   0.00%)    2894798.20 (  -0.02%)
Min       unixbench-pipe-512    328350065.60 (   0.00%)  325681163.10 (  -0.81%)
Hmean     unixbench-pipe-1        2899129.86 (   0.00%)    2897067.80 (  -0.07%)
Hmean     unixbench-pipe-512    329436096.80 (   0.00%)  326023030.94 *  -1.04%*
CoeffVar  unixbench-pipe-1              0.12 (   0.00%)          0.09 (  21.96%)
CoeffVar  unixbench-pipe-512            0.30 (   0.00%)          0.12 (  60.80%)

o NPS4

kernel:                               tip                   sis_core
Min       unixbench-pipe-1        2901525.60 (   0.00%)    2885730.80 (  -0.54%)
Min       unixbench-pipe-512    330265873.90 (   0.00%)  326730770.60 (  -1.07%)
Hmean     unixbench-pipe-1        2906184.70 (   0.00%)    2891616.18 *  -0.50%*
Hmean     unixbench-pipe-512    330854683.27 (   0.00%)  327113296.63 *  -1.13%*
CoeffVar  unixbench-pipe-1              0.14 (   0.00%)          0.19 ( -33.74%)
CoeffVar  unixbench-pipe-512            0.16 (   0.00%)          0.11 (  31.75%)

-> unixbench-spawn

o NPS1

kernel:                             tip                  sis_core
Min       unixbench-spawn-1       6536.50 (   0.00%)     6000.30 (  -8.20%)
Min       unixbench-spawn-512    72571.40 (   0.00%)    70829.60 (  -2.40%)
Hmean     unixbench-spawn-1       6811.16 (   0.00%)     7016.11 (   3.01%)
Hmean     unixbench-spawn-512    72801.77 (   0.00%)    71012.03 *  -2.46%*
CoeffVar  unixbench-spawn-1          3.69 (   0.00%)       13.52 (-266.69%)
CoeffVar  unixbench-spawn-512        0.27 (   0.00%)        0.22 (  18.25%)

o NPS2

kernel:                             tip                  sis_core
Min       unixbench-spawn-1       7042.20 (   0.00%)     7078.70 (   0.52%)
Min       unixbench-spawn-512    85571.60 (   0.00%)    77362.60 (  -9.59%)
Hmean     unixbench-spawn-1       7199.01 (   0.00%)     7276.55 (   1.08%)
Hmean     unixbench-spawn-512    85717.77 (   0.00%)    77923.73 *  -9.09%*     <-- Regression in spawn test for larger worker count
CoeffVar  unixbench-spawn-1          3.50 (   0.00%)        3.30 (   5.70%)
CoeffVar  unixbench-spawn-512        0.20 (   0.00%)        0.82 (-304.88%)

o NPS4

kernel:                             tip                  sis_core
Min       unixbench-spawn-1       7521.90 (   0.00%)     8102.80 (   7.72%)
Min       unixbench-spawn-512    84245.70 (   0.00%)    73074.50 ( -13.26%)
Hmean     unixbench-spawn-1       7659.12 (   0.00%)     8645.19 *  12.87%*
Hmean     unixbench-spawn-512    84908.77 (   0.00%)    73409.49 * -13.54%*     <-- Regression in spawn test for larger worker count
CoeffVar  unixbench-spawn-1          1.92 (   0.00%)        5.78 (-200.56%)
CoeffVar  unixbench-spawn-512        0.76 (   0.00%)        0.41 (  46.58%)

-> unixbench-execl

o NPS1

kernel:                             tip                  sis_core
Min       unixbench-execl-1       5421.50 (   0.00%)     5471.50 (   0.92%)
Min       unixbench-execl-512    11213.50 (   0.00%)    11677.20 (   4.14%)
Hmean     unixbench-execl-1       5443.75 (   0.00%)     5475.36 *   0.58%*
Hmean     unixbench-execl-512    11311.94 (   0.00%)    11804.52 *   4.35%*
CoeffVar  unixbench-execl-1          0.38 (   0.00%)        0.12 (  69.22%)
CoeffVar  unixbench-execl-512        1.03 (   0.00%)        1.73 ( -68.91%)

o NPS2

kernel:                             tip                  sis_core
Min       unixbench-execl-1       5089.10 (   0.00%)     5405.40 (   6.22%)
Min       unixbench-execl-512    11772.70 (   0.00%)    11917.20 (   1.23%)
Hmean     unixbench-execl-1       5321.65 (   0.00%)     5421.41 (   1.87%)
Hmean     unixbench-execl-512    12201.73 (   0.00%)    12327.95 (   1.03%)
CoeffVar  unixbench-execl-1          3.87 (   0.00%)        0.28 (  92.88%)
CoeffVar  unixbench-execl-512        6.23 (   0.00%)        5.78 (   7.21%)

o NPS4

kernel:                             tip                  sis_core
Min       unixbench-execl-1       5099.40 (   0.00%)     5479.60 (   7.46%)
Min       unixbench-execl-512    11692.80 (   0.00%)    12205.50 (   4.38%)
Hmean     unixbench-execl-1       5136.86 (   0.00%)     5487.93 *   6.83%*
Hmean     unixbench-execl-512    12053.71 (   0.00%)    12712.96 (   5.47%)
CoeffVar  unixbench-execl-1          1.05 (   0.00%)        0.14 (  86.57%)
CoeffVar  unixbench-execl-512        3.85 (   0.00%)        5.86 ( -52.14%)

For the unixbench regressions, I do not see anything obvious jump out
in the perf traces captured with IBS. top shows over 99% utilization,
which would ideally mean there are not many updates to the mask.
I'll take a closer look at the spawn test case and get back to you.

On 11/15/2022 4:58 PM, K Prateek Nayak wrote:
> Hello Abel,
> 
> Thank you for taking a look at the report.
> 
> On 11/15/2022 2:01 PM, Abel Wu wrote:
>> Hi Prateek, thanks very much for your detailed testing!
>>
>> On 11/14/22 1:45 PM, K Prateek Nayak wrote:
>>> Hello Abel,
>>>
>>> Sorry for the delay. I've tested the patch on a dual socket Zen3 system
>>> (2 x 64C/128T)
>>>
>>> tl;dr
>>>
>>> o I do not notice any regressions with the standard benchmarks.
>>> o schbench sees a nice improvement to the tail latency when the number
>>>    of workers is equal to the number of cores in the system in NPS1 and
>>>    NPS2 mode. (Marked with "^")
>>> o A few data points show improvements in tbench in NPS1 and NPS2 mode.
>>>    (Marked with "^")
>>>
>>> I'm still in the process of running larger workloads. If there is any
>>> specific workload you would like me to run on the test system, please
>>> do let me know. Below is the detailed report:
>>
>> Nothing particular comes to mind, and I think testing larger workloads
>> is great. Thanks!
>>
>>>
>>> Following are the results from running standard benchmarks on a
>>> dual socket Zen3 (2 x 64C/128T) machine configured in different
>>> NPS modes.
>>>
>>> NPS Modes are used to logically divide a single socket into
>>> multiple NUMA regions.
>>> Following is the NUMA configuration for each NPS mode on the system:
>>>
>>> NPS1: Each socket is a NUMA node.
>>>      Total 2 NUMA nodes in the dual socket machine.
>>>
>>>      Node 0: 0-63,   128-191
>>>      Node 1: 64-127, 192-255
>>>
>>> NPS2: Each socket is further logically divided into 2 NUMA regions.
>>>      Total 4 NUMA nodes exist over the 2 sockets.
>>>
>>>      Node 0: 0-31,   128-159
>>>      Node 1: 32-63,  160-191
>>>      Node 2: 64-95,  192-223
>>>      Node 3: 96-127, 224-255
>>>
>>> NPS4: Each socket is logically divided into 4 NUMA regions.
>>>      Total 8 NUMA nodes exist over the 2 sockets.
>>>
>>>      Node 0: 0-15,    128-143
>>>      Node 1: 16-31,   144-159
>>>      Node 2: 32-47,   160-175
>>>      Node 3: 48-63,   176-191
>>>      Node 4: 64-79,   192-207
>>>      Node 5: 80-95,   208-223
>>>      Node 6: 96-111,  224-239
>>>      Node 7: 112-127, 240-255
>>>
>>> Benchmark Results:
>>>
>>> Kernel versions:
>>> - tip:          5.19.0 tip sched/core
>>> - sis_core:     5.19.0 tip sched/core + this series
>>>
>>> When we started testing, the tip was at:
>>> commit fdf756f71271 ("sched: Fix more TASK_state comparisons")
>>>
>>> ~~~~~~~~~~~~~
>>> ~ hackbench ~
>>> ~~~~~~~~~~~~~
>>>
>>> o NPS1
>>>
>>> Test:            tip            sis_core
>>>   1-groups:       4.06 (0.00 pct)       4.26 (-4.92 pct)    *
>>>   1-groups:       4.14 (0.00 pct)       4.09 (1.20 pct)    [Verification Run]
>>>   2-groups:       4.76 (0.00 pct)       4.71 (1.05 pct)
>>>   4-groups:       5.22 (0.00 pct)       5.11 (2.10 pct)
>>>   8-groups:       5.35 (0.00 pct)       5.31 (0.74 pct)
>>> 16-groups:       7.21 (0.00 pct)       6.80 (5.68 pct)
>>>
>>> o NPS2
>>>
>>> Test:            tip            sis_core
>>>   1-groups:       4.09 (0.00 pct)       4.08 (0.24 pct)
>>>   2-groups:       4.70 (0.00 pct)       4.69 (0.21 pct)
>>>   4-groups:       5.05 (0.00 pct)       4.92 (2.57 pct)
>>>   8-groups:       5.35 (0.00 pct)       5.26 (1.68 pct)
>>> 16-groups:       6.37 (0.00 pct)       6.34 (0.47 pct)
>>>
>>> o NPS4
>>>
>>> Test:            tip            sis_core
>>>   1-groups:       4.07 (0.00 pct)       3.99 (1.96 pct)
>>>   2-groups:       4.65 (0.00 pct)       4.59 (1.29 pct)
>>>   4-groups:       5.13 (0.00 pct)       5.00 (2.53 pct)
>>>   8-groups:       5.47 (0.00 pct)       5.43 (0.73 pct)
>>> 16-groups:       6.82 (0.00 pct)       6.56 (3.81 pct)
>>
>> Although each cpu will get ~2.5 tasks with 16-groups, which can
>> be considered overloaded, I tested on an AMD EPYC 7Y83 machine and
>> the total cpu usage was ~82% (with some older kernel version),
>> so there is still plenty of idle time.
>>
>> I guess the cut-off at 16-groups is because that is loaded enough
>> compared to real workloads, so testing more groups might just
>> be a waste of time?
> 
> The machine has 16 LLCs so I capped the results at 16-groups.
> Previously I had seen some run-to-run variance with larger group counts
> so I limited the reports to 16-groups. I'll run hackbench with a larger
> number of groups (32, 64, 128, 256) and get back to you with the
> results, along with results for a couple of long-running workloads.

~~~~~~~~~~~~~
~ Hackbench ~
~~~~~~~~~~~~~

$ perf bench sched messaging -p -l 50000 -g <groups>

o NPS1

kernel:               tip                     sis_core
32-groups:         6.20 (0.00 pct)         5.86 (5.48 pct)
64-groups:        16.55 (0.00 pct)        15.21 (8.09 pct)
128-groups:       42.57 (0.00 pct)        34.63 (18.65 pct)
256-groups:       71.69 (0.00 pct)        67.11 (6.38 pct)
512-groups:      108.48 (0.00 pct)       110.23 (-1.61 pct)

o NPS2

kernel:                tip                     sis_core
32-groups:         6.56 (0.00 pct)         5.60 (14.63 pct)
64-groups:        15.74 (0.00 pct)        14.45 (8.19 pct)
128-groups:       39.93 (0.00 pct)        35.33 (11.52 pct)
256-groups:       74.49 (0.00 pct)        69.65 (6.49 pct)
512-groups:      112.22 (0.00 pct)       113.75 (-1.36 pct)

o NPS4:

kernel:               tip                     sis_core
32-groups:         9.48 (0.00 pct)         5.64 (40.50 pct)
64-groups:        15.38 (0.00 pct)        14.13 (8.12 pct)
128-groups:       39.93 (0.00 pct)        34.47 (13.67 pct)
256-groups:       75.31 (0.00 pct)        67.98 (9.73 pct)
512-groups:      115.37 (0.00 pct)       111.15 (3.65 pct)

Note: Hackbench with 32-groups shows run-to-run variation
on tip but is more stable with sis_core. Hackbench with
64-groups and beyond is stable on both kernels.

> 
>>
>> Thanks & Best Regards,
>>     Abel
>>
>> [..snip..]
>>
> 
> 
> --
> Thanks and Regards,
> Prateek

Apart from the couple of regressions in Unixbench, everything looks good.
If you would like me to get any more data for any workload on the test
system, please do let me know.
--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS
  2022-11-22 11:28       ` K Prateek Nayak
@ 2022-11-24  3:50         ` Abel Wu
  0 siblings, 0 replies; 18+ messages in thread
From: Abel Wu @ 2022-11-24  3:50 UTC (permalink / raw)
  To: K Prateek Nayak, Peter Zijlstra, Ingo Molnar, Mel Gorman,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider
  Cc: Josh Don, Chen Yu, Tim Chen, Gautham R . Shenoy, Aubrey Li,
	Qais Yousef, Juri Lelli, Rik van Riel, Yicong Yang, Barry Song,
	linux-kernel

Hi Prateek, thanks again for your detailed testing!

On 11/22/22 7:28 PM, K Prateek Nayak wrote:
> Hello Abel,
> 
> Following are the results for hackbench with a larger number of
> groups, ycsb-mongodb, Spec-JBB, and unixbench. Apart from
> a regression in unixbench spawn in NPS2 and NPS4 mode and in
> unixbench syscall in NPS2 mode, everything looks good.
> 
> ...
> 
> -> unixbench-syscall
> 
> o NPS4
> 
> kernel:                             tip                  sis_core
> Min       unixbench-syscall-1    2971799.80 (   0.00%)  2979335.60 (  -0.25%)
> Min       unixbench-syscall-512  7824196.90 (   0.00%)  8155610.20 (  -4.24%)
> Amean     unixbench-syscall-1    2973045.43 (   0.00%)  2982036.13 *  -0.30%*
> Amean     unixbench-syscall-512  7826302.17 (   0.00%)  8173026.57 *  -4.43%*   <-- Regression in syscall for larger worker count
> CoeffVar  unixbench-syscall-1          0.04 (   0.00%)        0.09 (-139.63%)
> CoeffVar  unixbench-syscall-512        0.03 (   0.00%)        0.20 (-701.13%)
> 
> 
> -> unixbench-spawn
> 
> o NPS1
> 
> kernel:                             tip                  sis_core
> Min       unixbench-spawn-1       6536.50 (   0.00%)     6000.30 (  -8.20%)
> Min       unixbench-spawn-512    72571.40 (   0.00%)    70829.60 (  -2.40%)
> Hmean     unixbench-spawn-1       6811.16 (   0.00%)     7016.11 (   3.01%)
> Hmean     unixbench-spawn-512    72801.77 (   0.00%)    71012.03 *  -2.46%*
> CoeffVar  unixbench-spawn-1          3.69 (   0.00%)       13.52 (-266.69%)
> CoeffVar  unixbench-spawn-512        0.27 (   0.00%)        0.22 (  18.25%)
> 
> o NPS2
> 
> kernel:                             tip                  sis_core
> Min       unixbench-spawn-1       7042.20 (   0.00%)     7078.70 (   0.52%)
> Min       unixbench-spawn-512    85571.60 (   0.00%)    77362.60 (  -9.59%)
> Hmean     unixbench-spawn-1       7199.01 (   0.00%)     7276.55 (   1.08%)
> Hmean     unixbench-spawn-512    85717.77 (   0.00%)    77923.73 *  -9.09%*     <-- Regression in spawn test for larger worker count
> CoeffVar  unixbench-spawn-1          3.50 (   0.00%)        3.30 (   5.70%)
> CoeffVar  unixbench-spawn-512        0.20 (   0.00%)        0.82 (-304.88%)
> 
> o NPS4
> 
> kernel:                             tip                  sis_core
> Min       unixbench-spawn-1       7521.90 (   0.00%)     8102.80 (   7.72%)
> Min       unixbench-spawn-512    84245.70 (   0.00%)    73074.50 ( -13.26%)
> Hmean     unixbench-spawn-1       7659.12 (   0.00%)     8645.19 *  12.87%*
> Hmean     unixbench-spawn-512    84908.77 (   0.00%)    73409.49 * -13.54%*     <-- Regression in spawn test for larger worker count
> CoeffVar  unixbench-spawn-1          1.92 (   0.00%)        5.78 (-200.56%)
> CoeffVar  unixbench-spawn-512        0.76 (   0.00%)        0.41 (  46.58%)
> 
> ...
> 
> For the unixbench regressions, I do not see anything obvious jumping out
> in the perf traces captured with IBS. top shows over 99% utilization,
> which would ideally mean there are not many updates to the mask.
> I'll take a closer look at the spawn test case and get back to you.

These regressions seem to be common to the fully parallel tests. I
guess it might be due to over-updating the idle cpumask when the LLC
is overloaded, which is unnecessary when SIS_UTIL is enabled, but I
need to dig further. Maybe the rq avg_idle or nr_idle_scan needs
to be taken into consideration as well. Thanks for providing this
important information.

> 
> ~~~~~~~~~~~~~
> ~ Hackbench ~
> ~~~~~~~~~~~~~
> 
> $ perf bench sched messaging -p -l 50000 -g <groups>
> 
> o NPS1
> 
> kernel:               tip                     sis_core
> 32-groups:         6.20 (0.00 pct)         5.86 (5.48 pct)
> 64-groups:        16.55 (0.00 pct)        15.21 (8.09 pct)
> 128-groups:       42.57 (0.00 pct)        34.63 (18.65 pct)
> 256-groups:       71.69 (0.00 pct)        67.11 (6.38 pct)
> 512-groups:      108.48 (0.00 pct)       110.23 (-1.61 pct)
> 
> o NPS2
> 
> kernel:                tip                     sis_core
> 32-groups:         6.56 (0.00 pct)         5.60 (14.63 pct)
> 64-groups:        15.74 (0.00 pct)        14.45 (8.19 pct)
> 128-groups:       39.93 (0.00 pct)        35.33 (11.52 pct)
> 256-groups:       74.49 (0.00 pct)        69.65 (6.49 pct)
> 512-groups:      112.22 (0.00 pct)       113.75 (-1.36 pct)
> 
> o NPS4:
> 
> kernel:               tip                     sis_core
> 32-groups:         9.48 (0.00 pct)         5.64 (40.50 pct)
> 64-groups:        15.38 (0.00 pct)        14.13 (8.12 pct)
> 128-groups:       39.93 (0.00 pct)        34.47 (13.67 pct)
> 256-groups:       75.31 (0.00 pct)        67.98 (9.73 pct)
> 512-groups:      115.37 (0.00 pct)       111.15 (3.65 pct)
> 
> Note: Hackbench with 32-groups shows run-to-run variation
> on tip but is more stable with sis_core. Hackbench with
> 64-groups and beyond is stable on both kernels.
> 
The results are consistent with mine except for 512-groups, which I
didn't test. The 512-groups case may suffer from the same problem
mentioned above.

Thanks & Regards,
	Abel


* Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS
  2022-10-19 12:28 [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS Abel Wu
                   ` (5 preceding siblings ...)
  2022-11-14  5:45 ` K Prateek Nayak
@ 2023-02-07  3:42 ` K Prateek Nayak
  2023-02-16 13:18   ` Abel Wu
  6 siblings, 1 reply; 18+ messages in thread
From: K Prateek Nayak @ 2023-02-07  3:42 UTC (permalink / raw)
  To: Abel Wu, Peter Zijlstra, Ingo Molnar, Mel Gorman,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider
  Cc: Josh Don, Chen Yu, Tim Chen, Gautham R . Shenoy, Aubrey Li,
	Qais Yousef, Juri Lelli, Rik van Riel, Yicong Yang, Barry Song,
	linux-kernel

Hello Abel,

I've retested the patches on the updated tip and the results
are still promising.

tl;dr

o Hackbench sees improvements when the machine is overloaded.
o tbench shows improvements when the machine is overloaded.
o The unixbench regression seen previously seems to be unrelated
  to the patch as the spawn test scores are vastly different
  after a reboot/kexec for the same kernel.
o Other benchmarks show slight improvements or are comparable to
  the numbers on tip.

Following are the results from running standard benchmarks on a
dual socket Zen3 (2 x 64C/128T) machine configured in different
NPS modes.

NPS Modes are used to logically divide a single socket into
multiple NUMA regions.
Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
    Total 2 NUMA nodes in the dual socket machine.

    Node 0: 0-63,   128-191
    Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
    Total 4 NUMA nodes exist over the 2 sockets.

    Node 0: 0-31,   128-159
    Node 1: 32-63,  160-191
    Node 2: 64-95,  192-223
    Node 3: 96-127, 224-255

NPS4: Each socket is logically divided into 4 NUMA regions.
    Total 8 NUMA nodes exist over the 2 sockets.

    Node 0: 0-15,    128-143
    Node 1: 16-31,   144-159
    Node 2: 32-47,   160-175
    Node 3: 48-63,   176-191
    Node 4: 64-79,   192-207
    Node 5: 80-95,   208-223
    Node 6: 96-111,  224-239
    Node 7: 112-127, 240-255

Following are the Kernel versions:

tip:            6.2.0-rc2 tip:sched/core at
                commit: bbd0b031509b "sched/rseq: Fix concurrency ID handling of usermodehelper kthreads"
sis_core:       tip + series

The series applied cleanly on the tip.

Benchmark Results:

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

NPS1

Test:			tip			sis_core
 1-groups:	   4.36 (0.00 pct)	   4.17 (4.35 pct)
 2-groups:	   5.17 (0.00 pct)	   5.03 (2.70 pct)
 4-groups:	   4.17 (0.00 pct)	   4.14 (0.71 pct)
 8-groups:	   4.64 (0.00 pct)	   4.63 (0.21 pct)
16-groups:	   5.43 (0.00 pct)	   5.32 (2.02 pct)

NPS2

Test:			tip			sis_core
 1-groups:	   4.43 (0.00 pct)	   4.27 (3.61 pct)
 2-groups:	   4.61 (0.00 pct)	   4.92 (-6.72 pct)	*
 2-groups:	   4.52 (0.00 pct)	   4.55 (-0.66 pct)	[Verification Run]
 4-groups:	   4.25 (0.00 pct)	   4.10 (3.52 pct)
 8-groups:	   4.91 (0.00 pct)	   4.53 (7.73 pct)
16-groups:	   5.84 (0.00 pct)	   5.54 (5.13 pct)

NPS4

Test:			tip			sis_core
 1-groups:	   4.34 (0.00 pct)	   4.23 (2.53 pct)
 2-groups:	   4.64 (0.00 pct)	   4.84 (-4.31 pct)
 4-groups:	   4.20 (0.00 pct)	   4.17 (0.71 pct)
 8-groups:	   5.21 (0.00 pct)	   5.06 (2.87 pct)
16-groups:	   6.24 (0.00 pct)	   5.60 (10.25 pct)

~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~

NPS1

#workers:	tip			sis_core
  1:	  36.00 (0.00 pct)	  23.00 (36.11 pct)
  2:	  37.00 (0.00 pct)	  37.00 (0.00 pct)
  4:	  37.00 (0.00 pct)	  38.00 (-2.70 pct)
  8:	  47.00 (0.00 pct)	  52.00 (-10.63 pct)
 16:	  64.00 (0.00 pct)	  65.00 (-1.56 pct)
 32:	 109.00 (0.00 pct)	 111.00 (-1.83 pct)
 64:	 222.00 (0.00 pct)	 215.00 (3.15 pct)
128:	 515.00 (0.00 pct)	 486.00 (5.63 pct)
256:	 39744.00 (0.00 pct)	 47808.00 (-20.28 pct)	* (Machine Overloaded ~ 2 tasks per rq)
256:	 43242.00 (0.00 pct)	 42293.00 (2.19 pct)	[Verification Run]
512:	 81280.00 (0.00 pct)	 76416.00 (5.98 pct)

NPS2

#workers:	tip			sis_core
  1:	  27.00 (0.00 pct)	  27.00 (0.00 pct)
  2:	  31.00 (0.00 pct)	  30.00 (3.22 pct)
  4:	  38.00 (0.00 pct)	  37.00 (2.63 pct)
  8:	  50.00 (0.00 pct)	  46.00 (8.00 pct)
 16:	  66.00 (0.00 pct)	  68.00 (-3.03 pct)
 32:	 116.00 (0.00 pct)	 113.00 (2.58 pct)
 64:	 210.00 (0.00 pct)	 228.00 (-8.57 pct)	*
 64:	 206.00 (0.00 pct)	 219.00 (-6.31 pct)	[Verification Run]
128:	 523.00 (0.00 pct)	 559.00 (-6.88 pct)	*
128:	 474.00 (0.00 pct)	 497.00 (-4.85 pct)	[Verification Run]
256:	 44864.00 (0.00 pct)	 47040.00 (-4.85 pct)
512:	 78464.00 (0.00 pct)	 81280.00 (-3.58 pct)

NPS4

#workers:	tip			sis_core
  1:	  32.00 (0.00 pct)	  27.00 (15.62 pct)
  2:	  32.00 (0.00 pct)	  35.00 (-9.37 pct)
  4:	  34.00 (0.00 pct)	  41.00 (-20.58 pct)
  8:	  58.00 (0.00 pct)	  58.00 (0.00 pct)
 16:	  67.00 (0.00 pct)	  69.00 (-2.98 pct)
 32:	 118.00 (0.00 pct)	 112.00 (5.08 pct)
 64:	 224.00 (0.00 pct)	 209.00 (6.69 pct)
128:	 533.00 (0.00 pct)	 519.00 (2.62 pct)
256:	 43456.00 (0.00 pct)	 45248.00 (-4.12 pct)
512:	 78976.00 (0.00 pct)	 76160.00 (3.56 pct)


~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

NPS1

Clients:	tip			sis_core
    1	 539.96 (0.00 pct)	 538.19 (-0.32 pct)
    2	 1068.21 (0.00 pct)	 1063.04 (-0.48 pct)
    4	 1994.76 (0.00 pct)	 1990.47 (-0.21 pct)
    8	 3602.30 (0.00 pct)	 3496.07 (-2.94 pct)
   16	 6075.49 (0.00 pct)	 6061.74 (-0.22 pct)
   32	 11641.07 (0.00 pct)	 11904.58 (2.26 pct)
   64	 21529.16 (0.00 pct)	 22124.81 (2.76 pct)
  128	 30852.92 (0.00 pct)	 31258.56 (1.31 pct)
  256	 51901.20 (0.00 pct)	 53249.69 (2.59 pct)
  512	 46797.40 (0.00 pct)	 54477.79 (16.41 pct)
 1024	 46057.28 (0.00 pct)	 53676.58 (16.54 pct)

NPS2

Clients:	tip			sis_core
    1	 536.11 (0.00 pct)	 541.18 (0.94 pct)
    2	 1044.58 (0.00 pct)	 1064.16 (1.87 pct)
    4	 2043.92 (0.00 pct)	 2017.84 (-1.27 pct)
    8	 3572.50 (0.00 pct)	 3494.83 (-2.17 pct)
   16	 6040.97 (0.00 pct)	 5530.10 (-8.45 pct)	*
   16	 5814.03 (0.00 pct)	 6012.33 (3.41 pct)	[Verification Run]
   32	 10794.10 (0.00 pct)	 10841.68 (0.44 pct)
   64	 20905.89 (0.00 pct)	 21438.82 (2.54 pct)
  128	 30885.39 (0.00 pct)	 30064.78 (-2.65 pct)
  256	 48901.25 (0.00 pct)	 51395.08 (5.09 pct)
  512	 49673.91 (0.00 pct)	 51725.89 (4.13 pct)
 1024	 47626.34 (0.00 pct)	 52662.01 (10.57 pct)

NPS4

Clients:	tip			sis_core
    1	 544.91 (0.00 pct)	 544.66 (-0.04 pct)
    2	 1046.49 (0.00 pct)	 1072.42 (2.47 pct)
    4	 2007.11 (0.00 pct)	 1970.05 (-1.84 pct)
    8	 3590.66 (0.00 pct)	 3670.45 (2.22 pct)
   16	 5956.60 (0.00 pct)	 6045.07 (1.48 pct)
   32	 10431.73 (0.00 pct)	 10439.40 (0.07 pct)
   64	 21563.37 (0.00 pct)	 19344.05 (-10.29 pct)	*
   64	 19387.71 (0.00 pct)	 19050.47 (-1.73 pct)	[Verification Run]
  128	 30352.16 (0.00 pct)	 26998.85 (-11.04 pct)	*
  128	 29110.99 (0.00 pct)	 29690.37 (1.99 pct)	[Verification Run]
  256	 49504.51 (0.00 pct)	 50921.66 (2.86 pct)
  512	 44916.61 (0.00 pct)	 52176.11 (16.16 pct)
 1024	 49986.21 (0.00 pct)	 51639.91 (3.30 pct)


~~~~~~~~~~
~ stream ~
~~~~~~~~~~

NPS1

10 Runs:

Test:		tip			sis_core
 Copy:	 339390.30 (0.00 pct)	 324656.88 (-4.34 pct)
Scale:	 212472.78 (0.00 pct)	 210641.39 (-0.86 pct)
  Add:	 247598.48 (0.00 pct)	 241669.10 (-2.39 pct)
Triad:	 261852.07 (0.00 pct)	 252088.55 (-3.72 pct)

100 Runs:

Test:		tip			sis_core
 Copy:	 335938.02 (0.00 pct)	 331491.32 (-1.32 pct)
Scale:	 212597.92 (0.00 pct)	 218705.84 (2.87 pct)
  Add:	 248294.62 (0.00 pct)	 243830.42 (-1.79 pct)
Triad:	 258400.88 (0.00 pct)	 248178.42 (-3.95 pct)

NPS2

10 Runs:

Test:		tip			sis_core
 Copy:	 334500.32 (0.00 pct)	 335317.70 (0.24 pct)
Scale:	 216804.76 (0.00 pct)	 217862.71 (0.48 pct)
  Add:	 250787.33 (0.00 pct)	 258839.00 (3.21 pct)
Triad:	 259451.40 (0.00 pct)	 264847.88 (2.07 pct)

100 Runs:

Test:		tip			sis_core
 Copy:	 326385.13 (0.00 pct)	 338030.70 (3.56 pct)
Scale:	 216440.37 (0.00 pct)	 230053.24 (6.28 pct)
  Add:	 255062.22 (0.00 pct)	 259197.23 (1.62 pct)
Triad:	 265442.03 (0.00 pct)	 271365.65 (2.23 pct)

NPS4

10 Runs:

Test:		tip			sis_core
 Copy:   363927.86 (0.00 pct)    361014.15 (-0.80 pct)
Scale:   238190.49 (0.00 pct)    242176.02 (1.67 pct)
  Add:   262806.49 (0.00 pct)    266348.50 (1.34 pct)
Triad:   276492.33 (0.00 pct)    276769.10 (0.10 pct)

100 Runs:

Test:		tip			sis_core
 Copy:   365041.37 (0.00 pct)    349299.35 (-4.31 pct)
Scale:   239295.27 (0.00 pct)    229944.85 (-3.90 pct)
  Add:   264085.21 (0.00 pct)    252651.56 (-4.32 pct)
Triad:   279664.56 (0.00 pct)    274254.22 (-1.93 pct)

~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~

o NPS1

tip:                    131328.67 (var: 2.97%)
sis_core:               131702.33 (var: 3.61%)	(0.28%)

o NPS2:

tip:			132482.33 (var: 2.06%)
sis_core:		132338.33 (var: 0.97%)  (-0.11%)

o NPS4:

tip:                    134130.00 (var: 4.12%)
sis_core:               133224.33 (var: 4.13%)	(-0.67%)

~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~

o NPS1

Test			Metric	  Parallelism			tip		      sis_core
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      48770555.20 (   0.00%)    49025161.73 (   0.52%)
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6268185467.60 (   0.00%)  6266351964.20 (  -0.03%)
unixbench-syscall       Amean     unixbench-syscall-1        2685321.17 (   0.00%)     2694468.30 *  -0.34%*
unixbench-syscall       Amean     unixbench-syscall-512      7291476.20 (   0.00%)     7295087.67 (  -0.05%)
unixbench-pipe          Hmean     unixbench-pipe-1           2480858.53 (   0.00%)     2536923.44 *   2.26%*
unixbench-pipe          Hmean     unixbench-pipe-512       300739256.62 (   0.00%)   303470605.93 *   0.91%*
unixbench-spawn         Hmean     unixbench-spawn-1             4358.14 (   0.00%)        4104.88 (  -5.81%)	* (Known to be unstable)
unixbench-spawn         Hmean     unixbench-spawn-1             4711.00 (   0.00%)        4006.20 ( -14.96%)	[Verification Run]
unixbench-spawn         Hmean     unixbench-spawn-512          76497.32 (   0.00%)       75555.94 *  -1.23%*
unixbench-execl         Hmean     unixbench-execl-1             4147.12 (   0.00%)        4157.33 (   0.25%)
unixbench-execl         Hmean     unixbench-execl-512          12435.26 (   0.00%)       11992.43 (  -3.56%)

o NPS2

Test			Metric	  Parallelism			tip		      sis_core
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      48872335.50 (   0.00%)    48902553.70 (   0.06%)
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6264134378.20 (   0.00%)  6260631689.40 (  -0.06%)
unixbench-syscall       Amean     unixbench-syscall-1        2683903.13 (   0.00%)     2694829.17 *  -0.41%*
unixbench-syscall       Amean     unixbench-syscall-512      7746773.60 (   0.00%)     7493782.67 *   3.27%*
unixbench-pipe          Hmean     unixbench-pipe-1           2476724.23 (   0.00%)     2537127.96 *   2.44%*
unixbench-pipe          Hmean     unixbench-pipe-512       300277350.41 (   0.00%)   302979776.19 *   0.90%*
unixbench-spawn         Hmean     unixbench-spawn-1             5026.50 (   0.00%)        4680.63 (  -6.88%)	*
unixbench-spawn         Hmean     unixbench-spawn-1             5421.70 (   0.00%)        5311.50 (  -2.03%)	[Verification Run]
unixbench-spawn         Hmean     unixbench-spawn-512          80549.70 (   0.00%)       78888.60 (  -2.06%)
unixbench-execl         Hmean     unixbench-execl-1             4151.70 (   0.00%)        3913.76 *  -5.73%*	*
unixbench-execl         Hmean     unixbench-execl-1             4304.30 (   0.00%)        4303.20 (  -0.02%)	[Verification run]
unixbench-execl         Hmean     unixbench-execl-512          13605.15 (   0.00%)       13129.23 (  -3.50%)

o NPS4

Test			Metric	  Parallelism			tip		      sis_core
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      48506771.20 (   0.00%)    48894866.70 (   0.80%)
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6280954362.50 (   0.00%)  6282759876.40 (   0.03%)
unixbench-syscall       Amean     unixbench-syscall-1        2687259.30 (   0.00%)     2695379.93 *  -0.30%*
unixbench-syscall       Amean     unixbench-syscall-512      7350275.67 (   0.00%)     7366923.73 (  -0.23%)
unixbench-pipe          Hmean     unixbench-pipe-1           2478893.01 (   0.00%)     2540015.88 *   2.47%*
unixbench-pipe          Hmean     unixbench-pipe-512       301830155.61 (   0.00%)   304305539.27 *   0.82%*
unixbench-spawn         Hmean     unixbench-spawn-1             5208.55 (   0.00%)        5273.11 (   1.24%)
unixbench-spawn         Hmean     unixbench-spawn-512          80745.79 (   0.00%)       81940.71 *   1.48%*
unixbench-execl         Hmean     unixbench-execl-1             4072.72 (   0.00%)        4126.13 *   1.31%*
unixbench-execl         Hmean     unixbench-execl-512          13746.56 (   0.00%)       12848.77 (  -6.53%)	*
unixbench-execl         Hmean     unixbench-execl-512          13898.30 (   0.00%)       13959.70 (   0.44%)	[Verification Run]
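For readers unfamiliar with the mmtests-style reporting above, the
Hmean / Amean / CoeffVar rows are per-metric aggregates over repeated
iterations. A rough sketch of the usual definitions (the exact
conventions of the reporting scripts may differ):

```python
from statistics import harmonic_mean, mean, stdev

def amean(samples):
    """Arithmetic mean, typically used for rate-like metrics."""
    return mean(samples)

def hmean(samples):
    """Harmonic mean, which damps the influence of outlier-high samples."""
    return harmonic_mean(samples)

def coeff_var(samples):
    """Coefficient of variation: sample stddev as a pct of the mean.
    Large values (like the 13.52 for spawn-1 above) flag unstable runs."""
    return stdev(samples) / mean(samples) * 100.0
```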

On 10/19/2022 5:58 PM, Abel Wu wrote:
> This patchset tries to improve SIS scan efficiency by recording idle
> cpus in a cpumask for each LLC, which is then used as the target cpuset
> in the domain scan. The cpus are recorded at CORE granularity to avoid
> tasks being stacked on the same core.
> 
> v5 -> v6:
>  - Rename SIS_FILTER to SIS_CORE as it can only be activated when
>    SMT is enabled and better describes the behavior of CORE granule
>    update & load delivery.
>  - Removed the part of limited scan for idle cores since it might be
>    better to open another thread to discuss the strategies such as
>    limited or scaled depth. But keep the part of full scan for idle
>    cores when LLC is overloaded because SIS_CORE can greatly reduce
>    the overhead of full scan in such case.
>  - Removed the state of sd_is_busy which indicates an LLC is fully
>    busy and we can safely skip the SIS domain scan. I would prefer
>    leave this to SIS_UTIL.
>  - The filter generation mechanism is replaced by in-place updates
>    during domain scan to better deal with partial scan failures.
>  - Collect Reviewed-bys from Tim Chen
> 
> v4 -> v5:
>  - Add limited scan for idle cores when overloaded, suggested by Mel
>  - Split out several patches since they are irrelevant to this scope
>  - Add quick check on ttwu_pending before core update
>  - Wrap the filter into SIS_FILTER feature, suggested by Chen Yu
>  - Move the main filter logic to the idle path, because the newidle
>    balance can bail out early if rq->avg_idle is small enough and
>    lose chances to update the filter.
> 
> v3 -> v4:
>  - Update filter in load_balance rather than in the tick
>  - Now the filter contains unoccupied cpus rather than overloaded ones
>  - Added mechanisms to deal with the false positive cases
> 
> v2 -> v3:
>  - Removed sched-idle balance feature and focus on SIS
>  - Take non-CFS tasks into consideration
>  - Several fixes/improvement suggested by Josh Don
> 
> v1 -> v2:
>  - Several optimizations on sched-idle balancing
>  - Ignore asym topos in can_migrate_task
>  - Add more benchmarks including SIS efficiency
>  - Re-organize patch as suggested by Mel Gorman
> 
> Abel Wu (4):
>   sched/fair: Skip core update if task pending
>   sched/fair: Ignore SIS_UTIL when has_idle_core
>   sched/fair: Introduce SIS_CORE
>   sched/fair: Deal with SIS scan failures
> 
>  include/linux/sched/topology.h |  15 ++++
>  kernel/sched/fair.c            | 122 +++++++++++++++++++++++++++++----
>  kernel/sched/features.h        |   7 ++
>  kernel/sched/sched.h           |   3 +
>  kernel/sched/topology.c        |   8 ++-
>  5 files changed, 141 insertions(+), 14 deletions(-)
> 

Testing with a couple of larger workloads like SpecJBB is still
underway. I'll update the thread with the results once they are done.
The idea is promising. I'll also try running schbench / hackbench pinned
in a manner such that all wakeups happen on an external LLC, to spot any
impact of rapid changes to the idle cpu mask of an external LLC.
Please let me know if you would like me to test or get data for any
particular benchmark on my test setup.

--
Thanks and Regards,
Prateek


* Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS
  2023-02-07  3:42 ` K Prateek Nayak
@ 2023-02-16 13:18   ` Abel Wu
  0 siblings, 0 replies; 18+ messages in thread
From: Abel Wu @ 2023-02-16 13:18 UTC (permalink / raw)
  To: K Prateek Nayak, Peter Zijlstra, Ingo Molnar, Mel Gorman,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider
  Cc: Josh Don, Chen Yu, Tim Chen, Gautham R . Shenoy, Aubrey Li,
	Qais Yousef, Juri Lelli, Rik van Riel, Yicong Yang, Barry Song,
	linux-kernel

Hi Prateek, thanks very much for your solid testing!

On 2/7/23 11:42 AM, K Prateek Nayak wrote:
> Hello Abel,
> 
> I've retested the patches with on the updated tip and the results
> are still promising.
> 
> tl;dr
> 
> o Hackbench sees improvements when the machine is overloaded.
> o tbench shows improvements when the machine is overloaded.
> o The unixbench regression seen previously seems to be unrelated
>    to the patch as the spawn test scores are vastly different
>    after a reboot/kexec for the same kernel.
> o Other benchmarks show slight improvements or are comparable to
>    the numbers on tip.

Cheers! Yet I still see some minor regressions in the report
below. As we discussed last time, reducing unnecessary updates
to the idle cpumask when the LLC is overloaded should help.

Thanks & Best regards,
	Abel


end of thread, other threads:[~2023-02-16 13:18 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-19 12:28 [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS Abel Wu
2022-10-19 12:28 ` [PATCH v6 1/4] sched/fair: Skip core update if task pending Abel Wu
2022-10-19 12:28 ` [PATCH v6 2/4] sched/fair: Ignore SIS_UTIL when has_idle_core Abel Wu
2022-10-19 12:28 ` [PATCH v6 3/4] sched/fair: Introduce SIS_CORE Abel Wu
2022-10-21  4:03   ` Chen Yu
2022-10-21  4:30     ` Abel Wu
2022-10-21  4:34       ` Chen Yu
2022-10-21  9:35         ` Abel Wu
2022-10-21 11:14           ` Chen Yu
2022-10-19 12:28 ` [PATCH v6 4/4] sched/fair: Deal with SIS scan failures Abel Wu
2022-11-04  7:29 ` [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS Abel Wu
2022-11-14  5:45 ` K Prateek Nayak
2022-11-15  8:31   ` Abel Wu
2022-11-15 11:28     ` K Prateek Nayak
2022-11-22 11:28       ` K Prateek Nayak
2022-11-24  3:50         ` Abel Wu
2023-02-07  3:42 ` K Prateek Nayak
2023-02-16 13:18   ` Abel Wu
