* [RFC PATCH 0/5] introduce sched-idle balancing
@ 2022-02-17 15:43 Abel Wu
  2022-02-17 15:43 ` [RFC PATCH 1/5] sched/fair: record overloaded cpus Abel Wu
                   ` (7 more replies)
  0 siblings, 8 replies; 22+ messages in thread
From: Abel Wu @ 2022-02-17 15:43 UTC (permalink / raw)
  To: Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Ingo Molnar, Juri Lelli, Mel Gorman, Peter Zijlstra,
	Steven Rostedt, Vincent Guittot
  Cc: linux-kernel

Current load balancing is mainly based on cpu capacity
and task util, which makes sense from the POV of overall
throughput. Still, some improvement can be made by
reducing the number of overloaded cfs rqs when sched-idle
or idle rqs exist.

A CFS runqueue is considered overloaded when there is
more than one pullable non-idle task on it (since sched-
idle cpus are treated as idle cpus). Idle tasks are the
ones counted in rq->cfs.idle_h_nr_running, i.e. tasks
either assigned the SCHED_IDLE policy or placed under
idle cgroups.

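To make that concrete, the check boils down to a simple
predicate on two cfs_rq counters. Below is a minimal
userspace model of it (plain C with made-up counter
values; the struct and helper names are placeholders,
not the kernel ones):

    #include <stdio.h>

    /* Simplified stand-in for the cfs_rq counters. */
    struct cfs_rq_model {
        unsigned int h_nr_running;      /* all hierarchical cfs tasks */
        unsigned int idle_h_nr_running; /* SCHED_IDLE / idle-cgroup tasks */
    };

    /* Overloaded: more than one pullable non-idle task. */
    static int cfs_rq_overloaded(const struct cfs_rq_model *cfs)
    {
        return cfs->h_nr_running - cfs->idle_h_nr_running > 1;
    }

    int main(void)
    {
        struct cfs_rq_model a = { .h_nr_running = 3, .idle_h_nr_running = 2 };
        struct cfs_rq_model b = { .h_nr_running = 3, .idle_h_nr_running = 1 };

        /* a has one non-idle task, b has two */
        printf("a overloaded: %d\n", cfs_rq_overloaded(&a)); /* 0 */
        printf("b overloaded: %d\n", cfs_rq_overloaded(&b)); /* 1 */
        return 0;
    }
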
The overloaded cfs rqs can cause performance issues for
both task types:

  - for latency critical tasks like SCHED_NORMAL ones,
    the time spent waiting in the rq will increase,
    resulting in higher p99 latency, and

  - batch tasks may not be able to make full use of
    cpu capacity if sched-idle rqs exist, and thus
    present poorer throughput.

So in short, the goal of sched-idle balancing is to
let the *non-idle tasks* make full use of cpu resources.
To achieve that, we mainly do two things:

  - pull non-idle tasks from the overloaded rqs to
    sched-idle or idle ones, and

  - never pull the last non-idle task off an rq

The mask of overloaded cpus is updated in the periodic
tick and in the idle path, on a per-LLC-domain basis.
This cpumask will also be used in SIS as a filter,
improving idle cpu searching.

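As a rough single-threaded sketch of how the mask is
maintained and then consumed by the wakeup path (a
userspace model only; the 64-cpu limit and helper names
such as filter_candidates() are assumptions made for
brevity):

    #include <stdio.h>
    #include <stdatomic.h>

    #define NR_CPUS 64                      /* assumed: fits one unsigned long */

    static unsigned long overloaded_mask;   /* per-LLC mask of overloaded cpus */
    static atomic_int nr_overloaded;        /* per-LLC counter */
    static int rq_overloaded[NR_CPUS];      /* per-rq cached state */

    /* Only state transitions touch the shared data (rate limiting). */
    static void update_overload_status(int cpu, int overloaded)
    {
        if (rq_overloaded[cpu] == overloaded)
            return;

        if (overloaded) {
            overloaded_mask |= 1UL << cpu;
            atomic_fetch_add(&nr_overloaded, 1);
        } else {
            overloaded_mask &= ~(1UL << cpu);
            atomic_fetch_sub(&nr_overloaded, 1);
        }
        rq_overloaded[cpu] = overloaded;
    }

    /* SIS-style filter: drop overloaded cpus from the candidates. */
    static unsigned long filter_candidates(unsigned long candidates)
    {
        if (atomic_load(&nr_overloaded))
            candidates &= ~overloaded_mask;
        return candidates;
    }

    int main(void)
    {
        unsigned long candidates = 0xful;   /* cpus 0-3 allowed */

        update_overload_status(2, 1);       /* cpu 2 becomes overloaded */
        printf("search mask: %#lx\n", filter_candidates(candidates)); /* 0xb */
        return 0;
    }
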
Tests were done on an Intel Xeon E5-2650 v4 server with
2 NUMA nodes, each of which has 12 cores, and with SMT2
enabled, so 48 CPUs in total. Test results are listed
as follows.

  - we used the perf messaging benchmark to measure
    throughput at different loads (groups); the numbers
    below are total run time in seconds (lower is better).

      perf bench sched messaging -g [N] -l 40000

	N	w/o	w/	diff
	1	2.897	2.834	-2.17%
	3	5.156	4.904	-4.89%
	5	7.850	7.617	-2.97%
	10	15.140	14.574	-3.74%
	20	29.387	27.602	-6.07%

    the results show an approximate 2~6% improvement.

  - and schbench to test latency in two scenarios:
    quiet and noisy. In the quiet test, we run schbench
    in a normal cpu cgroup on an otherwise quiet system,
    while the noisy test additionally runs the perf
    messaging workload inside an idle cgroup as noise.
    Latencies below are in usec.

      schbench -m 2 -t 24 -i 60 -r 60
      perf bench sched messaging -g 1 -l 4000000

	[quiet]
			w/o	w/
	50.0th		31	31
	75.0th		45	45
	90.0th		55	55
	95.0th		62	61
	*99.0th		85	86
	99.5th		565	318
	99.9th		11536	10992
	max		13029	13067

	[noisy]
			w/o	w/
	50.0th		34	32
	75.0th		48	45
	90.0th		58	55
	95.0th		65	61
	*99.0th		2364	208
	99.5th		6696	2068
	99.9th		12688	8816
	max		15209	14191

    it can be seen that the quiet test results are
    quite similar, but the p99 latency is greatly
    improved in the noisy test.

Comments and tests are appreciated!

Abel Wu (5):
  sched/fair: record overloaded cpus
  sched/fair: introduce sched-idle balance
  sched/fair: add stats for sched-idle balancing
  sched/fair: filter out overloaded cpus in sis
  sched/fair: favor cpu capacity for idle tasks

 include/linux/sched/idle.h     |   1 +
 include/linux/sched/topology.h |  15 ++++
 kernel/sched/core.c            |   1 +
 kernel/sched/fair.c            | 187 ++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h           |   6 ++
 kernel/sched/stats.c           |   5 +-
 kernel/sched/topology.c        |   4 +-
 7 files changed, 215 insertions(+), 4 deletions(-)

-- 
2.11.0



* [RFC PATCH 1/5] sched/fair: record overloaded cpus
  2022-02-17 15:43 [RFC PATCH 0/5] introduce sched-idle balancing Abel Wu
@ 2022-02-17 15:43 ` Abel Wu
  2022-02-24  7:10   ` Gautham R. Shenoy
  2022-02-17 15:43 ` [RFC PATCH 2/5] sched/fair: introduce sched-idle balance Abel Wu
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Abel Wu @ 2022-02-17 15:43 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira
  Cc: linux-kernel

A CFS runqueue is considered overloaded when there is
more than one pullable non-idle task on it (since sched-
idle cpus are treated as idle cpus). Idle tasks are the
ones counted in rq->cfs.idle_h_nr_running, i.e. tasks
either assigned the SCHED_IDLE policy or placed under
idle cgroups.

The overloaded cfs rqs can cause performance issues for
both task types:

  - for latency critical tasks like SCHED_NORMAL ones,
    the time spent waiting in the rq will increase,
    resulting in higher p99 latency, and

  - batch tasks may not be able to make full use of
    cpu capacity if sched-idle rqs exist, and thus
    present poorer throughput.

The mask of overloaded cpus is updated in the periodic
tick and in the idle path, on a per-LLC-domain basis.
This cpumask will also be used in SIS as a filter,
improving idle cpu searching.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 include/linux/sched/topology.h | 10 ++++++++++
 kernel/sched/core.c            |  1 +
 kernel/sched/fair.c            | 43 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h           |  6 ++++++
 kernel/sched/topology.c        |  4 +++-
 5 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 56cffe42abbc..03c9c81dc886 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -81,6 +81,16 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+
+	/*
+	 * The above variables are used in the idle path and
+	 * select_task_rq, while the following two are mainly
+	 * updated in the tick. They are all hot but serve
+	 * different purposes, so start a new cacheline to
+	 * avoid false sharing.
+	 */
+	atomic_t	nr_overloaded	____cacheline_aligned;
+	unsigned long	overloaded[];	/* Must be last */
 };
 
 struct sched_domain {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1d863d7f6ad7..a6da2998ec49 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9423,6 +9423,7 @@ void __init sched_init(void)
 		rq->wake_stamp = jiffies;
 		rq->wake_avg_idle = rq->avg_idle;
 		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
+		rq->overloaded = 0;
 
 		INIT_LIST_HEAD(&rq->cfs_tasks);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c4bfffe8c2c..0a0438c3319b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6968,6 +6968,46 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	return newidle_balance(rq, rf) != 0;
 }
+
+static inline int cfs_rq_overloaded(struct rq *rq)
+{
+	return rq->cfs.h_nr_running - rq->cfs.idle_h_nr_running > 1;
+}
+
+/* Must be called with rq locked */
+static void update_overload_status(struct rq *rq)
+{
+	struct sched_domain_shared *sds;
+	int overloaded = cfs_rq_overloaded(rq);
+	int cpu = cpu_of(rq);
+
+	lockdep_assert_rq_held(rq);
+
+	if (rq->overloaded == overloaded)
+		return;
+
+	rcu_read_lock();
+	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (unlikely(!sds))
+		goto unlock;
+
+	if (overloaded) {
+		cpumask_set_cpu(cpu, sdo_mask(sds));
+		atomic_inc(&sds->nr_overloaded);
+	} else {
+		cpumask_clear_cpu(cpu, sdo_mask(sds));
+		atomic_dec(&sds->nr_overloaded);
+	}
+
+	rq->overloaded = overloaded;
+unlock:
+	rcu_read_unlock();
+}
+
+#else
+
+static inline void update_overload_status(struct rq *rq) { }
+
 #endif /* CONFIG_SMP */
 
 static unsigned long wakeup_gran(struct sched_entity *se)
@@ -7315,6 +7355,8 @@ done: __maybe_unused;
 	if (new_tasks > 0)
 		goto again;
 
+	update_overload_status(rq);
+
 	/*
 	 * rq is about to be idle, check if we need to update the
 	 * lost_idle_time of clock_pelt
@@ -11131,6 +11173,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
 
+	update_overload_status(rq);
 	update_misfit_status(curr, rq);
 	update_overutilized_status(task_rq(curr));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9b33ba9c3c42..c81a87082b8b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1012,6 +1012,7 @@ struct rq {
 
 	unsigned char		nohz_idle_balance;
 	unsigned char		idle_balance;
+	unsigned char		overloaded;
 
 	unsigned long		misfit_task_load;
 
@@ -1762,6 +1763,11 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
 	return sd;
 }
 
+static inline struct cpumask *sdo_mask(struct sched_domain_shared *sds)
+{
+	return to_cpumask(sds->overloaded);
+}
+
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index e6cd55951304..641f11415819 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1623,6 +1623,8 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
 		atomic_inc(&sd->shared->ref);
 		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+		atomic_set(&sd->shared->nr_overloaded, 0);
+		cpumask_clear(sdo_mask(sd->shared));
 	}
 
 	sd->private = sdd;
@@ -2050,7 +2052,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 
 			*per_cpu_ptr(sdd->sd, j) = sd;
 
-			sds = kzalloc_node(sizeof(struct sched_domain_shared),
+			sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
 					GFP_KERNEL, cpu_to_node(j));
 			if (!sds)
 				return -ENOMEM;
-- 
2.11.0



* [RFC PATCH 2/5] sched/fair: introduce sched-idle balance
  2022-02-17 15:43 [RFC PATCH 0/5] introduce sched-idle balancing Abel Wu
  2022-02-17 15:43 ` [RFC PATCH 1/5] sched/fair: record overloaded cpus Abel Wu
@ 2022-02-17 15:43 ` Abel Wu
  2022-02-17 15:43 ` [RFC PATCH 3/5] sched/fair: add stats for sched-idle balancing Abel Wu
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Abel Wu @ 2022-02-17 15:43 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira
  Cc: linux-kernel

The goal of sched-idle balancing is to let the non-idle
tasks make full use of cpu resources. To achieve that, we
mainly do two things:

  - pull non-idle tasks from the overloaded rqs to
    sched-idle or idle ones, and

  - never pull the last non-idle task off an rq

We do sched-idle balancing during normal load balancing
and on the newly-idle path if necessary. The idle
balancing path is skipped due to its high wakeup latency.

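A condensed userspace model of the pull decision above,
as a rough sketch only (the struct and helper names are
placeholders, not the kernel ones; the real checks live
in can_migrate_task() below):

    #include <stdio.h>

    struct task_model { int h_idle; };       /* hierarchically idle task? */
    struct rq_model { int nr_non_idle; };    /* pullable non-idle tasks   */

    /*
     * Sched-idle pull decision: only non-idle tasks are pulled,
     * and never the last non-idle task of the source rq.
     */
    static int can_pull(const struct rq_model *src, const struct task_model *p)
    {
        if (p->h_idle)                       /* rule 1: skip idle tasks   */
            return 0;
        if (src->nr_non_idle <= 1)           /* rule 2: keep the last one */
            return 0;
        return 1;
    }

    int main(void)
    {
        struct rq_model busy = { .nr_non_idle = 3 }, lean = { .nr_non_idle = 1 };
        struct task_model normal = { .h_idle = 0 }, idle = { .h_idle = 1 };

        printf("%d %d %d\n", can_pull(&busy, &normal),   /* 1: pull it       */
                             can_pull(&busy, &idle),     /* 0: idle task     */
                             can_pull(&lean, &normal));  /* 0: last non-idle */
        return 0;
    }
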
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 include/linux/sched/idle.h |   1 +
 kernel/sched/fair.c        | 128 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 129 insertions(+)

diff --git a/include/linux/sched/idle.h b/include/linux/sched/idle.h
index d73d314d59c6..50ec5c770f85 100644
--- a/include/linux/sched/idle.h
+++ b/include/linux/sched/idle.h
@@ -8,6 +8,7 @@ enum cpu_idle_type {
 	CPU_IDLE,
 	CPU_NOT_IDLE,
 	CPU_NEWLY_IDLE,
+	CPU_SCHED_IDLE,
 	CPU_MAX_IDLE_TYPES
 };
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0a0438c3319b..070a6fb1d2bf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -456,6 +456,21 @@ static int se_is_idle(struct sched_entity *se)
 	return cfs_rq_is_idle(group_cfs_rq(se));
 }
 
+/* Is task idle from the top hierarchy POV */
+static int task_h_idle(struct task_struct *p)
+{
+	struct sched_entity *se = &p->se;
+
+	if (task_has_idle_policy(p))
+		return 1;
+
+	for_each_sched_entity(se)
+		if (cfs_rq_is_idle(cfs_rq_of(se)))
+			return 1;
+
+	return 0;
+}
+
 #else	/* !CONFIG_FAIR_GROUP_SCHED */
 
 #define for_each_sched_entity(se) \
@@ -508,6 +523,11 @@ static int se_is_idle(struct sched_entity *se)
 	return 0;
 }
 
+static inline int task_h_idle(struct task_struct *p)
+{
+	return task_has_idle_policy(p);
+}
+
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
 static __always_inline
@@ -6974,6 +6994,11 @@ static inline int cfs_rq_overloaded(struct rq *rq)
 	return rq->cfs.h_nr_running - rq->cfs.idle_h_nr_running > 1;
 }
 
+static inline bool need_pull_cfs_task(struct rq *rq)
+{
+	return rq->cfs.h_nr_running == rq->cfs.idle_h_nr_running;
+}
+
 /* Must be called with rq locked */
 static void update_overload_status(struct rq *rq)
 {
@@ -7767,6 +7792,22 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	if (kthread_is_per_cpu(p))
 		return 0;
 
+	/*
+	 * Disregard hierarchically idle tasks during sched-idle
+	 * load balancing.
+	 */
+	if (env->idle == CPU_SCHED_IDLE && task_h_idle(p))
+		return 0;
+
+	/*
+	 * Skip p if it is the last non-idle task in src_rq. This
+	 * protects latency and throughput for non-idle tasks at
+	 * the cost of temporary load imbalance (which will probably
+	 * be fixed soon).
+	 */
+	if (!cfs_rq_overloaded(env->src_rq) && !task_h_idle(p))
+		return 0;
+
 	if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
 		int cpu;
 
@@ -10265,6 +10306,83 @@ static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
 }
 
 /*
+ * The sched-idle balancing tries to eliminate overloaded cfs rqs
+ * by spreading out non-idle tasks prior to normal load balancing.
+ */
+static void sched_idle_balance(struct rq *dst_rq)
+{
+	struct sched_domain *sd;
+	struct task_struct *p;
+	int dst_cpu = cpu_of(dst_rq), cpu;
+
+	sd = rcu_dereference(per_cpu(sd_llc, dst_cpu));
+	if (unlikely(!sd))
+		return;
+
+	if (!atomic_read(&sd->shared->nr_overloaded))
+		return;
+
+	for_each_cpu_wrap(cpu, sdo_mask(sd->shared), dst_cpu + 1) {
+		struct rq *rq = cpu_rq(cpu);
+		struct rq_flags rf;
+		struct lb_env env;
+
+		if (cpu == dst_cpu)
+			continue;
+
+		if (!cfs_rq_overloaded(rq))
+			continue;
+
+		rq_lock_irqsave(rq, &rf);
+
+		/*
+		 * Check again to ensure there are pullable tasks.
+		 * This is necessary because multiple rqs can pull
+		 * tasks at the same time. IOW contention on this
+		 * rq is heavy, so it is better to clear this cpu
+		 * from the overloaded mask.
+		 */
+		if (unlikely(!cfs_rq_overloaded(rq))) {
+			update_overload_status(rq);
+			rq_unlock_irqrestore(rq, &rf);
+			continue;
+		}
+
+		env = (struct lb_env) {
+			.sd		= sd,
+			.dst_cpu	= dst_cpu,
+			.dst_rq		= dst_rq,
+			.src_cpu	= cpu,
+			.src_rq		= rq,
+			.idle		= CPU_SCHED_IDLE, /* non-idle only */
+			.flags		= LBF_DST_PINNED, /* pin dst_cpu */
+		};
+
+		update_rq_clock(rq);
+		p = detach_one_task(&env);
+
+		/*
+		 * Update the overloaded mask lazily here. If the rq
+		 * is still overloaded, updating it just wastes cycles.
+		 * And it's OK even if the rq becomes un-overloaded,
+		 * since peeking at the rq's data without the lock in
+		 * later loops costs little (during which the rq can
+		 * even become overloaded again).
+		 */
+
+		rq_unlock(rq, &rf);
+
+		if (p) {
+			attach_one_task(dst_rq, p);
+			local_irq_restore(rf.flags);
+			return;
+		}
+
+		local_irq_restore(rf.flags);
+	}
+}
+
+/*
  * It checks each scheduling domain to see if it is due to be balanced,
  * and initiates a balancing operation if so.
  *
@@ -10284,6 +10402,10 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 	u64 max_cost = 0;
 
 	rcu_read_lock();
+
+	if (need_pull_cfs_task(rq))
+		sched_idle_balance(rq);
+
 	for_each_domain(cpu, sd) {
 		/*
 		 * Decay the newidle max times here because this is a regular
@@ -10913,6 +11035,12 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	update_blocked_averages(this_cpu);
 
 	rcu_read_lock();
+
+	sched_idle_balance(this_rq);
+	t1 = sched_clock_cpu(this_cpu);
+	curr_cost += t1 - t0;
+	t0 = t1;
+
 	for_each_domain(this_cpu, sd) {
 		int continue_balancing = 1;
 		u64 domain_cost;
-- 
2.11.0



* [RFC PATCH 3/5] sched/fair: add stats for sched-idle balancing
  2022-02-17 15:43 [RFC PATCH 0/5] introduce sched-idle balancing Abel Wu
  2022-02-17 15:43 ` [RFC PATCH 1/5] sched/fair: record overloaded cpus Abel Wu
  2022-02-17 15:43 ` [RFC PATCH 2/5] sched/fair: introduce sched-idle balance Abel Wu
@ 2022-02-17 15:43 ` Abel Wu
  2022-02-17 15:44 ` [RFC PATCH 4/5] sched/fair: filter out overloaded cpus in sis Abel Wu
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Abel Wu @ 2022-02-17 15:43 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira
  Cc: linux-kernel

To better understand the behavior of sched-idle balancing, add
some statistics as the other load balancing mechanisms do.

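For reference, a small consumer-side sketch (not part of
the patch) that prints the new counters; it assumes the
three sib_* fields end up as the last three columns of
each domainN line in /proc/schedstat, which is how they
are appended here:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Print the last three fields of every "domainN ..." line. */
    int main(void)
    {
        char line[4096];
        FILE *fp = fopen("/proc/schedstat", "r");

        if (!fp) {
            perror("fopen");
            return 1;
        }

        while (fgets(line, sizeof(line), fp)) {
            unsigned long long f[64];
            char name[32];
            char *tok;
            int n = 0;

            if (strncmp(line, "domain", 6))
                continue;

            tok = strtok(line, " \n");        /* "domainN"        */
            snprintf(name, sizeof(name), "%s", tok);
            strtok(NULL, " \n");              /* skip the cpumask */

            while ((tok = strtok(NULL, " \n")) && n < 64)
                f[n++] = strtoull(tok, NULL, 10);

            if (n >= 3)
                printf("%s sib_peeked=%llu sib_pulled=%llu sib_failed=%llu\n",
                       name, f[n - 3], f[n - 2], f[n - 1]);
        }

        fclose(fp);
        return 0;
    }
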
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 include/linux/sched/topology.h | 5 +++++
 kernel/sched/fair.c            | 6 +++++-
 kernel/sched/stats.c           | 5 +++--
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 03c9c81dc886..4259963d3e5e 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -150,6 +150,11 @@ struct sched_domain {
 	unsigned int ttwu_wake_remote;
 	unsigned int ttwu_move_affine;
 	unsigned int ttwu_move_balance;
+
+	/* sched-idle balancing */
+	unsigned int sib_peeked;
+	unsigned int sib_pulled;
+	unsigned int sib_failed;
 #endif
 #ifdef CONFIG_SCHED_DEBUG
 	char *name;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 070a6fb1d2bf..c83c0864e429 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10330,8 +10330,10 @@ static void sched_idle_balance(struct rq *dst_rq)
 		if (cpu == dst_cpu)
 			continue;
 
-		if (!cfs_rq_overloaded(rq))
+		if (!cfs_rq_overloaded(rq)) {
+			schedstat_inc(sd->sib_peeked);
 			continue;
+		}
 
 		rq_lock_irqsave(rq, &rf);
 
@@ -10375,10 +10377,12 @@ static void sched_idle_balance(struct rq *dst_rq)
 		if (p) {
 			attach_one_task(dst_rq, p);
 			local_irq_restore(rf.flags);
+			schedstat_inc(sd->sib_pulled);
 			return;
 		}
 
 		local_irq_restore(rf.flags);
+		schedstat_inc(sd->sib_failed);
 	}
 }
 
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 07dde2928c79..3ee476c72806 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -164,12 +164,13 @@ static int show_schedstat(struct seq_file *seq, void *v)
 				    sd->lb_nobusyg[itype]);
 			}
 			seq_printf(seq,
-				   " %u %u %u %u %u %u %u %u %u %u %u %u\n",
+				   " %u %u %u %u %u %u %u %u %u %u %u %u %u %u %u\n",
 			    sd->alb_count, sd->alb_failed, sd->alb_pushed,
 			    sd->sbe_count, sd->sbe_balanced, sd->sbe_pushed,
 			    sd->sbf_count, sd->sbf_balanced, sd->sbf_pushed,
 			    sd->ttwu_wake_remote, sd->ttwu_move_affine,
-			    sd->ttwu_move_balance);
+			    sd->ttwu_move_balance, sd->sib_peeked,
+			    sd->sib_pulled, sd->sib_failed);
 		}
 		rcu_read_unlock();
 #endif
-- 
2.11.0



* [RFC PATCH 4/5] sched/fair: filter out overloaded cpus in sis
  2022-02-17 15:43 [RFC PATCH 0/5] introduce sched-idle balancing Abel Wu
                   ` (2 preceding siblings ...)
  2022-02-17 15:43 ` [RFC PATCH 3/5] sched/fair: add stats for sched-idle balancing Abel Wu
@ 2022-02-17 15:44 ` Abel Wu
  2022-02-17 15:44 ` [RFC PATCH 5/5] sched/fair: favor cpu capacity for idle tasks Abel Wu
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Abel Wu @ 2022-02-17 15:44 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira
  Cc: linux-kernel

Skip overloaded cpus in SIS, if there are any. This
improves idle cpu searching, especially under the
SIS_PROP constraint where the search depth is limited.

The mask of overloaded cpus might not be entirely
accurate since it is generally updated at tick
granularity, but the overloaded cpus are unlikely to
go idle shortly anyway.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 kernel/sched/fair.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c83c0864e429..1d8f396e6f41 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6273,6 +6273,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
+	if (atomic_read(&sd->shared->nr_overloaded))
+		cpumask_andnot(cpus, cpus, sdo_mask(sd->shared));
+
 	if (sched_feat(SIS_PROP) && !has_idle_core) {
 		u64 avg_cost, avg_idle, span_avg;
 		unsigned long now = jiffies;
-- 
2.11.0



* [RFC PATCH 5/5] sched/fair: favor cpu capacity for idle tasks
  2022-02-17 15:43 [RFC PATCH 0/5] introduce sched-idle balancing Abel Wu
                   ` (3 preceding siblings ...)
  2022-02-17 15:44 ` [RFC PATCH 4/5] sched/fair: filter out overloaded cpus in sis Abel Wu
@ 2022-02-17 15:44 ` Abel Wu
  2022-02-24  3:19 ` [RFC PATCH 0/5] introduce sched-idle balancing Abel Wu
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Abel Wu @ 2022-02-17 15:44 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira
  Cc: linux-kernel

Unlike select_idle_sibling(), where we need to find a
not-so-bad candidate as soon as possible, the slowpath
gives us more tolerance: ignore sched-idle cpus for idle
tasks, since they prefer cpu capacity over latency; and
besides, spreading out idle tasks is also good for the
latency of normal tasks.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 kernel/sched/fair.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d8f396e6f41..57f1d8c43228 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6007,6 +6007,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu);
 static int
 find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 {
+	bool ignore_si = task_h_idle(p);
 	unsigned long load, min_load = ULONG_MAX;
 	unsigned int min_exit_latency = UINT_MAX;
 	u64 latest_idle_timestamp = 0;
@@ -6025,7 +6026,13 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 		if (!sched_core_cookie_match(rq, p))
 			continue;
 
-		if (sched_idle_cpu(i))
+		/*
+		 * Idle tasks prefer cpu capacity rather than
+		 * latency. Spreading out idle tasks is also good
+		 * for the latency of normal tasks since they won't
+		 * suffer high cpu wakeup delays.
+		 */
+		if (!ignore_si && sched_idle_cpu(i))
 			return i;
 
 		if (available_idle_cpu(i)) {
-- 
2.11.0



* Re: [RFC PATCH 0/5] introduce sched-idle balancing
  2022-02-17 15:43 [RFC PATCH 0/5] introduce sched-idle balancing Abel Wu
                   ` (4 preceding siblings ...)
  2022-02-17 15:44 ` [RFC PATCH 5/5] sched/fair: favor cpu capacity for idle tasks Abel Wu
@ 2022-02-24  3:19 ` Abel Wu
  2022-02-24 15:20 ` Peter Zijlstra
  2022-02-24 16:47 ` Mel Gorman
  7 siblings, 0 replies; 22+ messages in thread
From: Abel Wu @ 2022-02-24  3:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ben Segall, Juri Lelli, Steven Rostedt, Mel Gorman,
	Vincent Guittot, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Ingo Molnar, linux-kernel, Abel Wu

Ping :)



* Re: [RFC PATCH 1/5] sched/fair: record overloaded cpus
  2022-02-17 15:43 ` [RFC PATCH 1/5] sched/fair: record overloaded cpus Abel Wu
@ 2022-02-24  7:10   ` Gautham R. Shenoy
  2022-02-24 14:36     ` Abel Wu
  2022-02-27  8:08     ` Aubrey Li
  0 siblings, 2 replies; 22+ messages in thread
From: Gautham R. Shenoy @ 2022-02-24  7:10 UTC (permalink / raw)
  To: Abel Wu
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel, srikar, aubrey.li

Hello Abel,

(+ Aubrey Li, Srikar)

On Thu, Feb 17, 2022 at 11:43:57PM +0800, Abel Wu wrote:
> An CFS runqueue is considered overloaded when there are
> more than one pullable non-idle tasks on it (since sched-
> idle cpus are treated as idle cpus). And idle tasks are
> counted towards rq->cfs.idle_h_nr_running, that is either
> assigned SCHED_IDLE policy or placed under idle cgroups.
> 
> The overloaded cfs rqs can cause performance issues to
> both task types:
> 
>   - for latency critical tasks like SCHED_NORMAL,
>     time of waiting in the rq will increase and
>     result in higher pct99 latency, and
> 
>   - batch tasks may not be able to make full use
>     of cpu capacity if sched-idle rq exists, thus
>     presents poorer throughput.
> 
> The mask of overloaded cpus is updated in periodic tick
> and the idle path at the LLC domain basis. This cpumask
> will also be used in SIS as a filter, improving idle cpu
> searching.

This is an interesting approach to minimise the tail latencies by
keeping track of the overloaded cpus in the LLC so that
idle/sched-idle CPUs can pull from them. This approach contrasts with the
following approaches that were previously tried :

1. Maintain the idle cpumask at the LLC level by Aubrey Li
   https://lore.kernel.org/all/1615872606-56087-1-git-send-email-aubrey.li@intel.com/
   
2. Maintain the identity of the idle core itself at the LLC level, by Srikar :
   https://lore.kernel.org/lkml/20210513074027.543926-3-srikar@linux.vnet.ibm.com/

There have been concerns in the past about having to update the shared
mask/counter at regular intervals. Srikar, Aubrey, any thoughts on
this?



> 
> @@ -7315,6 +7355,8 @@ done: __maybe_unused;
>  	if (new_tasks > 0)
>  		goto again;
>  
> +	update_overload_status(rq);
> +

So here, we are calling update_overload_status() after
newidle_balance(). If we pulled a single task as part of
newidle_balance(), in your current code we do not update the overload
status. While this should get remedied at the next tick, should we
move update_overload_status(rq) prior to the new_tasks > 0 check?


--
Thanks and Regards
gautham.


* Re: [RFC PATCH 1/5] sched/fair: record overloaded cpus
  2022-02-24  7:10   ` Gautham R. Shenoy
@ 2022-02-24 14:36     ` Abel Wu
  2022-02-27  8:08     ` Aubrey Li
  1 sibling, 0 replies; 22+ messages in thread
From: Abel Wu @ 2022-02-24 14:36 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel, srikar, aubrey.li,
	Abel Wu

Hi Gautham, thanks for your comment!

On 2/24/22 3:10 PM, Gautham R. Shenoy wrote:
> Hello Abel,
> 
> (+ Aubrey Li, Srikar)
> 
> On Thu, Feb 17, 2022 at 11:43:57PM +0800, Abel Wu wrote:
>> An CFS runqueue is considered overloaded when there are
>> more than one pullable non-idle tasks on it (since sched-
>> idle cpus are treated as idle cpus). And idle tasks are
>> counted towards rq->cfs.idle_h_nr_running, that is either
>> assigned SCHED_IDLE policy or placed under idle cgroups.
>>
>> The overloaded cfs rqs can cause performance issues to
>> both task types:
>>
>>    - for latency critical tasks like SCHED_NORMAL,
>>      time of waiting in the rq will increase and
>>      result in higher pct99 latency, and
>>
>>    - batch tasks may not be able to make full use
>>      of cpu capacity if sched-idle rq exists, thus
>>      presents poorer throughput.
>>
>> The mask of overloaded cpus is updated in periodic tick
>> and the idle path at the LLC domain basis. This cpumask
>> will also be used in SIS as a filter, improving idle cpu
>> searching.
> 
> This is an interesting approach to minimise the tail latencies by
> keeping track of the overloaded cpus in the LLC so that
> idle/sched-idle CPUs can pull from them. This approach contrasts with the
> following approaches that were previously tried :
> 
> 1. Maintain the idle cpumask at the LLC level by Aubrey Li
>     https://lore.kernel.org/all/1615872606-56087-1-git-send-email-aubrey.li@intel.com/

It's a similar approach to SIS from a different angle. Both have pros
and cons, and I couldn't tell which one is more appropriate.
But since SIS can fail to find an idle cpu due to scaling issues,
sched-idle load balancing might be a valuable supplement to it in
consuming the idle/sched-idle cpus.

>     
> 2. Maintain the identity of the idle core itself at the LLC level, by Srikar :
>     https://lore.kernel.org/lkml/20210513074027.543926-3-srikar@linux.vnet.ibm.com/

Srikar's efforts seem focused on idle core searching, which
has a different goal from my approach, I think.
The case of short-running tasks pointed out by Vincent should not be a
problem for updating the overloaded cpu mask/counter, since they are
updated neither when a cpu becomes busy, nor when a cpu frequently
goes idle during a tick period.

> 
> There have been concerns in the past about having to update the shared
> mask/counter at regular intervals. Srikar, Aubrey any thoughts on this
> ?
> 

I'm afraid I haven't fully caught up with those discussions; it would
be appreciated if someone could shed some light, thanks!

> 
> 
>>
>> @@ -7315,6 +7355,8 @@ done: __maybe_unused;
>>   	if (new_tasks > 0)
>>   		goto again;
>>   
>> +	update_overload_status(rq);
>> +
> 
> So here, we are calling update_overload_status() after
> newidle_balance(). If we had pulled a single task as a part of
> newidle_balance(), in your current code, we do not update the overload
> status. While this should get remedied in the next tick, should we
> move update_overload_status(rq) prior to the new_tasks > 0 check ?

A single task won't change the overloaded status :)
And I think it would be better not to do an update even if several
tasks were pulled, because that would break the rate limiting, which
is undesired.

Best Regards,
	Abel

> 
> 
> --
> Thanks and Regards
> gautham.


* Re: [RFC PATCH 0/5] introduce sched-idle balancing
  2022-02-17 15:43 [RFC PATCH 0/5] introduce sched-idle balancing Abel Wu
                   ` (5 preceding siblings ...)
  2022-02-24  3:19 ` [RFC PATCH 0/5] introduce sched-idle balancing Abel Wu
@ 2022-02-24 15:20 ` Peter Zijlstra
  2022-02-24 15:29   ` Vincent Guittot
  2022-02-25  6:46   ` Abel Wu
  2022-02-24 16:47 ` Mel Gorman
  7 siblings, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2022-02-24 15:20 UTC (permalink / raw)
  To: Abel Wu
  Cc: Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Ingo Molnar, Juri Lelli, Mel Gorman, Steven Rostedt,
	Vincent Guittot, linux-kernel

On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
> Current load balancing is mainly based on cpu capacity
> and task util, which makes sense in the POV of overall
> throughput. While there still might be some improvement
> can be done by reducing number of overloaded cfs rqs if
> sched-idle or idle rq exists.

I'm much confused, there is an explicit new-idle balancer and a periodic
idle balancer already there.


* Re: [RFC PATCH 0/5] introduce sched-idle balancing
  2022-02-24 15:20 ` Peter Zijlstra
@ 2022-02-24 15:29   ` Vincent Guittot
  2022-02-25  6:51     ` Abel Wu
  2022-02-25  6:46   ` Abel Wu
  1 sibling, 1 reply; 22+ messages in thread
From: Vincent Guittot @ 2022-02-24 15:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Abel Wu, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Ingo Molnar, Juri Lelli, Mel Gorman,
	Steven Rostedt, linux-kernel

On Thu, 24 Feb 2022 at 16:20, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
> > Current load balancing is mainly based on cpu capacity
> > and task util, which makes sense in the POV of overall
> > throughput. While there still might be some improvement
> > can be done by reducing number of overloaded cfs rqs if
> > sched-idle or idle rq exists.
>
> I'm much confused, there is an explicit new-idle balancer and a periodic
> idle balancer already there.

I agree. You failed to explain why newly_idle and periodic idle load
balance are not enough and why we need this new one.


* Re: [RFC PATCH 0/5] introduce sched-idle balancing
  2022-02-17 15:43 [RFC PATCH 0/5] introduce sched-idle balancing Abel Wu
                   ` (6 preceding siblings ...)
  2022-02-24 15:20 ` Peter Zijlstra
@ 2022-02-24 16:47 ` Mel Gorman
  2022-02-25  8:15   ` Abel Wu
  7 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2022-02-24 16:47 UTC (permalink / raw)
  To: Abel Wu
  Cc: Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Ingo Molnar, Juri Lelli, Peter Zijlstra, Steven Rostedt,
	Vincent Guittot, linux-kernel

On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
> Current load balancing is mainly based on cpu capacity
> and task util, which makes sense in the POV of overall
> throughput. While there still might be some improvement
> can be done by reducing number of overloaded cfs rqs if
> sched-idle or idle rq exists.
> 
> An CFS runqueue is considered overloaded when there are
> more than one pullable non-idle tasks on it (since sched-
> idle cpus are treated as idle cpus). And idle tasks are
> counted towards rq->cfs.idle_h_nr_running, that is either
> assigned SCHED_IDLE policy or placed under idle cgroups.
> 

It's not clear how your tests evaluated the balancing of SCHED_IDLE tasks
versus the existing idle balancing and isolated that impact. I suspect
the tests may have primarily measured the effect of the SIS filter.

> So in short, the goal of the sched-idle balancing is to
> let the *non-idle tasks* make full use of cpu resources.
> To achieve that, we mainly do two things:
> 
>   - pull non-idle tasks for sched-idle or idle rqs
>     from the overloaded ones, and
> 
>   - prevent pulling the last non-idle task in an rq
> 
> The mask of overloaded cpus is updated in periodic tick
> and the idle path at the LLC domain basis. This cpumask
> will also be used in SIS as a filter, improving idle cpu
> searching.
> 

As the overloaded mask may be updated on each idle, it could be a
significant source of cache misses between CPUs sharing the domain for
workloads that rapidly idle so there should be data on whether cache misses
are increased heavily. It also potentially delays the CPU reaching idle
but it may not be by much.

The filter may be out of date. It takes up to one tick to detect
overloaded and the filter to have a positive impact. As a CPU is not
guaranteed to enter idle if there is at least one CPU-bound task, it may
also be up to 1 tick before the mask is cleared. I'm not sure this is a
serious problem though as SIS would not pick the CPU with the CPU-bound
task anyway.

At minimum, the filter should be split out and considered first as it
is the most likely reason why a performance difference was measured. It
has some oddities like why nr_overloaded is really a boolean and as
it's under rq lock, it's not clear why it's atomic. The changelog
would ideally contain some comment on the impact to cache misses
if any and some sort of proof that SIS search depth is reduced which
https://lore.kernel.org/lkml/20210726102247.21437-2-mgorman@techsingularity.net/
may be some help.

At that point, compare the idle task balancing on top to isolate how
much it improves things if any and identify why existing balancing is
insufficient. Split out the can_migrate_task change beforehand in case it
is the main source of difference as opposed to the new balancing mechanism.

-- 
Mel Gorman
SUSE Labs


* Re: [RFC PATCH 0/5] introduce sched-idle balancing
  2022-02-24 15:20 ` Peter Zijlstra
  2022-02-24 15:29   ` Vincent Guittot
@ 2022-02-25  6:46   ` Abel Wu
  2022-02-25  8:29     ` Vincent Guittot
  1 sibling, 1 reply; 22+ messages in thread
From: Abel Wu @ 2022-02-25  6:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Ingo Molnar, Juri Lelli, Mel Gorman, Steven Rostedt,
	Vincent Guittot, linux-kernel

Hi Peter,

On 2/24/22 11:20 PM, Peter Zijlstra Wrote:
> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
>> Current load balancing is mainly based on cpu capacity
>> and task util, which makes sense in the POV of overall
>> throughput. While there still might be some improvement
>> can be done by reducing number of overloaded cfs rqs if
>> sched-idle or idle rq exists.
> 
> I'm much confused, there is an explicit new-idle balancer and a periodic
> idle balancer already there.

The two balancers are triggered on the rqs that have no tasks on them,
and load_balance() doesn't seem to show a preference for non-idle tasks,
so there is a possibility that only idle tasks are pulled during load
balance while overloaded rqs (rq->cfs.h_nr_running > 1) exist. As a
result the normal tasks, mostly latency-critical ones in our case, on
such an overloaded rq still suffer from waiting for each other. I
observed this through perf sched.

IOW the main difference from the POV of load_balance() between the
latency-critical tasks and the idle ones is load.

The sched-idle balancer is triggered periodically on the sched-idle
rqs and on the newly-idle ones. It does a 'fast' pull of non-idle tasks
from the overloaded rqs to the sched-idle/idle ones to let the non-idle
tasks make full use of cpu resources.

The sched-idle balancer only focuses on non-idle tasks' performance, so
it can introduce overall load imbalance, and that's why I put it before
load_balance().

Best Regards,
	Abel


* Re: [RFC PATCH 0/5] introduce sched-idle balancing
  2022-02-24 15:29   ` Vincent Guittot
@ 2022-02-25  6:51     ` Abel Wu
  0 siblings, 0 replies; 22+ messages in thread
From: Abel Wu @ 2022-02-25  6:51 UTC (permalink / raw)
  To: Vincent Guittot, Peter Zijlstra
  Cc: Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Ingo Molnar, Juri Lelli, Mel Gorman, Steven Rostedt,
	linux-kernel, Abel Wu

On 2/24/22 11:29 PM, Vincent Guittot Wrote:
> On Thu, 24 Feb 2022 at 16:20, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
>>> Current load balancing is mainly based on cpu capacity
>>> and task util, which makes sense in the POV of overall
>>> throughput. While there still might be some improvement
>>> can be done by reducing number of overloaded cfs rqs if
>>> sched-idle or idle rq exists.
>>
>> I'm much confused, there is an explicit new-idle balancer and a periodic
>> idle balancer already there.
> 
> I agree, You failed to explain why newly_idle and periodic idle load
> balance are not enough and we need this new one

Hi Vincent, sorry for not giving a clearer explanation. Please check
my previous email replying to Peter, thanks.

Best Regards,
	Abel


* Re: [RFC PATCH 0/5] introduce sched-idle balancing
  2022-02-24 16:47 ` Mel Gorman
@ 2022-02-25  8:15   ` Abel Wu
  2022-02-25 10:16     ` Mel Gorman
  0 siblings, 1 reply; 22+ messages in thread
From: Abel Wu @ 2022-02-25  8:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Ingo Molnar, Juri Lelli, Peter Zijlstra, Steven Rostedt,
	Vincent Guittot, linux-kernel, Abel Wu

Hi Mel,

On 2/25/22 12:47 AM, Mel Gorman Wrote:
> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
>> Current load balancing is mainly based on cpu capacity
>> and task util, which makes sense in the POV of overall
>> throughput. While there still might be some improvement
>> can be done by reducing number of overloaded cfs rqs if
>> sched-idle or idle rq exists.
>>
>> An CFS runqueue is considered overloaded when there are
>> more than one pullable non-idle tasks on it (since sched-
>> idle cpus are treated as idle cpus). And idle tasks are
>> counted towards rq->cfs.idle_h_nr_running, that is either
>> assigned SCHED_IDLE policy or placed under idle cgroups.
>>
> 
> It's not clear how your tests evaluated the balancing of SCHED_IDLE tasks
> versus the existing idle balancing and isolated that impact. I suspect
> the tests may primarily measured the effect of the SIS filter.

The sched-idle balancing doesn't really care about the idle tasks.
It tries to improve the non-idle tasks' performance by spreading
them out to make full use of cpu capacity.

I will do some individual tests on the SIS filter and on the
sched-idle balancer separately, and keep you informed.

> 
>> So in short, the goal of the sched-idle balancing is to
>> let the *non-idle tasks* make full use of cpu resources.
>> To achieve that, we mainly do two things:
>>
>>    - pull non-idle tasks for sched-idle or idle rqs
>>      from the overloaded ones, and
>>
>>    - prevent pulling the last non-idle task in an rq
>>
>> The mask of overloaded cpus is updated in periodic tick
>> and the idle path at the LLC domain basis. This cpumask
>> will also be used in SIS as a filter, improving idle cpu
>> searching.
>>
> 
> As the overloaded mask may be updated on each idle, it could be a
> significant source of cache misses between CPUs sharing the domain for
> workloads that rapidly idle so there should be data on whether cache misses
> are increased heavily. It also potentially delays the CPU reaching idle
> but it may not be by much.

Yes, that's why I cached the overloaded status in rq->overloaded. So
in this case of short-running tasks, when cpus rapidly/frequently go
idle, the cpumask/counter are not actually updated if the cached
status is already 0 (not overloaded).

> 
> The filter may be out of date. It takes up to one tick to detect
> overloaded and the filter to have a positive impact. As a CPU is not
> guaranteed to enter idle if there is at least one CPU-bound task, it may
> also be up to 1 tick before the mask is cleared. I'm not sure this is a
> serious problem though as SIS would not pick the CPU with the CPU-bound
> task anyway.

Yes, it can be out of date, but increasing the accuracy means more
frequent updates, which would introduce the cache issues you mentioned
above. Rate-limiting the updates to the tick, at the LLC level, might
be an acceptable tradeoff I presume.

> 
> At minimum, the filter should be split out and considered first as it
> is the most likely reason why a performance difference was measured. It
> has some oddities like why nr_overloaded is really a boolean and as
> it's under rq lock, it's not clear why it's atomic. The changelog
> would ideally contain some comment on the impact to cache misses
> if any and some sort of proof that SIS search depth is reduced which
> https://lore.kernel.org/lkml/20210726102247.21437-2-mgorman@techsingularity.net/
> may be some help.
> 
> At that point, compare the idle task balancing on top to isolate how
> much it improves things if any and identify why existing balancing is
> insufficient. Split out the can_migrate_task change beforehand in case it
> is the main source of difference as opposed to the new balancing mechanism.
> 

The nr_overloaded counter sits in the shared domain structure, thus
shared across the LLC domain, and needs to be atomic_t, while
rq->overloaded is a boolean updated under the rq lock. Maybe the
naming causes some confusion; please enlighten me if you have a
better idea :)

And yes, I agree it would be nice if test results on SIS search
depth could be shown. I actually did the test, but the result
didn't show a reduction in depth because sched-idle balancing
will also consume sched-idle/idle cpus. I will apply your patch
and do some further tests on that, thanks.

Best Regards,
	Abel


* Re: [RFC PATCH 0/5] introduce sched-idle balancing
  2022-02-25  6:46   ` Abel Wu
@ 2022-02-25  8:29     ` Vincent Guittot
  2022-02-25 10:46       ` Abel Wu
  0 siblings, 1 reply; 22+ messages in thread
From: Vincent Guittot @ 2022-02-25  8:29 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Ingo Molnar, Juri Lelli, Mel Gorman,
	Steven Rostedt, linux-kernel

On Fri, 25 Feb 2022 at 07:46, Abel Wu <wuyun.abel@bytedance.com> wrote:
>
> Hi Peter,
>
> On 2/24/22 11:20 PM, Peter Zijlstra Wrote:
> > On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
> >> Current load balancing is mainly based on cpu capacity
> >> and task util, which makes sense in the POV of overall
> >> throughput. While there still might be some improvement
> >> can be done by reducing number of overloaded cfs rqs if
> >> sched-idle or idle rq exists.
> >
> > I'm much confused, there is an explicit new-idle balancer and a periodic
> > idle balancer already there.
>
> The two balancers are triggered on the rqs that have no tasks on them,
> and load_balance() seems don't show a preference for non-idle tasks so

The load balance will happen at the idle pace if a sched_idle task is
running on the cpu, so you will have an ILB on each cpu that runs a
sched-idle task.

> there might be possibility that only idle tasks are pulled during load
> balance while overloaded rqs (rq->cfs.h_nr_running > 1) exist. As a

There is an LB_MIN feature (disabled by default) that filters out
tasks with very low load (< 16), which includes sched-idle tasks,
which have a max load of 3.

> result the normal tasks, mostly latency-critical ones in our case, on
> that overloaded rq still suffer waiting for each other. I observed this
> through perf sched.
>
> IOW the main difference from the POV of load_balance() between the
> latency-critical tasks and the idle ones is load.
>
> The sched-idle balancer is triggered on the sched-idle rqs periodically
> and the newly-idle ones. It does a 'fast' pull of non-idle tasks from
> the overloaded rqs to the sched-idle/idle ones to let the non-idle tasks
> make full use of cpu resources.
>
> The sched-idle balancer only focuses on non-idle tasks' performance, so
> it can introduce overall load imbalance, and that's why I put it before
> load_balance().

Given the very low weight of a sched-idle task, I don't expect
much imbalance because of sched-idle tasks. But this also depends on
the number of sched-idle tasks.


>
> Best Regards,
>         Abel


* Re: [RFC PATCH 0/5] introduce sched-idle balancing
  2022-02-25  8:15   ` Abel Wu
@ 2022-02-25 10:16     ` Mel Gorman
  2022-02-25 13:20       ` Abel Wu
  0 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2022-02-25 10:16 UTC (permalink / raw)
  To: Abel Wu
  Cc: Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Ingo Molnar, Juri Lelli, Peter Zijlstra, Steven Rostedt,
	Vincent Guittot, linux-kernel

On Fri, Feb 25, 2022 at 04:15:06PM +0800, Abel Wu wrote:
> > As the overloaded mask may be updated on each idle, it could be a
> > significant source of cache misses between CPUs sharing the domain for
> > workloads that rapidly idle so there should be data on whether cache misses
> > are increased heavily. It also potentially delays the CPU reaching idle
> > but it may not be by much.
> 
> Yes, that's why I cached overloaded status in rq->overloaded. So in
> this case of short running tasks, when cpus rapidly/frequently go
> idle, the cpumask/counter are not actually updated if the cached
> status is already 0 (not overloaded).
> 

Which is a good idea in some respects. It tries to limit the number of
updates and treats it as a boolean, but it's probably prone to races.

> > The filter may be out of date. It takes up to one tick to detect
> > overloaded and the filter to have a positive impact. As a CPU is not
> > guaranteed to enter idle if there is at least one CPU-bound task, it may
> > also be up to 1 tick before the mask is cleared. I'm not sure this is a
> > serious problem though as SIS would not pick the CPU with the CPU-bound
> > task anyway.
> 
> Yes, it can be out of date, but increasing the accuracy means more
> frequent update which would introduce cache issues you mentioned
> above. Rate limit the updating to tick at the LLC basis might be an
> acceptable tradeoff I presume.
> 
> > 
> > At minimum, the filter should be split out and considered first as it
> > is the most likely reason why a performance difference was measured. It
> > has some oddities like why nr_overloaded is really a boolean and as
> > it's under rq lock, it's not clear why it's atomic. The changelog
> > would ideally contain some comment on the impact to cache misses
> > if any and some sort of proof that SIS search depth is reduced which
> > https://lore.kernel.org/lkml/20210726102247.21437-2-mgorman@techsingularity.net/
> > may be some help.
> > 
> > At that point, compare the idle task balancing on top to isolate how
> > much it improves things if any and identify why existing balancing is
> > insufficient. Split out the can_migrate_task change beforehand in case it
> > is the main source of difference as opposed to the new balancing mechanism.
> > 
> 
> The nr_overloaded sits in shared domain structure thus shared in
> LLC domain and needs to be atomic_t, while rq->overloaded is a
> boolean updated under rq lock. Maybe the naming can cause some
> confusion, please lighten me up if you have better idea :)
> 

The naming doesn't help because it's not really "the number of overloaded
rq's". atomic_t would be slightly safer against parallel updates, but
it's still race-prone. I didn't think about it deeply, but I suspect that
two separate rq's could disagree on what the boolean value should be if
one rq is overloaded, the other is not, and they are updating via the idle
path at the same time. This can probably happen because the locking is
rq-based and the cpumask is shared. On the flip side, making it an accurate
count would result in more updates and incur cache misses, as well as
probably needing a cpumask check instead of a nr_overloaded comparison to
determine whether the rq is already accounted for, so it costs more. You
are very likely trading accuracy against the cost of updates.

Whichever choice you make, add comments on the pros/cons and describe
the limitations of either approach. E.g. if overloaded is effectively a
boolean, describe its limitations in a comment.

> And yes, I agree it would be nice if test result on SIS search
> depth can be shown, and I actually did the test, but the result
> didn't show a reduction in depth due to sched-idle balancing
> will also consume sched-idle/idle cpus. I will apply your patch
> and make some further tests on that, thanks.
> 

Just remember to use the patch to measure changes in SIS depth, but
performance figures should not include the patch, as the schedstat
overhead distorts results.

Also, place the filter first and measure any change from the new
balancing against the filter alone. I'm suggesting placing the filter first
because it's less controversial than a new balancer. Just be aware that
the filter alone is not a guarantee of merging, as there have been a few
approaches to filtering and so far all of them had downsides on either cost
or accuracy. IIRC the only active approach to reducing search cost in SIS
is https://lore.kernel.org/all/20220207034013.599214-1-yu.c.chen@intel.com/
and it's likely to get a new version due to
https://lore.kernel.org/all/20220207135253.GF23216@worktop.programming.kicks-ass.net/.
It also updates sched_domain_shared but with a single boolean instead of
an atomic+cpumask.

-- 
Mel Gorman
SUSE Labs


* Re: [RFC PATCH 0/5] introduce sched-idle balancing
  2022-02-25  8:29     ` Vincent Guittot
@ 2022-02-25 10:46       ` Abel Wu
  2022-02-25 13:15         ` Vincent Guittot
  0 siblings, 1 reply; 22+ messages in thread
From: Abel Wu @ 2022-02-25 10:46 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Ingo Molnar, Juri Lelli, Mel Gorman,
	Steven Rostedt, linux-kernel, Abel Wu

On 2/25/22 4:29 PM, Vincent Guittot Wrote:
> On Fri, 25 Feb 2022 at 07:46, Abel Wu <wuyun.abel@bytedance.com> wrote:
>>
>> Hi Peter,
>>
>> On 2/24/22 11:20 PM, Peter Zijlstra Wrote:
>>> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
>>>> Current load balancing is mainly based on cpu capacity
>>>> and task util, which makes sense in the POV of overall
>>>> throughput. While there still might be some improvement
>>>> can be done by reducing number of overloaded cfs rqs if
>>>> sched-idle or idle rq exists.
>>>
>>> I'm much confused, there is an explicit new-idle balancer and a periodic
>>> idle balancer already there.
>>
>> The two balancers are triggered on the rqs that have no tasks on them,
>> and load_balance() seems don't show a preference for non-idle tasks so
> 
> The load balance will happen at the idle pace if a sched_idle task is
> running on the cpu so you will have an ILB on each cpu that run a
> sched-idle task

I'm afraid I don't quite follow you. Since the sched-idle balancer doesn't
touch the ILB part, can you elaborate on this? Thanks.

> 
>> there might be possibility that only idle tasks are pulled during load
>> balance while overloaded rqs (rq->cfs.h_nr_running > 1) exist. As a
> 
> There is a LB_MIN feature (disable by default) that filters task with
> very low load ( < 16) which includes sched-idle task which has a max
> load of 3

This feature might not be that friendly to the situation where only
sched-idle tasks are running in the system. And this situation
can last more than half a day in our co-location systems, in which
the training/batch tasks are placed under idle cgroups or directly
assigned to SCHED_IDLE.
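
For context, "directly assigned to SCHED_IDLE" means something like the
minimal userspace sketch below (sched_setattr() has no glibc wrapper, so
it goes through syscall(); "chrt -i 0 <cmd>" does the same thing):

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/sched/types.h>	/* struct sched_attr */

static int make_self_sched_idle(void)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_policy	= SCHED_IDLE,
	};

	/* pid 0 == calling thread, flags == 0 */
	return syscall(SYS_sched_setattr, 0, &attr, 0);
}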

> 
>> result the normal tasks, mostly latency-critical ones in our case, on
>> that overloaded rq still suffer waiting for each other. I observed this
>> through perf sched.
>>
>> IOW the main difference from the POV of load_balance() between the
>> latency-critical tasks and the idle ones is load.
>>
>> The sched-idle balancer is triggered on the sched-idle rqs periodically
>> and the newly-idle ones. It does a 'fast' pull of non-idle tasks from
>> the overloaded rqs to the sched-idle/idle ones to let the non-idle tasks
>> make full use of cpu resources.
>>
>> The sched-idle balancer only focuses on non-idle tasks' performance, so
>> it can introduce overall load imbalance, and that's why I put it before
>> load_balance().
> 
> According to the very low weight of a sched-idle task, I don't expect
> much imbalance because of sched-idle tasks. But this also depends of
> the number of sched-idle task.
> 
> 
>>
>> Best Regards,
>>          Abel


* Re: [RFC PATCH 0/5] introduce sched-idle balancing
  2022-02-25 10:46       ` Abel Wu
@ 2022-02-25 13:15         ` Vincent Guittot
  0 siblings, 0 replies; 22+ messages in thread
From: Vincent Guittot @ 2022-02-25 13:15 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Ingo Molnar, Juri Lelli, Mel Gorman,
	Steven Rostedt, linux-kernel

On Fri, 25 Feb 2022 at 11:46, Abel Wu <wuyun.abel@bytedance.com> wrote:
>
> On 2/25/22 4:29 PM, Vincent Guittot Wrote:
> > On Fri, 25 Feb 2022 at 07:46, Abel Wu <wuyun.abel@bytedance.com> wrote:
> >>
> >> Hi Peter,
> >>
> >> On 2/24/22 11:20 PM, Peter Zijlstra Wrote:
> >>> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
> >>>> Current load balancing is mainly based on cpu capacity
> >>>> and task util, which makes sense in the POV of overall
> >>>> throughput. While there still might be some improvement
> >>>> can be done by reducing number of overloaded cfs rqs if
> >>>> sched-idle or idle rq exists.
> >>>
> >>> I'm much confused, there is an explicit new-idle balancer and a periodic
> >>> idle balancer already there.
> >>
> >> The two balancers are triggered on the rqs that have no tasks on them,
> >> and load_balance() seems don't show a preference for non-idle tasks so
> >
> > The load balance will happen at the idle pace if a sched_idle task is
> > running on the cpu so you will have an ILB on each cpu that run a
> > sched-idle task
>
> I'm afraid I don't quite follow you, since sched-idle balancer doesn't
> touch the ILB part, can you elaborate on this? Thanks.

I was referring to your sentence "The two balancers are triggered on
the rqs that have no tasks on them". When there are only sched-idle
tasks on a rq, the load_balance behaves like the Idle Load Balance when
there is no task, i.e. it runs as often.

>
> >
> >> there might be possibility that only idle tasks are pulled during load
> >> balance while overloaded rqs (rq->cfs.h_nr_running > 1) exist. As a
> >
> > There is a LB_MIN feature (disable by default) that filters task with
> > very low load ( < 16) which includes sched-idle task which has a max
> > load of 3

But we could easily change this, e.g. if !sched_idle_cpus then LB can
migrate only cfs tasks, otherwise it can migrate sched_idle tasks as well,
instead of creating another side channel.

>
> This feature might not that friendly to the situation that only
> sched-idle tasks are running in the system. And this situation
> can last more than half a day in our co-location systems in which
> the training/batch tasks are placed under idle groups or directly
> assigned to SCHED_IDLE.
>
> >
> >> result the normal tasks, mostly latency-critical ones in our case, on
> >> that overloaded rq still suffer waiting for each other. I observed this
> >> through perf sched.
> >>
> >> IOW the main difference from the POV of load_balance() between the
> >> latency-critical tasks and the idle ones is load.
> >>
> >> The sched-idle balancer is triggered on the sched-idle rqs periodically
> >> and the newly-idle ones. It does a 'fast' pull of non-idle tasks from
> >> the overloaded rqs to the sched-idle/idle ones to let the non-idle tasks
> >> make full use of cpu resources.
> >>
> >> The sched-idle balancer only focuses on non-idle tasks' performance, so
> >> it can introduce overall load imbalance, and that's why I put it before
> >> load_balance().
> >
> > According to the very low weight of a sched-idle task, I don't expect
> > much imbalance because of sched-idle tasks. But this also depends of
> > the number of sched-idle task.
> >
> >
> >>
> >> Best Regards,
> >>          Abel


* Re: [RFC PATCH 0/5] introduce sched-idle balancing
  2022-02-25 10:16     ` Mel Gorman
@ 2022-02-25 13:20       ` Abel Wu
  2022-03-02  0:41         ` Josh Don
  0 siblings, 1 reply; 22+ messages in thread
From: Abel Wu @ 2022-02-25 13:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Ingo Molnar, Juri Lelli, Peter Zijlstra, Steven Rostedt,
	Vincent Guittot, linux-kernel, Abel Wu

Hi Mel, thanks a lot for your review!

On 2/25/22 6:16 PM, Mel Gorman Wrote:
> On Fri, Feb 25, 2022 at 04:15:06PM +0800, Abel Wu wrote:
>>> As the overloaded mask may be updated on each idle, it could be a
>>> significant source of cache misses between CPUs sharing the domain for
>>> workloads that rapidly idle so there should be data on whether cache misses
>>> are increased heavily. It also potentially delays the CPU reaching idle
>>> but it may not be by much.
>>
>> Yes, that's why I cached overloaded status in rq->overloaded. So in
>> this case of short running tasks, when cpus rapidly/frequently go
>> idle, the cpumask/counter are not actually updated if the cached
>> status is already 0 (not overloaded).
>>
> 
> Which is a good idea in some respects. It tries to limit the number of
> updates and treats it as a boolean but it's probably prone to races.
> 
>>> The filter may be out of date. It takes up to one tick to detect
>>> overloaded and the filter to have a positive impact. As a CPU is not
>>> guaranteed to enter idle if there is at least one CPU-bound task, it may
>>> also be up to 1 tick before the mask is cleared. I'm not sure this is a
>>> serious problem though as SIS would not pick the CPU with the CPU-bound
>>> task anyway.
>>
>> Yes, it can be out of date, but increasing the accuracy means more
>> frequent update which would introduce cache issues you mentioned
>> above. Rate limit the updating to tick at the LLC basis might be an
>> acceptable tradeoff I presume.
>>
>>>
>>> At minimum, the filter should be split out and considered first as it
>>> is the most likely reason why a performance difference was measured. It
>>> has some oddities like why nr_overloaded is really a boolean and as
>>> it's under rq lock, it's not clear why it's atomic. The changelog
>>> would ideally contain some comment on the impact to cache misses
>>> if any and some sort of proof that SIS search depth is reduced which
>>> https://lore.kernel.org/lkml/20210726102247.21437-2-mgorman@techsingularity.net/
>>> may be some help.
>>>
>>> At that point, compare the idle task balancing on top to isolate how
>>> much it improves things if any and identify why existing balancing is
>>> insufficient. Split out the can_migrate_task change beforehand in case it
>>> is the main source of difference as opposed to the new balancing mechanism.
>>>
>>
>> The nr_overloaded sits in shared domain structure thus shared in
>> LLC domain and needs to be atomic_t, while rq->overloaded is a
>> boolean updated under rq lock. Maybe the naming can cause some
>> confusion, please lighten me up if you have better idea :)
>>
> 
> The naming doesn't help because it's not really "the number of overloaded
> rq's". atomic_t would be slightly safer against parallel updates but
> it's race prone. I didn't think about it deeply but I suspect that two
> separate rq's could disagree on what the boolean value should be if one rq
> is overloaded, the other is not and they are updating via the idle path at
> the same time. This probably can happen because the locking is rq based
> and the cpumask is shared. On the flip side, making it an accurate count
> would result in more updates and incur cache misses as well as probably
> needing a cpumask check instead of a nr_overloaded comparison to determine
> if the rq is already accounted for so it costs more. You are very likely
> trading accuracy versus cost of update.

The boolean value (rq->overloaded) is accessed under the rq lock, and is
almost always accessed by its own rq, except for the very rare case in
sched_idle_balance() where a double check on cfs_rq_overloaded() fails.
So this value should be accurate and have good data locality.
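
To make that concrete, the update path is roughly the sketch below;
cfs_rq_overloaded() and the shared field names are as used in this
discussion, not necessarily exactly as in the patch:

static void update_overloaded_rq(struct rq *rq)
{
	struct sched_domain_shared *sds;
	bool overloaded = cfs_rq_overloaded(rq);  /* >1 pullable non-idle tasks */

	lockdep_assert_rq_held(rq);

	/* cached per-rq state: bail out early, no shared cacheline touched */
	if (rq->overloaded == overloaded)
		return;

	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu_of(rq)));
	if (!sds)
		return;

	if (overloaded) {
		cpumask_set_cpu(cpu_of(rq), to_cpumask(sds->overloaded_mask));
		atomic_inc(&sds->nr_overloaded);
	} else {
		cpumask_clear_cpu(cpu_of(rq), to_cpumask(sds->overloaded_mask));
		atomic_dec(&sds->nr_overloaded);
	}

	rq->overloaded = overloaded;
}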

But as you said, nr_overloaded and the cpumask are race-prone in the
following pattern in my patches:

	if (nr_overloaded > 0)
		/* nr_overloaded can be zero now */
		read(overloaded_mask);

Since the mask is accessed without the rq locked, the cost might not be
too much. This is quite similar to the unlocked idle_cpu() usage in SIS,
I guess.
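
Concretely, the lockless reader side in SIS would be roughly the sketch
below (same assumed field names as above), reading the shared state
without any rq lock, just like the idle_cpu() checks:

	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
	struct sched_domain_shared *sds = sd->shared;

	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

	/* filter: drop cpus currently marked overloaded before scanning */
	if (sds && atomic_read(&sds->nr_overloaded))
		cpumask_andnot(cpus, cpus, to_cpumask(sds->overloaded_mask));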

> 
> Whichever choice you make, add comments on the pros/cons and describe
> the limitation of either approach. e.g. if overloaded is effectively a
> boolean, describe in a comment the limitations.

OK, will do.

> 
>> And yes, I agree it would be nice if test result on SIS search
>> depth can be shown, and I actually did the test, but the result
>> didn't show a reduction in depth due to sched-idle balancing
>> will also consume sched-idle/idle cpus. I will apply your patch
>> and make some further tests on that, thanks.
>>
> 
> Just remember to use the patch to measure changes in SIS depth but
> performance figures should not include the patch as the schedstat
> overhead distorts results.

Yes, agreed.

> 
> Also place the filter first and do any measurements of any change to
> balancing versus the filter. I'm suggesting placing the filter first
> because it's less controversial than a new balancer. Just be aware that
> the filter alone is not a guarantee of merging as there have been a few
> approaches to filtering and so far all of them had downsides on either cost

Yes, understood. I will adjust the patches as you suggested and send v2
together with more tests next week.

> or accuracy. IIRC the only active approach to reducing search cost in SIS
> is https://lore.kernel.org/all/20220207034013.599214-1-yu.c.chen@intel.com/
> and it's likely to get a new version due to
> https://lore.kernel.org/all/20220207135253.GF23216@worktop.programming.kicks-ass.net/.
> It also updates sched_domain_shared but with a single boolean instead of
> an atomic+cpumask.
> 

Chen Yu's patch disables idle cpu searching in SIS when the LLC domain
is overloaded (that is, 85% capacity usage), and Peter suggested he use
this metric to replace/improve the SIS_PROP feature so that the search
depth varies gently.

I don't think either of the two approaches conflicts with mine: they
reduce the effort of searching when the system is busy and cpus are
not likely to be idle, while mine has the sched-idle/idle cpus consume
themselves by pulling non-idle tasks from overloaded rqs, so there will
be fewer sched-idle/idle cpus.

Thanks and best regards,
	Abel


* Re: [RFC PATCH 1/5] sched/fair: record overloaded cpus
  2022-02-24  7:10   ` Gautham R. Shenoy
  2022-02-24 14:36     ` Abel Wu
@ 2022-02-27  8:08     ` Aubrey Li
  1 sibling, 0 replies; 22+ messages in thread
From: Aubrey Li @ 2022-02-27  8:08 UTC (permalink / raw)
  To: Gautham R. Shenoy, Abel Wu
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel, srikar, aubrey.li

On 2/24/22 3:10 PM, Gautham R. Shenoy wrote:
> Hello Abel,
> 
> (+ Aubrey Li, Srikar)
> 
> On Thu, Feb 17, 2022 at 11:43:57PM +0800, Abel Wu wrote:
>> An CFS runqueue is considered overloaded when there are
>> more than one pullable non-idle tasks on it (since sched-
>> idle cpus are treated as idle cpus). And idle tasks are
>> counted towards rq->cfs.idle_h_nr_running, that is either
>> assigned SCHED_IDLE policy or placed under idle cgroups.
>>
>> The overloaded cfs rqs can cause performance issues to
>> both task types:
>>
>>   - for latency critical tasks like SCHED_NORMAL,
>>     time of waiting in the rq will increase and
>>     result in higher pct99 latency, and
>>
>>   - batch tasks may not be able to make full use
>>     of cpu capacity if sched-idle rq exists, thus
>>     presents poorer throughput.
>>
>> The mask of overloaded cpus is updated in periodic tick
>> and the idle path at the LLC domain basis. This cpumask
>> will also be used in SIS as a filter, improving idle cpu
>> searching.
> 
> This is an interesting approach to minimise the tail latencies by
> keeping track of the overloaded cpus in the LLC so that
> idle/sched-idle CPUs can pull from them. This approach contrasts with the
> following approaches that were previously tried :
> 
> 1. Maintain the idle cpumask at the LLC level by Aubrey Li
>    https://lore.kernel.org/all/1615872606-56087-1-git-send-email-aubrey.li@intel.com/
>    
> 2. Maintain the identity of the idle core itself at the LLC level, by Srikar :
>    https://lore.kernel.org/lkml/20210513074027.543926-3-srikar@linux.vnet.ibm.com/
> 
> There have been concerns in the past about having to update the shared
> mask/counter at regular intervals. Srikar, Aubrey any thoughts on this
> ?
> 
https://lkml.org/lkml/2022/2/7/1129


* Re: [RFC PATCH 0/5] introduce sched-idle balancing
  2022-02-25 13:20       ` Abel Wu
@ 2022-03-02  0:41         ` Josh Don
  0 siblings, 0 replies; 22+ messages in thread
From: Josh Don @ 2022-03-02  0:41 UTC (permalink / raw)
  To: Abel Wu
  Cc: Mel Gorman, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Ingo Molnar, Juri Lelli, Peter Zijlstra,
	Steven Rostedt, Vincent Guittot, linux-kernel

On Fri, Feb 25, 2022 at 5:36 AM Abel Wu <wuyun.abel@bytedance.com> wrote:
[snip]
> > Also place the filter first and do any measurements of any change to
> > balancing versus the filter. I'm suggesting placing the filter first
> > because it's less controversial than a new balancer. Just be aware that
> > the filter alone is not a guarantee of merging as there have been a few
> > approaches to filtering and so far all of them had downsides on either cost
>
> Yes, understood. I will adjust the patches as you suggested and send v2
> together with more tests next week.

+1 to trying the filter rather than introducing a new balance path.

We've found the sched_idle_cpu() checks in the wakeup path to be
adequate in allowing non-idle tasks to fully consume cpu resources
(but that of course relies on wakeup balancing, and not periodic
balancing).
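
Those wakeup-path checks are roughly the following (paraphrased and
simplified from select_idle_sibling() and the SIS scan):

	/* select_idle_sibling(): a sched-idle target is as good as an idle one */
	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
	    asym_fits_capacity(task_util, target))
		return target;

	/* __select_idle_cpu(): same treatment while scanning the LLC */
	if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
		return cpu;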

Please cc me on the next series.

Thanks,
Josh

