linux-kernel.vger.kernel.org archive mirror
* [RFC 0/4] sched/fair: Rebalance tasks based on affinity
@ 2016-06-20 12:15 Jiri Olsa
  2016-06-20 12:15 ` [PATCH 1/4] sched/fair: Introduce sched_entity::dont_balance Jiri Olsa
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Jiri Olsa @ 2016-06-20 12:15 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: lkml, James Hartsock, Rik van Riel, Srivatsa Vaddagiri, Kirill Tkhai

hi,
please consider this more of a request
for opinions on how to solve this issue ;-)

I'm following up on my previous post:
  http://marc.info/?t=145975837400001&r=1&w=2

The patchset is working for my testcase and our
other tests look good so far, but I'm not sure
whether I've broken something else or picked
the right approach to implement this.

The issue itself and the fix are described in
the changelog of patch 3/4.

thanks for any feedback,
jirka


---
Jiri Olsa (4):
      sched/fair: Introduce sched_entity::dont_balance
      sched/fair: Introduce idle enter/exit balance callbacks
      sched/fair: Add REBALANCE_AFFINITY rebalancing code
      sched/fair: Add schedstat debug values for REBALANCE_AFFINITY

 include/linux/sched.h   |   4 +++
 kernel/sched/debug.c    |   4 +++
 kernel/sched/fair.c     | 170 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
 kernel/sched/features.h |   1 +
 kernel/sched/idle.c     |   2 ++
 kernel/sched/sched.h    |   8 ++++++
 6 files changed, 181 insertions(+), 8 deletions(-)


* [PATCH 1/4] sched/fair: Introduce sched_entity::dont_balance
  2016-06-20 12:15 [RFC 0/4] sched/fair: Rebalance tasks based on affinity Jiri Olsa
@ 2016-06-20 12:15 ` Jiri Olsa
  2016-06-20 14:28   ` Peter Zijlstra
  2016-06-20 12:15 ` [PATCH 2/4] sched/fair: Introduce idle enter/exit balance callbacks Jiri Olsa
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Jiri Olsa @ 2016-06-20 12:15 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: lkml, James Hartsock, Rik van Riel, Srivatsa Vaddagiri, Kirill Tkhai

Adding a dont_balance bool into struct sched_entity,
to mark tasks which are rebalanced based on affinity.

It's used only when the REBALANCE_AFFINITY feature is
switched on. The functionality of this feature
is introduced in a following patch.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 include/linux/sched.h   |  2 ++
 kernel/sched/fair.c     | 21 ++++++++++++++++++---
 kernel/sched/features.h |  1 +
 3 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dee41bf59e6b..0e6ac882283b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1331,6 +1331,8 @@ struct sched_entity {
 
 	u64			nr_migrations;
 
+	bool			dont_balance;
+
 #ifdef CONFIG_SCHEDSTATS
 	struct sched_statistics statistics;
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c6dd8bab010c..f19c9435c64d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -900,6 +900,11 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	se->exec_start = rq_clock_task(rq_of(cfs_rq));
 }
 
+static bool dont_balance(struct task_struct *p)
+{
+	return sched_feat(REBALANCE_AFFINITY) && p->se.dont_balance;
+}
+
 /**************************************************
  * Scheduling class queueing methods:
  */
@@ -2172,6 +2177,10 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 	if (!static_branch_likely(&sched_numa_balancing))
 		return;
 
+	/* Don't move task if the affinity balance is active. */
+	if (dont_balance(p))
+		return;
+
 	/* for example, ksmd faulting in a user's mm */
 	if (!p->mm)
 		return;
@@ -3387,6 +3396,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	update_min_vruntime(cfs_rq);
 	update_cfs_shares(cfs_rq);
+
+	se->dont_balance = false;
 }
 
 /*
@@ -6017,13 +6028,17 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	/*
 	 * We do not migrate tasks that are:
 	 * 1) throttled_lb_pair, or
-	 * 2) cannot be migrated to this CPU due to cpus_allowed, or
-	 * 3) running (obviously), or
-	 * 4) are cache-hot on their current CPU.
+	 * 2) dont_balance is set, or
+	 * 3) cannot be migrated to this CPU due to cpus_allowed, or
+	 * 4) running (obviously), or
+	 * 5) are cache-hot on their current CPU.
 	 */
 	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 		return 0;
 
+	if (dont_balance(p))
+		return 0;
+
 	if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
 		int cpu;
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 69631fa46c2f..acdd78a6be4d 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -68,4 +68,5 @@ SCHED_FEAT(FORCE_SD_OVERLAP, false)
 SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
 SCHED_FEAT(ATTACH_AGE_LOAD, true)
+SCHED_FEAT(REBALANCE_AFFINITY, false)
 
-- 
2.4.11


* [PATCH 2/4] sched/fair: Introduce idle enter/exit balance callbacks
  2016-06-20 12:15 [RFC 0/4] sched/fair: Rebalance tasks based on affinity Jiri Olsa
  2016-06-20 12:15 ` [PATCH 1/4] sched/fair: Introduce sched_entity::dont_balance Jiri Olsa
@ 2016-06-20 12:15 ` Jiri Olsa
  2016-06-20 14:30   ` Peter Zijlstra
  2016-06-20 12:15 ` [PATCH 3/4] sched/fair: Add REBALANCE_AFFINITY rebalancing code Jiri Olsa
  2016-06-20 12:15 ` [PATCH 4/4] sched/fair: Add schedstat debug values for REBALANCE_AFFINITY Jiri Olsa
  3 siblings, 1 reply; 15+ messages in thread
From: Jiri Olsa @ 2016-06-20 12:15 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: lkml, James Hartsock, Rik van Riel, Srivatsa Vaddagiri, Kirill Tkhai

Introducing idle enter/exit balance callbacks to maintain
balance.idle_cpus_mask, a cpumask of the currently idle cpus
in the system.

It's used only when the REBALANCE_AFFINITY feature is
switched on. The functionality of this feature
is introduced in a following patch.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 kernel/sched/fair.c  | 32 ++++++++++++++++++++++++++++++++
 kernel/sched/idle.c  |  2 ++
 kernel/sched/sched.h |  3 +++
 3 files changed, 37 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f19c9435c64d..78c4127f2f3a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7802,6 +7802,37 @@ static inline int on_null_domain(struct rq *rq)
 	return unlikely(!rcu_dereference_sched(rq->sd));
 }
 
+static struct {
+	cpumask_var_t	idle_cpus_mask;
+	atomic_t	nr_cpus;
+} balance ____cacheline_aligned;
+
+void sched_idle_enter(int cpu)
+{
+	if (!sched_feat(REBALANCE_AFFINITY))
+		return;
+
+	if (!cpu_active(cpu))
+		return;
+
+	if (on_null_domain(cpu_rq(cpu)))
+		return;
+
+	cpumask_set_cpu(cpu, balance.idle_cpus_mask);
+	atomic_inc(&balance.nr_cpus);
+}
+
+void sched_idle_exit(int cpu)
+{
+	if (!sched_feat(REBALANCE_AFFINITY))
+		return;
+
+	if (likely(cpumask_test_cpu(cpu, balance.idle_cpus_mask))) {
+		cpumask_clear_cpu(cpu, balance.idle_cpus_mask);
+		atomic_dec(&balance.nr_cpus);
+	}
+}
+
 #ifdef CONFIG_NO_HZ_COMMON
 /*
  * idle load balancing details
@@ -8731,6 +8762,7 @@ __init void init_sched_fair_class(void)
 	nohz.next_balance = jiffies;
 	zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
 #endif
+	zalloc_cpumask_var(&balance.idle_cpus_mask, GFP_NOWAIT);
 #endif /* SMP */
 
 }
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index db4ff7c100b9..0cc0109eacd3 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -216,6 +216,7 @@ static void cpu_idle_loop(void)
 		__current_set_polling();
 		quiet_vmstat();
 		tick_nohz_idle_enter();
+		sched_idle_enter(cpu);
 
 		while (!need_resched()) {
 			check_pgt_cache();
@@ -256,6 +257,7 @@ static void cpu_idle_loop(void)
 		 */
 		preempt_set_need_resched();
 		tick_nohz_idle_exit();
+		sched_idle_exit(cpu);
 		__current_clr_polling();
 
 		/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 72f1f3087b04..d1a6224cd140 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1692,6 +1692,9 @@ extern void init_dl_rq(struct dl_rq *dl_rq);
 extern void cfs_bandwidth_usage_inc(void);
 extern void cfs_bandwidth_usage_dec(void);
 
+void sched_idle_enter(int cpu);
+void sched_idle_exit(int cpu);
+
 #ifdef CONFIG_NO_HZ_COMMON
 enum rq_nohz_flag_bits {
 	NOHZ_TICK_STOPPED,
-- 
2.4.11


* [PATCH 3/4] sched/fair: Add REBALANCE_AFFINITY rebalancing code
  2016-06-20 12:15 [RFC 0/4] sched/fair: Rebalance tasks based on affinity Jiri Olsa
  2016-06-20 12:15 ` [PATCH 1/4] sched/fair: Introduce sched_entity::dont_balance Jiri Olsa
  2016-06-20 12:15 ` [PATCH 2/4] sched/fair: Introduce idle enter/exit balance callbacks Jiri Olsa
@ 2016-06-20 12:15 ` Jiri Olsa
  2016-06-30 10:58   ` Peter Zijlstra
  2016-06-20 12:15 ` [PATCH 4/4] sched/fair: Add schedstat debug values for REBALANCE_AFFINITY Jiri Olsa
  3 siblings, 1 reply; 15+ messages in thread
From: Jiri Olsa @ 2016-06-20 12:15 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: lkml, James Hartsock, Rik van Riel, Srivatsa Vaddagiri, Kirill Tkhai

Adding a rebalance_affinity function that places tasks
based on their cpus_allowed, with the following logic.

Current load balancing places tasks on runqueues based
on their weight to achieve balance within sched domains.

Sched domains are defined at the start and can't be changed
during runtime. If a user defines workload affinity settings
unevenly with sched domains, he could get an unbalanced state
within his affinity group, like:

Say we have the following sched domains:
  domain 0: (pairs)
  domain 1: 0-5,12-17 (group1)  6-11,18-23 (group2)
  domain 2: 0-23 level NUMA

The user runs a workload with an affinity setup that takes
one CPU from group1 (0) and the rest from group2:
    0,6,7,8,9,10,11,18,19,20,21,22

The user will see idle CPUs within his affinity group,
because the load balancer will balance tasks based on load
within group1 and group2, thus placing an equal load
of tasks on CPU 0 and on the rest of the CPUs.
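
For illustration only, the affinity setup above corresponds roughly
to the userspace sketch below (the CPU numbers are just the ones from
this example topology, nothing universal):

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          /* one CPU from group1 (0), the rest from group2 */
          int cpus[] = { 0, 6, 7, 8, 9, 10, 11, 18, 19, 20, 21, 22 };
          cpu_set_t set;
          unsigned int i;

          CPU_ZERO(&set);
          for (i = 0; i < sizeof(cpus) / sizeof(cpus[0]); i++)
                  CPU_SET(cpus[i], &set);

          /* pin this task (and any children it forks) to the skewed mask */
          if (sched_setaffinity(0, sizeof(set), &set)) {
                  perror("sched_setaffinity");
                  return 1;
          }

          /* ... start the CPU-bound workers here ... */
          pause();
          return 0;
  }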

The rebalance_affinity function detects the above setup
and tries to place tasks with cpus_allowed set on idle
CPUs within their allowed mask, if there are any.

Once such a task is re-balanced, the load balancer is not
allowed to touch it (balance it) until it's reattached
to a runqueue.

This functionality is in place only if the REBALANCE_AFFINITY
feature is enabled.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 kernel/sched/fair.c | 104 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 99 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 78c4127f2f3a..736e525e189c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6100,16 +6100,22 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	return 0;
 }
 
+static void __detach_task(struct task_struct *p,
+			  struct rq *src_rq, int dst_cpu)
+{
+	lockdep_assert_held(&src_rq->lock);
+
+	p->on_rq = TASK_ON_RQ_MIGRATING;
+	deactivate_task(src_rq, p, 0);
+	set_task_cpu(p, dst_cpu);
+}
+
 /*
  * detach_task() -- detach the task for the migration specified in env
  */
 static void detach_task(struct task_struct *p, struct lb_env *env)
 {
-	lockdep_assert_held(&env->src_rq->lock);
-
-	p->on_rq = TASK_ON_RQ_MIGRATING;
-	deactivate_task(env->src_rq, p, 0);
-	set_task_cpu(p, env->dst_cpu);
+	__detach_task(p, env->src_rq, env->dst_cpu);
 }
 
 /*
@@ -7833,6 +7839,91 @@ void sched_idle_exit(int cpu)
 	}
 }
 
+static bool has_affinity_set(struct task_struct *p, cpumask_var_t mask)
+{
+	if (!cpumask_and(mask, tsk_cpus_allowed(p), cpu_active_mask))
+		return false;
+
+	cpumask_xor(mask, mask, cpu_active_mask);
+	return !cpumask_empty(mask);
+}
+
+static void rebalance_affinity(struct rq *rq)
+{
+	struct task_struct *p;
+	unsigned long flags;
+	cpumask_var_t mask;
+	bool mask_alloc = false;
+
+	/*
+	 * No need to bother if:
+	 * - there's only 1 task on the queue
+	 * - there's no idle cpu at the moment.
+	 */
+	if (rq->nr_running <= 1)
+		return;
+
+	if (!atomic_read(&balance.nr_cpus))
+		return;
+
+	raw_spin_lock_irqsave(&rq->lock, flags);
+
+	list_for_each_entry(p, &rq->cfs_tasks, se.group_node) {
+		struct rq *dst_rq;
+		int cpu;
+
+		/*
+		 * Force affinity balance only if:
+		 * - task is not current one
+		 * - task is not already balanced (p->se.dont_balance is not set)
+		 * - task has cpus_allowed set
+		 * - we have idle cpu ready within task's cpus_allowed
+		 */
+		if (task_running(rq, p))
+			continue;
+
+		if (p->se.dont_balance)
+			continue;
+
+		if (!mask_alloc) {
+			int ret = zalloc_cpumask_var(&mask, GFP_KERNEL);
+
+			if (WARN_ON_ONCE(!ret))
+				return;
+			mask_alloc = true;
+		}
+
+		if (!has_affinity_set(p, mask))
+			continue;
+
+		if (!cpumask_and(mask, tsk_cpus_allowed(p), balance.idle_cpus_mask))
+			continue;
+
+		cpu = cpumask_any_but(mask, task_cpu(p));
+		if (cpu >= nr_cpu_ids)
+			continue;
+
+		__detach_task(p, rq, cpu);
+		raw_spin_unlock(&rq->lock);
+
+		dst_rq = cpu_rq(cpu);
+
+		raw_spin_lock(&dst_rq->lock);
+		attach_task(dst_rq, p);
+		p->se.dont_balance = true;
+		raw_spin_unlock(&dst_rq->lock);
+
+		local_irq_restore(flags);
+		free_cpumask_var(mask);
+		return;
+	}
+
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
+
+	if (mask_alloc)
+		free_cpumask_var(mask);
+}
+
 #ifdef CONFIG_NO_HZ_COMMON
 /*
  * idle load balancing details
@@ -8077,6 +8168,9 @@ out:
 			nohz.next_balance = rq->next_balance;
 #endif
 	}
+
+	if (sched_feat(REBALANCE_AFFINITY))
+		rebalance_affinity(rq);
 }
 
 #ifdef CONFIG_NO_HZ_COMMON
-- 
2.4.11


* [PATCH 4/4] sched/fair: Add schedstat debug values for REBALANCE_AFFINITY
  2016-06-20 12:15 [RFC 0/4] sched/fair: Rebalance tasks based on affinity Jiri Olsa
                   ` (2 preceding siblings ...)
  2016-06-20 12:15 ` [PATCH 3/4] sched/fair: Add REBALANCE_AFFINITY rebalancing code Jiri Olsa
@ 2016-06-20 12:15 ` Jiri Olsa
  3 siblings, 0 replies; 15+ messages in thread
From: Jiri Olsa @ 2016-06-20 12:15 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: lkml, James Hartsock, Rik van Riel, Srivatsa Vaddagiri, Kirill Tkhai

It was helpful to watch several stats when testing
this feature. Adding the most useful ones.
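
For reference, with CONFIG_SCHED_DEBUG and CONFIG_SCHEDSTATS enabled
the per-task counter shows up in /proc/<pid>/sched and the rq-level
counters in /proc/sched_debug; a minimal sketch for reading the
per-task one back (the command-line handling is just illustrative):

  #include <stdio.h>
  #include <string.h>

  int main(int argc, char **argv)
  {
          char path[64], line[256];
          FILE *f;

          /* default to the calling task if no pid is given */
          snprintf(path, sizeof(path), "/proc/%s/sched",
                   argc > 1 ? argv[1] : "self");

          f = fopen(path, "r");
          if (!f) {
                  perror(path);
                  return 1;
          }

          while (fgets(line, sizeof(line), f)) {
                  if (strstr(line, "nr_dont_balance"))
                          fputs(line, stdout);
          }

          fclose(f);
          return 0;
  }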

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 include/linux/sched.h |  2 ++
 kernel/sched/debug.c  |  4 ++++
 kernel/sched/fair.c   | 15 ++++++++++++++-
 kernel/sched/sched.h  |  5 +++++
 4 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0e6ac882283b..4b772820436b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1305,6 +1305,8 @@ struct sched_statistics {
 	u64			nr_failed_migrations_running;
 	u64			nr_failed_migrations_hot;
 	u64			nr_forced_migrations;
+	u64			nr_dont_balance;
+	u64			nr_balanced_affinity;
 
 	u64			nr_wakeups;
 	u64			nr_wakeups_sync;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2a0a9995256d..5947558dab65 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -635,6 +635,9 @@ do {									\
 		P(sched_goidle);
 		P(ttwu_count);
 		P(ttwu_local);
+		P(nr_dont_balance);
+		P(nr_affinity_out);
+		P(nr_affinity_in);
 	}
 
 #undef P
@@ -912,6 +915,7 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 		P(se.statistics.nr_wakeups_affine_attempts);
 		P(se.statistics.nr_wakeups_passive);
 		P(se.statistics.nr_wakeups_idle);
+		P(se.statistics.nr_dont_balance);
 
 		avg_atom = p->se.sum_exec_runtime;
 		if (nr_switches)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 736e525e189c..a4f1ed403f1e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -902,7 +902,15 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 static bool dont_balance(struct task_struct *p)
 {
-	return sched_feat(REBALANCE_AFFINITY) && p->se.dont_balance;
+	bool dont_balance = false;
+
+	if (sched_feat(REBALANCE_AFFINITY) && p->se.dont_balance) {
+		dont_balance = true;
+		schedstat_inc(task_rq(p), nr_dont_balance);
+		schedstat_inc(p, se.statistics.nr_dont_balance);
+	}
+
+	return dont_balance;
 }
 
 /**************************************************
@@ -7903,6 +7911,7 @@ static void rebalance_affinity(struct rq *rq)
 		if (cpu >= nr_cpu_ids)
 			continue;
 
+		schedstat_inc(rq, nr_affinity_out);
 		__detach_task(p, rq, cpu);
 		raw_spin_unlock(&rq->lock);
 
@@ -7911,6 +7920,10 @@ static void rebalance_affinity(struct rq *rq)
 		raw_spin_lock(&dst_rq->lock);
 		attach_task(dst_rq, p);
 		p->se.dont_balance = true;
+
+		schedstat_inc(p, se.statistics.nr_balanced_affinity);
+		schedstat_inc(dst_rq, nr_affinity_in);
+
 		raw_spin_unlock(&dst_rq->lock);
 
 		local_irq_restore(flags);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d1a6224cd140..f086e233f7e6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -701,6 +701,11 @@ struct rq {
 	/* try_to_wake_up() stats */
 	unsigned int ttwu_count;
 	unsigned int ttwu_local;
+
+	/* rebalance_affinity() stats */
+	unsigned int nr_dont_balance;
+	unsigned int nr_affinity_out;
+	unsigned int nr_affinity_in;
 #endif
 
 #ifdef CONFIG_SMP
-- 
2.4.11


* Re: [PATCH 1/4] sched/fair: Introduce sched_entity::dont_balance
  2016-06-20 12:15 ` [PATCH 1/4] sched/fair: Introduce sched_entity::dont_balance Jiri Olsa
@ 2016-06-20 14:28   ` Peter Zijlstra
  2016-06-20 15:59     ` Jiri Olsa
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2016-06-20 14:28 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Ingo Molnar, lkml, James Hartsock, Rik van Riel,
	Srivatsa Vaddagiri, Kirill Tkhai

On Mon, Jun 20, 2016 at 02:15:11PM +0200, Jiri Olsa wrote:
> Adding a dont_balance bool into struct sched_entity,
> to mark tasks which are rebalanced based on affinity.
> 
> It's used only when the REBALANCE_AFFINITY feature is
> switched on. The functionality of this feature
> is introduced in a following patch.
> 
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  include/linux/sched.h   |  2 ++
>  kernel/sched/fair.c     | 21 ++++++++++++++++++---
>  kernel/sched/features.h |  1 +
>  3 files changed, 21 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index dee41bf59e6b..0e6ac882283b 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1331,6 +1331,8 @@ struct sched_entity {
>  
>  	u64			nr_migrations;
>  
> +	bool			dont_balance;

Never use bool in structures.


* Re: [PATCH 2/4] sched/fair: Introduce idle enter/exit balance callbacks
  2016-06-20 12:15 ` [PATCH 2/4] sched/fair: Introduce idle enter/exit balance callbacks Jiri Olsa
@ 2016-06-20 14:30   ` Peter Zijlstra
  2016-06-20 16:27     ` Jiri Olsa
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2016-06-20 14:30 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Ingo Molnar, lkml, James Hartsock, Rik van Riel,
	Srivatsa Vaddagiri, Kirill Tkhai

On Mon, Jun 20, 2016 at 02:15:12PM +0200, Jiri Olsa wrote:
> Introducing idle enter/exit balance callbacks to maintain
> balance.idle_cpus_mask, a cpumask of the currently idle cpus
> in the system.
> 
> It's used only when the REBALANCE_AFFINITY feature is
> switched on. The functionality of this feature
> is introduced in a following patch.
> 
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  kernel/sched/fair.c  | 32 ++++++++++++++++++++++++++++++++
>  kernel/sched/idle.c  |  2 ++
>  kernel/sched/sched.h |  3 +++
>  3 files changed, 37 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f19c9435c64d..78c4127f2f3a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7802,6 +7802,37 @@ static inline int on_null_domain(struct rq *rq)
>  	return unlikely(!rcu_dereference_sched(rq->sd));
>  }
>  
> +static struct {
> +	cpumask_var_t	idle_cpus_mask;
> +	atomic_t	nr_cpus;
> +} balance ____cacheline_aligned;

How is this different from the nohz idle cpu mask?


* Re: [PATCH 1/4] sched/fair: Introduce sched_entity::dont_balance
  2016-06-20 14:28   ` Peter Zijlstra
@ 2016-06-20 15:59     ` Jiri Olsa
  2016-06-21  8:09       ` Peter Zijlstra
  0 siblings, 1 reply; 15+ messages in thread
From: Jiri Olsa @ 2016-06-20 15:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Ingo Molnar, lkml, James Hartsock, Rik van Riel,
	Srivatsa Vaddagiri, Kirill Tkhai

On Mon, Jun 20, 2016 at 04:28:23PM +0200, Peter Zijlstra wrote:
> On Mon, Jun 20, 2016 at 02:15:11PM +0200, Jiri Olsa wrote:
> > Adding a dont_balance bool into struct sched_entity,
> > to mark tasks which are rebalanced based on affinity.
> > 
> > It's used only when the REBALANCE_AFFINITY feature is
> > switched on. The functionality of this feature
> > is introduced in a following patch.
> > 
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> >  include/linux/sched.h   |  2 ++
> >  kernel/sched/fair.c     | 21 ++++++++++++++++++---
> >  kernel/sched/features.h |  1 +
> >  3 files changed, 21 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index dee41bf59e6b..0e6ac882283b 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1331,6 +1331,8 @@ struct sched_entity {
> >  
> >  	u64			nr_migrations;
> >  
> > +	bool			dont_balance;
> 
> Never use bool in structures.

ok, but I couldn't find anything real sinister about that..
so why is that? ;-)

thanks,
jirka


* Re: [PATCH 2/4] sched/fair: Introduce idle enter/exit balance callbacks
  2016-06-20 14:30   ` Peter Zijlstra
@ 2016-06-20 16:27     ` Jiri Olsa
  2016-06-21  8:12       ` Peter Zijlstra
  0 siblings, 1 reply; 15+ messages in thread
From: Jiri Olsa @ 2016-06-20 16:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Ingo Molnar, lkml, James Hartsock, Rik van Riel,
	Srivatsa Vaddagiri, Kirill Tkhai, Frederic Weisbecker

On Mon, Jun 20, 2016 at 04:30:03PM +0200, Peter Zijlstra wrote:
> On Mon, Jun 20, 2016 at 02:15:12PM +0200, Jiri Olsa wrote:
> > Introducing idle enter/exit balance callbacks to maintain
> > balance.idle_cpus_mask, a cpumask of the currently idle cpus
> > in the system.
> > 
> > It's used only when the REBALANCE_AFFINITY feature is
> > switched on. The functionality of this feature
> > is introduced in a following patch.
> > 
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> >  kernel/sched/fair.c  | 32 ++++++++++++++++++++++++++++++++
> >  kernel/sched/idle.c  |  2 ++
> >  kernel/sched/sched.h |  3 +++
> >  3 files changed, 37 insertions(+)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index f19c9435c64d..78c4127f2f3a 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7802,6 +7802,37 @@ static inline int on_null_domain(struct rq *rq)
> >  	return unlikely(!rcu_dereference_sched(rq->sd));
> >  }
> >  
> > +static struct {
> > +	cpumask_var_t	idle_cpus_mask;
> > +	atomic_t	nr_cpus;
> > +} balance ____cacheline_aligned;
> 
> How is this different from the nohz idle cpu mask?

well, the nohz idle cpu mask is deep in the nohz code:

tick_irq_exit
  if ((idle_cpu(cpu) && !need_resched()) || tick_nohz_full_cpu(cpu)) {
    if (!in_interrupt())
      tick_nohz_irq_exit
        if (ts->inidle) {
          __tick_nohz_idle_enter(ts)
            tick_nohz_idle_enter
             __tick_nohz_idle_enter
               if (can_stop_idle_tick(cpu, ts)) {
                 tick_nohz_stop_sched_tick
                   if (ts->tick_stopped)
                     nohz_balance_enter_idle(cpu) {
                       set idle mask for cpu
                     }
        } else {
                tick_nohz_full_update_tick(ts);
                 if (can_stop_full_tick(ts))
                    tick_nohz_stop_sched_tick
                      if (ts->tick_stopped)
                        nohz_balance_enter_idle(cpu) {
                          set idle mask for cpu
                        }
	}

cpu_idle_loop
  tick_nohz_idle_enter
     __tick_nohz_idle_enter(ts)
       tick_nohz_idle_enter
        __tick_nohz_idle_enter
          if (can_stop_idle_tick(cpu, ts)) {
            tick_nohz_stop_sched_tick
              if (ts->tick_stopped)
                nohz_balance_enter_idle(cpu) {
                  set idle mask for cpu
                }


sched_cpu_dying
  nohz_balance_exit_idle(cpu) {
    unset idle mask for cpu
  }

trigger_load_balance
  nohz_kick_needed
    nohz_balance_exit_idle(cpu) {
      unset idle mask for cpu
    }

... I might have missed some of the paths


so it might not be that easy to switch it to use the simple
change I added as part of this RFC. AFAIU it should be
the same idle mask, but this approach might be too naive
and miss some idle enter/exit paths.. CC-ing Frederic

thanks,
jirka


* Re: [PATCH 1/4] sched/fair: Introduce sched_entity::dont_balance
  2016-06-20 15:59     ` Jiri Olsa
@ 2016-06-21  8:09       ` Peter Zijlstra
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2016-06-21  8:09 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Jiri Olsa, Ingo Molnar, lkml, James Hartsock, Rik van Riel,
	Srivatsa Vaddagiri, Kirill Tkhai

On Mon, Jun 20, 2016 at 05:59:29PM +0200, Jiri Olsa wrote:
> On Mon, Jun 20, 2016 at 04:28:23PM +0200, Peter Zijlstra wrote:

> > Never use bool in structures.
> 
> ok, but I couldn't find anything real sinister about that..
> so why is that? ;-)

C does not define sizeof(bool) :-)

Obviously any implementation must, but you then get implementation
defined layout differences in your structures, which is crap.
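
A toy illustration (nothing below is the real sched_entity layout;
the point is just that both printed values are implementation
defined, so structure layout can silently differ between
compilers/ABIs):

  #include <stdio.h>
  #include <stdbool.h>

  struct toy {
          long nr_migrations;
          bool dont_balance;
          long vruntime;
  };

  int main(void)
  {
          printf("sizeof(bool)       = %zu\n", sizeof(bool));
          printf("sizeof(struct toy) = %zu\n", sizeof(struct toy));
          return 0;
  }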


* Re: [PATCH 2/4] sched/fair: Introduce idle enter/exit balance callbacks
  2016-06-20 16:27     ` Jiri Olsa
@ 2016-06-21  8:12       ` Peter Zijlstra
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2016-06-21  8:12 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Jiri Olsa, Ingo Molnar, lkml, James Hartsock, Rik van Riel,
	Srivatsa Vaddagiri, Kirill Tkhai, Frederic Weisbecker

On Mon, Jun 20, 2016 at 06:27:41PM +0200, Jiri Olsa wrote:
> so it might not be that easy to switch it to use the simple
> change I added as part of this RFC. AFAIU it should be
> the same idle mask, but this approach might be too naive
> and miss some idle enter/exit paths.. CC-ing Frederic

Right, for an RFC this will do; but I really want to avoid adding another
global atomic bitmask on the idle path.

In any case, I'll go look over the actual patches sometime soon; I still
have a few other patch series to stare at first.


* Re: [PATCH 3/4] sched/fair: Add REBALANCE_AFFINITY rebalancing code
  2016-06-20 12:15 ` [PATCH 3/4] sched/fair: Add REBALANCE_AFFINITY rebalancing code Jiri Olsa
@ 2016-06-30 10:58   ` Peter Zijlstra
  2016-07-01  7:35     ` Jiri Olsa
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2016-06-30 10:58 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Ingo Molnar, lkml, James Hartsock, Rik van Riel,
	Srivatsa Vaddagiri, Kirill Tkhai

On Mon, Jun 20, 2016 at 02:15:13PM +0200, Jiri Olsa wrote:
> Sched domains are defined at the start and can't be changed
> during runtime.

Not entirely true; you can change them using cpusets, although there are
strict constraints on how and you can only make the 'problem' worse.

> If a user defines workload affinity settings
> unevenly with sched domains, he could get an unbalanced state
> within his affinity group, like:
> 
> Say we have the following sched domains:
>   domain 0: (pairs)

s/pairs/smt siblings/

>   domain 1: 0-5,12-17 (group1)  6-11,18-23 (group2)

this would typically be cache groups

>   domain 2: 0-23 level NUMA
> 
> The user runs a workload with an affinity setup that takes
> one CPU from group1 (0) and the rest from group2:
>     0,6,7,8,9,10,11,18,19,20,21,22

But who would do something like that?

I'm really missing a problem statement here. Who cares and why?

sched_setaffinity() is an interface that says I know what I'm doing, and
you seem to be solving a problem resulting from not actually knowing wth
you're doing.

I'm not saying we shouldn't look into it, but I really want more
justification for this.

> The user will see idle CPUs within his affinity group,
> because the load balancer will balance tasks based on load
> within group1 and group2, thus placing an equal load
> of tasks on CPU 0 and on the rest of the CPUs.

So afaict this thing only cares about idleness, and we should be able to
fix that differently. The real problem is maintaining fairness in the
overloaded case under such silly constraints.

So why do you only care about this specific issue.


* Re: [PATCH 3/4] sched/fair: Add REBALANCE_AFFINITY rebalancing code
  2016-06-30 10:58   ` Peter Zijlstra
@ 2016-07-01  7:35     ` Jiri Olsa
  2016-07-01  8:24       ` Peter Zijlstra
  0 siblings, 1 reply; 15+ messages in thread
From: Jiri Olsa @ 2016-07-01  7:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Ingo Molnar, lkml, James Hartsock, Rik van Riel,
	Srivatsa Vaddagiri, Kirill Tkhai

On Thu, Jun 30, 2016 at 12:58:05PM +0200, Peter Zijlstra wrote:
> On Mon, Jun 20, 2016 at 02:15:13PM +0200, Jiri Olsa wrote:
> > Sched domains are defined at the start and can't be changed
> > during runtime.
> 
> Not entirely true; you can change them using cpusets, although there are
> strict constraints on how and you can only make the 'problem' worse.
> 
> > If a user defines workload affinity settings
> > unevenly with sched domains, he could get an unbalanced state
> > within his affinity group, like:
> > 
> > Say we have the following sched domains:
> >   domain 0: (pairs)
> 
> s/pairs/smt siblings/
> 
> >   domain 1: 0-5,12-17 (group1)  6-11,18-23 (group2)
> 
> this would typically be cache groups
> 
> >   domain 2: 0-23 level NUMA
> > 
> > The user runs a workload with an affinity setup that takes
> > one CPU from group1 (0) and the rest from group2:
> >     0,6,7,8,9,10,11,18,19,20,21,22
> 
> But who would do something like that?
> 
> I'm really missing a problem statement here. Who cares and why?
> 
> sched_setaffinity() is an interface that says I know what I'm doing, and
> you seem to be solving a problem resulting from not actually knowing wth
> you're doing.
> 
> I'm not saying we shouldn't look into it, but I really want more
> justification for this.
> 
> > The user will see idle CPUs within his affinity group,
> > because the load balancer will balance tasks based on load
> > within group1 and group2, thus placing an equal load
> > of tasks on CPU 0 and on the rest of the CPUs.
> 
> So afaict this thing only cares about idleness, and we should be able to
> fix that differently. The real problem is maintaining fairness in the
> overloaded case under such silly constraints.
> 
> So why do you only care about this specific issue.

well this is an issue our partner met in the setup,
and I'm not sure what was their motivation for that,
perhaps James could clarify here..

I tried to make the 'scratch that itch' solution as
mentioned in the earlier discussion.

So IIUIC what you say is that it needs to be a more generic
solution..?  I'll continue staring at it then ;-)

thanks,
jirka


* Re: [PATCH 3/4] sched/fair: Add REBALANCE_AFFINITY rebalancing code
  2016-07-01  7:35     ` Jiri Olsa
@ 2016-07-01  8:24       ` Peter Zijlstra
       [not found]         ` <CAM1qU8Ms2ZO5fnYRVX51uiBC5pZX6Hpa2W3Wqj5NjnYp3t8kOA@mail.gmail.com>
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2016-07-01  8:24 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Jiri Olsa, Ingo Molnar, lkml, James Hartsock, Rik van Riel,
	Srivatsa Vaddagiri, Kirill Tkhai

On Fri, Jul 01, 2016 at 09:35:46AM +0200, Jiri Olsa wrote:
> well this is an issue our partner met in the setup,
> and I'm not sure what was their motivation for that,
> perhaps James could clarify here..
> 
> I tried to make the 'scratch that itch' solution as
> mentioned in the earlier discussion.
> 
> So IIUIC what you say is that it needs to be a more generic
> solution..?  I'll continue staring at it then ;-)

I just want to know what problem we're trying to solve..

Because it appears this is running 1 task each on a 'weird' subset of
cpus and things going badly. If this really is the case, then teaching
active balance to only move tasks to idle cpus or something should also
cure things.

Also, I'm curious why people set such weird masks.


* Re: [PATCH 3/4] sched/fair: Add REBALANCE_AFFINITY rebalancing code
       [not found]         ` <CAM1qU8Ms2ZO5fnYRVX51uiBC5pZX6Hpa2W3Wqj5NjnYp3t8kOA@mail.gmail.com>
@ 2016-07-01 14:58           ` Peter Zijlstra
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2016-07-01 14:58 UTC (permalink / raw)
  To: James Hartsock
  Cc: Jiri Olsa, Jiri Olsa, Ingo Molnar, lkml, Rik van Riel,
	Srivatsa Vaddagiri, Kirill Tkhai

On Fri, Jul 01, 2016 at 09:15:55AM -0500, James Hartsock wrote:
> On Fri, Jul 1, 2016 at 3:24 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Fri, Jul 01, 2016 at 09:35:46AM +0200, Jiri Olsa wrote:
> > > well this is an issue our partner met in the setup,
> > > and I'm not sure what was their motivation for that,
> > > perhaps James could clarify here..
> > >
> > > I tried to make the 'scratch that itch' solution as
> > > mentioned in the earlier discussion.
> > >
> > > So IIUIC what you say is that it needs to be a more generic
> > > solution..?  I'll continue staring at it then ;-)
> >
> > I just want to know what problem we're trying to solve..
> >
> > Because it appears this is running 1 task each on a 'weird' subset of
> > cpus and things going badly. If this really is the case, then teaching
> > active balance to only move tasks to idle cpus or something should also
> > cure things.
> >
> > Also, I'm curious why people set such weird masks.
> >
> 
> I think the original issue was reported/seen when using a straight range
> of CPUs, but in that range they crossed numa nodes.  Then, not knowing
> what was triggering the issue and trying to reproduce it, we started
> getting some crazy masks.
> 
> The work-around the customer has been using is SCHED_RR, as it doesn't
> have this balance-across-NUMA issue.  But it is also the fact that
> SCHED_RR doesn't have this issue that makes it "seem" like a defect in
> SCHED_OTHER.  I have shared with the customer that this is triggered by
> the across-numa-node taskset, so they are aware of that.  But if this is
> something that is seen as a limitation of SCHED_OTHER and not reasonable
> to be addressed upstream, I think it is at least something we should get
> documented.

But what exact usecase? A single task per cpu, or something else?

Note that RR has different constraints than OTHER, but in both cases
having skewed masks across a topology divide is unlikely to be good for
performance.

Esp. in the extreme case reported here, where one task is on an entirely
different node than the rest of them, that task will cause cacheline
transfers between the nodes slowing down itself as well as all the other
tasks that have to pull it back in.

