linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
@ 2013-01-29  9:08 Michael Wang
  2013-01-29  9:09 ` [RFC PATCH v3 1/3] sched: schedule balance map foundation Michael Wang
                   ` (4 more replies)
  0 siblings, 5 replies; 55+ messages in thread
From: Michael Wang @ 2013-01-29  9:08 UTC (permalink / raw)
  To: LKML, Ingo Molnar, Peter Zijlstra
  Cc: Paul Turner, Mike Galbraith, Andrew Morton, alex.shi, Ram Pai,
	Nikunj A. Dadhania, Namhyung Kim

v3 change log:
	Fix small logic issues (Thanks to Mike Galbraith).
	Change the way of handling WAKE.

This patch set simplifies select_task_rq_fair() with a schedule
balance map.

After getting rid of the complex code and reorganizing the logic,
pgbench shows an improvement; the more clients, the bigger the
improvement.

				Prev:		Post:

	| db_size | clients |	|  tps  |	|  tps  |
        +---------+---------+   +-------+       +-------+
        | 22 MB   |       1 |   | 10788 |       | 10881 |
        | 22 MB   |       2 |   | 21617 |       | 21837 |
        | 22 MB   |       4 |   | 41597 |       | 42645 |
        | 22 MB   |       8 |   | 54622 |       | 57808 |
        | 22 MB   |      12 |   | 50753 |       | 54527 |
        | 22 MB   |      16 |   | 50433 |       | 56368 |	+11.77%
        | 22 MB   |      24 |   | 46725 |       | 54319 |	+16.25%
        | 22 MB   |      32 |   | 43498 |       | 54650 |	+25.64%
        | 7484 MB |       1 |   |  7894 |       |  8301 |
        | 7484 MB |       2 |   | 19477 |       | 19622 |
        | 7484 MB |       4 |   | 36458 |       | 38242 |
        | 7484 MB |       8 |   | 48423 |       | 50796 |
        | 7484 MB |      12 |   | 46042 |       | 49938 |
        | 7484 MB |      16 |   | 46274 |       | 50507 |	+9.15%
        | 7484 MB |      24 |   | 42583 |       | 49175 |	+15.48%
        | 7484 MB |      32 |   | 36413 |       | 49148 |	+34.97%
        | 15 GB   |       1 |   |  7742 |       |  7876 |
        | 15 GB   |       2 |   | 19339 |       | 19531 |
        | 15 GB   |       4 |   | 36072 |       | 37389 |
        | 15 GB   |       8 |   | 48549 |       | 50570 |
        | 15 GB   |      12 |   | 45716 |       | 49542 |
        | 15 GB   |      16 |   | 46127 |       | 49647 |	+7.63%
        | 15 GB   |      24 |   | 42539 |       | 48639 |	+14.34%
        | 15 GB   |      32 |   | 36038 |       | 48560 |	+34.75%

Please check the patches for more details about the schedule balance map.

The NUMA domain is supported but not well tested.
The rebuild of domains is supported but not tested.

Comments are very welcome.

Behind the v3:
	Some changes have been applied to the way of handling WAKE.

	They all revolve around one question: should we do load balance
	for WAKE or not?

	In the old world, the only chance to do load balance for WAKE was
	when prev cpu and curr cpu are not cache affine, but that doesn't
	make sense.

	I suppose the real meaning behind that logic is: do balance only
	if the cache benefits nothing after changing cpu.

	However, select_idle_sibling() is not only designed for the
	purpose of taking care of cache; it also benefits latency, and
	costs less than the balance path.

	Besides, it's impossible to estimate the benefit of doing load
	balance at that point in time.

	And that's how the v3 came out: no load balance for WAKE.
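
	In rough pseudo code, the v3 wake path looks like this (a sketch
	only; the real code is in patch 3/3 below):

		new_cpu = select_idle_sibling(p, prev_cpu)
		if idle_cpu(new_cpu)
			return new_cpu		/* idle cpu near prev_cpu */

		new_cpu = select_idle_sibling(p, curr_cpu)
		if idle_cpu(new_cpu) && wake_affine()
			return new_cpu		/* idle cpu near curr cpu */

		return prev_cpu			/* no balance for WAKE */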

Tested with:
	12 cpu X86 server and linux-next 3.8.0-rc3.

Michael Wang (3):
	[RFC PATCH v3 1/3] sched: schedule balance map foundation
	[RFC PATCH v3 2/3] sched: build schedule balance map
	[RFC PATCH v3 3/3] sched: simplify select_task_rq_fair() with schedule balance map

Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
---
 b/kernel/sched/core.c  |   44 +++++++++++++++
 b/kernel/sched/fair.c  |  135 ++++++++++++++++++++++++++-----------------------
 b/kernel/sched/sched.h |   14 +++++
 kernel/sched/core.c    |   67 ++++++++++++++++++++++++
 4 files changed, 199 insertions(+), 61 deletions(-)


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [RFC PATCH v3 1/3] sched: schedule balance map foundation
  2013-01-29  9:08 [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair() Michael Wang
@ 2013-01-29  9:09 ` Michael Wang
  2013-02-20 13:21   ` Peter Zijlstra
  2013-02-20 13:25   ` Peter Zijlstra
  2013-01-29  9:09 ` [RFC PATCH v3 2/3] sched: build schedule balance map Michael Wang
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 55+ messages in thread
From: Michael Wang @ 2013-01-29  9:09 UTC (permalink / raw)
  To: LKML, Ingo Molnar, Peter Zijlstra
  Cc: Paul Turner, Mike Galbraith, Andrew Morton, alex.shi, Ram Pai,
	Nikunj A. Dadhania, Namhyung Kim

In order to get rid of the complex code in select_task_rq_fair(),
an approach to directly get the sd on each level with the proper
flag is required.

The schedule balance map is the solution; it records the sd
according to its flag and level.

For example, cpu_sbm->sd[wake][l] will locate the sd of the cpu
which supports wake up balancing on level l.

This patch contains the foundation of the schedule balance map,
in order to serve the follow-up patches.
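
For illustration, a lookup then looks like this (just a sketch; the
real usage appears in patch 3/3):

	struct sched_balance_map *sbm = cpu_rq(cpu)->sbm;
	/* the sd of 'cpu' with SD_BALANCE_WAKE set, on level 'l' */
	struct sched_domain *sd = sbm->sd[SBM_WAKE_TYPE][l];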

Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
---
 kernel/sched/core.c  |   44 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |   14 ++++++++++++++
 2 files changed, 58 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 257002c..092c801 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5575,6 +5575,9 @@ static void update_top_cache_domain(int cpu)
 	per_cpu(sd_llc_id, cpu) = id;
 }
 
+static int sbm_max_level;
+DEFINE_PER_CPU_SHARED_ALIGNED(struct sched_balance_map, sbm_array);
+
 /*
  * Attach the domain 'sd' to 'cpu' as its base domain. Callers must
  * hold the hotplug lock.
@@ -6037,6 +6040,46 @@ static struct sched_domain_topology_level default_topology[] = {
 
 static struct sched_domain_topology_level *sched_domain_topology = default_topology;
 
+static void sched_init_sbm(void)
+{
+	size_t size;
+	int cpu, type, node;
+	struct sched_balance_map *sbm;
+	struct sched_domain_topology_level *tl;
+
+	/*
+	 * Inelegant method, any good idea?
+	 */
+	for (tl = sched_domain_topology; tl->init; tl++, sbm_max_level++)
+		;
+
+	for_each_possible_cpu(cpu) {
+		sbm = &per_cpu(sbm_array, cpu);
+		node = cpu_to_node(cpu);
+		size = sizeof(struct sched_domain *) * sbm_max_level;
+
+		for (type = 0; type < SBM_MAX_TYPE; type++) {
+			sbm->sd[type] = kmalloc_node(size, GFP_KERNEL, node);
+			WARN_ON(!sbm->sd[type]);
+			if (!sbm->sd[type])
+				goto failed;
+		}
+	}
+
+	return;
+
+failed:
+	for_each_possible_cpu(cpu) {
+		sbm = &per_cpu(sbm_array, cpu);
+
+		for (type = 0; type < SBM_MAX_TYPE; type++)
+			kfree(sbm->sd[type]);
+	}
+
+	/* prevent further work */
+	sbm_max_level = 0;
+}
+
 #ifdef CONFIG_NUMA
 
 static int sched_domains_numa_levels;
@@ -6765,6 +6808,7 @@ void __init sched_init_smp(void)
 	alloc_cpumask_var(&fallback_doms, GFP_KERNEL);
 
 	sched_init_numa();
+	sched_init_sbm();
 
 	get_online_cpus();
 	mutex_lock(&sched_domains_mutex);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fc88644..d060913 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -349,6 +349,19 @@ struct root_domain {
 
 extern struct root_domain def_root_domain;
 
+enum {
+	SBM_EXEC_TYPE,
+	SBM_FORK_TYPE,
+	SBM_WAKE_TYPE,
+	SBM_MAX_TYPE
+};
+
+struct sched_balance_map {
+	struct sched_domain **sd[SBM_MAX_TYPE];
+	int top_level[SBM_MAX_TYPE];
+	struct sched_domain *affine_map[NR_CPUS];
+};
+
 #endif /* CONFIG_SMP */
 
 /*
@@ -416,6 +429,7 @@ struct rq {
 #ifdef CONFIG_SMP
 	struct root_domain *rd;
 	struct sched_domain *sd;
+	struct sched_balance_map *sbm;
 
 	unsigned long cpu_power;
 
-- 
1.7.4.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [RFC PATCH v3 2/3] sched: build schedule balance map
  2013-01-29  9:08 [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair() Michael Wang
  2013-01-29  9:09 ` [RFC PATCH v3 1/3] sched: schedule balance map foundation Michael Wang
@ 2013-01-29  9:09 ` Michael Wang
  2013-01-29  9:10 ` [RFC PATCH v3 3/3] sched: simplify select_task_rq_fair() with " Michael Wang
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 55+ messages in thread
From: Michael Wang @ 2013-01-29  9:09 UTC (permalink / raw)
  To: LKML, Ingo Molnar, Peter Zijlstra
  Cc: Paul Turner, Mike Galbraith, Andrew Morton, alex.shi, Ram Pai,
	Nikunj A. Dadhania, Namhyung Kim

This patch builds the schedule balance map as designed; now
cpu_sbm->sd[f][l] can directly locate the sd of the cpu on level
'l' with flag 'f' supported.

In order to quickly locate the lower sd when changing the base cpu,
levels with an empty sd in the map are filled with the next lower sd.
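
For example (a sketch, assuming 3 levels with SD_BALANCE_WAKE set
only on levels 0 and 2):

	before filling:	sbm->sd[SBM_WAKE_TYPE] = { sd0, NULL, sd2 }
	after filling:	sbm->sd[SBM_WAKE_TYPE] = { sd0, sd0,  sd2 }

so indexing with any level always yields the nearest sd at or below
that level.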

Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
---
 kernel/sched/core.c |   67 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 67 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 092c801..c2a13bc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5578,6 +5578,59 @@ static void update_top_cache_domain(int cpu)
 static int sbm_max_level;
 DEFINE_PER_CPU_SHARED_ALIGNED(struct sched_balance_map, sbm_array);
 
+static void build_sched_balance_map(int cpu)
+{
+	struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
+	struct sched_domain *sd = cpu_rq(cpu)->sd;
+	int i, type, level = 0;
+
+	memset(sbm->top_level, 0, sizeof((*sbm).top_level));
+	memset(sbm->affine_map, 0, sizeof((*sbm).affine_map));
+	for (type = 0; type < SBM_MAX_TYPE; type++) {
+		memset(sbm->sd[type], 0,
+			sizeof(struct sched_domain *) * sbm_max_level);
+	}
+
+	while (sd) {
+		if (!(sd->flags & SD_LOAD_BALANCE))
+			goto next;
+
+		if (sd->flags & SD_BALANCE_EXEC) {
+			sbm->top_level[SBM_EXEC_TYPE] = sd->level;
+			sbm->sd[SBM_EXEC_TYPE][sd->level] = sd;
+		}
+
+		if (sd->flags & SD_BALANCE_FORK) {
+			sbm->top_level[SBM_FORK_TYPE] = sd->level;
+			sbm->sd[SBM_FORK_TYPE][sd->level] = sd;
+		}
+
+		if (sd->flags & SD_BALANCE_WAKE) {
+			sbm->top_level[SBM_WAKE_TYPE] = sd->level;
+			sbm->sd[SBM_WAKE_TYPE][sd->level] = sd;
+		}
+
+		if (sd->flags & SD_WAKE_AFFINE) {
+			for_each_cpu(i, sched_domain_span(sd)) {
+				if (!sbm->affine_map[i])
+					sbm->affine_map[i] = sd;
+			}
+		}
+
+next:		sd = sd->parent;
+	}
+
+	/*
+	 * Fill the holes, so that the lower level sd is easy to get.
+	 */
+	for (type = 0; type < SBM_MAX_TYPE; type++) {
+		for (level = 1; level < sbm_max_level; level++) {
+			if (!sbm->sd[type][level])
+				sbm->sd[type][level] = sbm->sd[type][level - 1];
+		}
+	}
+}
+
 /*
  * Attach the domain 'sd' to 'cpu' as its base domain. Callers must
  * hold the hotplug lock.
@@ -5587,6 +5640,9 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 	struct sched_domain *tmp;
+	struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
+
+	rcu_assign_pointer(rq->sbm, NULL);
 
 	/* Remove the sched domains which do not contribute to scheduling. */
 	for (tmp = sd; tmp; ) {
@@ -5619,6 +5675,17 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 	destroy_sched_domains(tmp, cpu);
 
 	update_top_cache_domain(cpu);
+
+	/* disable sbm if initialization failed or sd detached. */
+	if (!sbm_max_level || !sd)
+		return;
+
+	/*
+	 * synchronize_rcu() is unnecessary here, since
+	 * destroy_sched_domains() already does the work.
+	 */
+	build_sched_balance_map(cpu);
+	rcu_assign_pointer(rq->sbm, sbm);
 }
 
 /* cpus with isolated domains */
-- 
1.7.4.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [RFC PATCH v3 3/3] sched: simplify select_task_rq_fair() with schedule balance map
  2013-01-29  9:08 [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair() Michael Wang
  2013-01-29  9:09 ` [RFC PATCH v3 1/3] sched: schedule balance map foundation Michael Wang
  2013-01-29  9:09 ` [RFC PATCH v3 2/3] sched: build schedule balance map Michael Wang
@ 2013-01-29  9:10 ` Michael Wang
  2013-02-18  5:52 ` [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair() Michael Wang
  2013-02-20 10:49 ` Ingo Molnar
  4 siblings, 0 replies; 55+ messages in thread
From: Michael Wang @ 2013-01-29  9:10 UTC (permalink / raw)
  To: LKML, Ingo Molnar, Peter Zijlstra
  Cc: Paul Turner, Mike Galbraith, Andrew Morton, alex.shi, Ram Pai,
	Nikunj A. Dadhania, Namhyung Kim

Since the schedule balance map provides an approach to get the
proper sd directly, simplifying the code of select_task_rq_fair()
is possible.

The new code is designed to preserve most of the old logic, but it
gets rid of those 'for' loops by using the schedule balance map to
locate the proper sd directly.

Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
---
 kernel/sched/fair.c |  135 ++++++++++++++++++++++++++++-----------------------
 1 files changed, 74 insertions(+), 61 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5eea870..0935c7d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3302,100 +3302,113 @@ done:
 }
 
 /*
- * sched_balance_self: balance the current task (running on cpu) in domains
- * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
- * SD_BALANCE_EXEC.
+ * select_task_rq_fair()
+ *		select a proper cpu for the task to run on.
  *
- * Balance, ie. select the least loaded group.
- *
- * Returns the target CPU number, or the same CPU if no balancing is needed.
- *
- * preempt must be disabled.
+ *	p		-- the task we are going to select a cpu for
+ *	sd_flag		-- indicates the context: WAKE, EXEC or FORK.
+ *	wake_flags	-- we only care about WF_SYNC currently
  */
 static int
 select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 {
-	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
+	struct sched_domain *sd = NULL;
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	int new_cpu = cpu;
-	int want_affine = 0;
 	int sync = wake_flags & WF_SYNC;
+	struct sched_balance_map *sbm = NULL;
+	int type = 0;
 
 	if (p->nr_cpus_allowed == 1)
 		return prev_cpu;
 
-	if (sd_flag & SD_BALANCE_WAKE) {
-		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
-			want_affine = 1;
-		new_cpu = prev_cpu;
-	}
+	if (sd_flag & SD_BALANCE_EXEC)
+		type = SBM_EXEC_TYPE;
+	else if (sd_flag & SD_BALANCE_FORK)
+		type = SBM_FORK_TYPE;
+	else if (sd_flag & SD_BALANCE_WAKE)
+		type = SBM_WAKE_TYPE;
 
 	rcu_read_lock();
-	for_each_domain(cpu, tmp) {
-		if (!(tmp->flags & SD_LOAD_BALANCE))
-			continue;
 
+	sbm = cpu_rq(cpu)->sbm;
+	if (!sbm)
+		goto unlock;
+
+	if (sd_flag & SD_BALANCE_WAKE) {
 		/*
-		 * If both cpu and prev_cpu are part of this domain,
-		 * cpu is a valid SD_WAKE_AFFINE target.
+		 * Tasks to be woken are special: the memory they rely on
+		 * may already be cached on prev_cpu, and they usually
+		 * require low latency.
+		 *
+		 * So first try to locate an idle cpu sharing the cache
+		 * with prev_cpu. This has a chance to break the load
+		 * balance; fortunately, select_idle_sibling() searches
+		 * from top to bottom, which helps to reduce that chance
+		 * in some cases.
 		 */
-		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
-		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
-			affine_sd = tmp;
-			break;
-		}
+		new_cpu = select_idle_sibling(p, prev_cpu);
+		if (idle_cpu(new_cpu))
+			goto unlock;
 
-		if (tmp->flags & sd_flag)
-			sd = tmp;
-	}
+		/*
+		 * No idle cpu was found in the topology of prev_cpu;
+		 * try searching again in the topology of the current
+		 * cpu, if it is affine to prev_cpu.
+		 */
+		if (cpu == prev_cpu || !sbm->affine_map[prev_cpu] ||
+				!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+			goto unlock;
 
-	if (affine_sd) {
-		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
-			prev_cpu = cpu;
+		new_cpu = select_idle_sibling(p, cpu);
 
-		new_cpu = select_idle_sibling(p, prev_cpu);
+		/*
+		 * Finally invoke wake_affine(), since it is without
+		 * doubt a performance killer.
+		 */
+		if (idle_cpu(new_cpu) &&
+				wake_affine(sbm->affine_map[prev_cpu], p, sync))
+			goto unlock;
+
+		/*
+		 * Failed to locate an idle cpu in the topology of either
+		 * cpu or prev_cpu; since the benefit of balancing cannot
+		 * be estimated, just adopt prev_cpu.
+		 */
+		new_cpu = prev_cpu;
 		goto unlock;
 	}
 
+	/* Balance path, only for FORK and EXEC. */
+	new_cpu = (sd_flag & SD_BALANCE_WAKE) ? prev_cpu : cpu;
+	sd = sbm->sd[type][sbm->top_level[type]];
+
 	while (sd) {
 		int load_idx = sd->forkexec_idx;
-		struct sched_group *group;
-		int weight;
-
-		if (!(sd->flags & sd_flag)) {
-			sd = sd->child;
-			continue;
-		}
+		struct sched_group *sg = NULL;
 
 		if (sd_flag & SD_BALANCE_WAKE)
 			load_idx = sd->wake_idx;
 
-		group = find_idlest_group(sd, p, cpu, load_idx);
-		if (!group) {
-			sd = sd->child;
-			continue;
-		}
+		sg = find_idlest_group(sd, p, cpu, load_idx);
+		if (!sg)
+			goto next_sd;
 
-		new_cpu = find_idlest_cpu(group, p, cpu);
-		if (new_cpu == -1 || new_cpu == cpu) {
-			/* Now try balancing at a lower domain level of cpu */
-			sd = sd->child;
-			continue;
-		}
+		new_cpu = find_idlest_cpu(sg, p, cpu);
+		if (new_cpu != -1)
+			cpu = new_cpu;
+next_sd:
+		if (!sd->level)
+			break;
 
-		/* Now try balancing at a lower domain level of new_cpu */
-		cpu = new_cpu;
-		weight = sd->span_weight;
-		sd = NULL;
-		for_each_domain(cpu, tmp) {
-			if (weight <= tmp->span_weight)
-				break;
-			if (tmp->flags & sd_flag)
-				sd = tmp;
-		}
-		/* while loop will break here if sd == NULL */
+		sbm = cpu_rq(cpu)->sbm;
+		if (!sbm)
+			break;
+
+		sd = sbm->sd[type][sd->level - 1];
 	}
+
 unlock:
 	rcu_read_unlock();
 
-- 
1.7.4.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-01-29  9:08 [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair() Michael Wang
                   ` (2 preceding siblings ...)
  2013-01-29  9:10 ` [RFC PATCH v3 3/3] sched: simplify select_task_rq_fair() with " Michael Wang
@ 2013-02-18  5:52 ` Michael Wang
  2013-02-20 10:49 ` Ingo Molnar
  4 siblings, 0 replies; 55+ messages in thread
From: Michael Wang @ 2013-02-18  5:52 UTC (permalink / raw)
  To: LKML, Ingo Molnar, Peter Zijlstra
  Cc: Paul Turner, Mike Galbraith, Andrew Morton, alex.shi, Ram Pai,
	Nikunj A. Dadhania, Namhyung Kim

On 01/29/2013 05:08 PM, Michael Wang wrote:
> v3 change log:
> 	Fix small logic issues (Thanks to Mike Galbraith).
> 	Change the way of handling WAKE.
> 
> This patch set simplifies select_task_rq_fair() with a schedule
> balance map.
> 
> After getting rid of the complex code and reorganizing the logic,
> pgbench shows an improvement; the more clients, the bigger the
> improvement.
> 
> [snip benchmark numbers]
> 
> Please check the patches for more details about the schedule balance map.
> 
> The NUMA domain is supported but not well tested.
> The rebuild of domains is supported but not tested.

Hi Ingo, Peter,

I've finished the tests I could figure out (NUMA, domain rebuild...);
no issues appeared on my box.

I think this patch set will benefit the system, especially when there
is a huge number of cpus.

What do you think about this idea? Do you have any comments on the
patch set?

Regards,
Michael Wang

> [snip]


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-01-29  9:08 [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair() Michael Wang
                   ` (3 preceding siblings ...)
  2013-02-18  5:52 ` [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair() Michael Wang
@ 2013-02-20 10:49 ` Ingo Molnar
  2013-02-20 13:32   ` Peter Zijlstra
  2013-02-21  4:51   ` Michael Wang
  4 siblings, 2 replies; 55+ messages in thread
From: Ingo Molnar @ 2013-02-20 10:49 UTC (permalink / raw)
  To: Michael Wang
  Cc: LKML, Peter Zijlstra, Paul Turner, Mike Galbraith, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim


* Michael Wang <wangyun@linux.vnet.ibm.com> wrote:

> v3 change log:
> 	Fix small logic issues (Thanks to Mike Galbraith).
> 	Change the way of handling WAKE.
> 
> This patch set simplifies select_task_rq_fair() with a schedule
> balance map.
> 
> After getting rid of the complex code and reorganizing the logic,
> pgbench shows an improvement; the more clients, the bigger the
> improvement.
> 
> [snip benchmark numbers]
> 
> Please check the patches for more details about the schedule balance map.

The changes look clean and reasonable, any ideas exactly *why* it 
speeds up?

I.e. are there one or two key changes in the before/after logic 
and scheduling patterns that you can identify as causing the 
speedup?

Such changes also typically have a chance to cause regressions 
in other workloads - when that happens we need this kind of 
information to be able to enact plan-B.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 1/3] sched: schedule balance map foundation
  2013-01-29  9:09 ` [RFC PATCH v3 1/3] sched: schedule balance map foundation Michael Wang
@ 2013-02-20 13:21   ` Peter Zijlstra
  2013-02-21  4:52     ` Michael Wang
  2013-02-20 13:25   ` Peter Zijlstra
  1 sibling, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2013-02-20 13:21 UTC (permalink / raw)
  To: Michael Wang
  Cc: LKML, Ingo Molnar, Paul Turner, Mike Galbraith, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On Tue, 2013-01-29 at 17:09 +0800, Michael Wang wrote:
> +       for_each_possible_cpu(cpu) {
> +               sbm = &per_cpu(sbm_array, cpu);
> +               node = cpu_to_node(cpu);
> +               size = sizeof(struct sched_domain *) * sbm_max_level;
> +
> +               for (type = 0; type < SBM_MAX_TYPE; type++) {
> +                       sbm->sd[type] = kmalloc_node(size, GFP_KERNEL, node);
> +                       WARN_ON(!sbm->sd[type]);
> +                       if (!sbm->sd[type])
> +                               goto failed;
> +               }
> +       }

You can't readily use kmalloc_node() here, cpu_to_node() might return an
invalid node for offline cpus.

Also see: 2ea45800d8e1c3c51c45a233d6bd6289a297a386


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 1/3] sched: schedule balance map foundation
  2013-01-29  9:09 ` [RFC PATCH v3 1/3] sched: schedule balance map foundation Michael Wang
  2013-02-20 13:21   ` Peter Zijlstra
@ 2013-02-20 13:25   ` Peter Zijlstra
  2013-02-21  4:58     ` Michael Wang
  1 sibling, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2013-02-20 13:25 UTC (permalink / raw)
  To: Michael Wang
  Cc: LKML, Ingo Molnar, Paul Turner, Mike Galbraith, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On Tue, 2013-01-29 at 17:09 +0800, Michael Wang wrote:
> +struct sched_balance_map {
> +       struct sched_domain **sd[SBM_MAX_TYPE];
> +       int top_level[SBM_MAX_TYPE];
> +       struct sched_domain *affine_map[NR_CPUS];
> +};

Argh.. affine_map is O(n^2) in nr_cpus, that's not cool.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-20 10:49 ` Ingo Molnar
@ 2013-02-20 13:32   ` Peter Zijlstra
  2013-02-20 14:05     ` Mike Galbraith
  2013-02-21  5:14     ` Michael Wang
  2013-02-21  4:51   ` Michael Wang
  1 sibling, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2013-02-20 13:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Michael Wang, LKML, Paul Turner, Mike Galbraith, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On Wed, 2013-02-20 at 11:49 +0100, Ingo Molnar wrote:

> The changes look clean and reasonable, 

I don't necessarily agree, note the O(n^2) storage requirement that
Michael failed to highlight ;-)

> any ideas exactly *why* it speeds up?

That is indeed the most interesting part.. There's two parts to
select_task_rq_fair(), the 'regular' affine wakeup path, and the
fork/exec find_idlest_goo() path. At the very least we need to quantify
which of these two parts contributes most to the speedup.

In the power balancing discussion we already noted that the
find_idlest_goo() is in need of attention.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-20 13:32   ` Peter Zijlstra
@ 2013-02-20 14:05     ` Mike Galbraith
  2013-02-21  5:21       ` Michael Wang
  2013-02-21  5:14     ` Michael Wang
  1 sibling, 1 reply; 55+ messages in thread
From: Mike Galbraith @ 2013-02-20 14:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Michael Wang, LKML, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On Wed, 2013-02-20 at 14:32 +0100, Peter Zijlstra wrote: 
> On Wed, 2013-02-20 at 11:49 +0100, Ingo Molnar wrote:
> 
> > The changes look clean and reasonable, 
> 
> I don't necessarily agree, note the O(n^2) storage requirement that
> Michael failed to highlight ;-)

(yeah, I mentioned that needs to shrink.. a lot)

> > any ideas exactly *why* it speeds up?
> 
> That is indeed the most interesting part.. There's two parts to
> select_task_rq_fair(), the 'regular' affine wakeup path, and the
> fork/exec find_idlest_goo() path. At the very least we need to quantify
> which of these two parts contributes most to the speedup.
> 
> In the power balancing discussion we already noted that the
> find_idlest_goo() is in need of attention.

Yup, even little stuff like break off the search when load is zero..
unless someone is planning on implementing anti-idle 'course ;-)

-Mike



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-20 10:49 ` Ingo Molnar
  2013-02-20 13:32   ` Peter Zijlstra
@ 2013-02-21  4:51   ` Michael Wang
  2013-02-21  6:11     ` Mike Galbraith
  2013-02-21 10:20     ` Peter Zijlstra
  1 sibling, 2 replies; 55+ messages in thread
From: Michael Wang @ 2013-02-21  4:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Peter Zijlstra, Paul Turner, Mike Galbraith, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On 02/20/2013 06:49 PM, Ingo Molnar wrote:
[snip]
> 
> The changes look clean and reasonable, any ideas exactly *why* it 
> speeds up?
> 
> I.e. are there one or two key changes in the before/after logic 
> and scheduling patterns that you can identify as causing the 
> speedup?

Hi Ingo,

Thanks for your reply; please let me point out the key changes here
(forgive me for not having written a good description in the cover
letter).

The performance improvements from this patch set are:
1. delaying the invocation of wake_affine().
2. saving the loop over domains to gain the proper sd.

The second point is obvious, and will benefit a lot when the sd
topology is deep (NUMA is supposed to make it deeper on large
systems).

So in my testing on a 12 cpu box, actually most of the benefit comes
from the first point; please let me introduce it in detail.

The old logic when locating affine_sd is:

	if prev_cpu != curr_cpu
		if wake_affine()
			prev_cpu = curr_cpu
	new_cpu = select_idle_sibling(prev_cpu)
	return new_cpu

The new logic is the same as the old one if prev_cpu == curr_cpu, so
let's simplify the old logic to:

	if wake_affine()
		new_cpu = select_idle_sibling(curr_cpu)
	else
		new_cpu = select_idle_sibling(prev_cpu)

	return new_cpu

Actually that doesn't make sense.

I think wake_affine() is trying to check whether moving a task from
prev_cpu to curr_cpu will break the balance in affine_sd or not, but
why does not breaking the balance mean that curr_cpu is better than
prev_cpu for searching for an idle cpu?

So the new logic in this patch set is:

	new_cpu = select_idle_sibling(prev_cpu)
	if idle_cpu(new_cpu)
		return new_cpu

	new_cpu = select_idle_sibling(curr_cpu)
	if idle_cpu(new_cpu) {
		if wake_affine()
			return new_cpu
	}

	return prev_cpu

And now, unless we are really going to move load from prev_cpu to
curr_cpu, we won't use wake_affine() any more.

So we avoid wake_affine() when the system load is low or high; for
middle load, the worst case is failing to locate an idle cpu in
prev_cpu's topology but succeeding in locating one in curr_cpu's,
but that rarely happens, and the benchmark results prove that point.

Some comparisons below:

1. system load is low
	old logic cost:
		wake_affine()
		select_idle_sibling()
	new logic cost:
		select_idle_sibling()

2. system load is high
	old logic cost:
		wake_affine()
		select_idle_sibling()
	new logic cost:
		select_idle_sibling()
		select_idle_sibling()

3. system load is middle
	don't know

Case 1 saves the cost of wake_affine(); for case 3, the benchmarks
prove there is at least no regression.

For case 2, it's a comparison between wake_affine() and
select_idle_sibling(); since the system load is high, wake_affine()
costs far more than select_idle_sibling(), and we save a lot
according to the benchmark results.

> 
> Such changes also typically have a chance to cause regressions 
> in other workloads - when that happens we need this kind of 
> information to be able to enact plan-B.

The benefit comes from avoiding unnecessary work; the patch set is
supposed to only reduce the cost of a key function with the least
logic change. I cannot promise it benefits all workloads, but so far
I have not found a regression.

Regards,
Michael Wang

> 
> Thanks,
> 
> 	Ingo
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 1/3] sched: schedule balance map foundation
  2013-02-20 13:21   ` Peter Zijlstra
@ 2013-02-21  4:52     ` Michael Wang
  0 siblings, 0 replies; 55+ messages in thread
From: Michael Wang @ 2013-02-21  4:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Paul Turner, Mike Galbraith, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On 02/20/2013 09:21 PM, Peter Zijlstra wrote:
> On Tue, 2013-01-29 at 17:09 +0800, Michael Wang wrote:
>> +       for_each_possible_cpu(cpu) {
>> +               sbm = &per_cpu(sbm_array, cpu);
>> +               node = cpu_to_node(cpu);
>> +               size = sizeof(struct sched_domain *) * sbm_max_level;
>> +
>> +               for (type = 0; type < SBM_MAX_TYPE; type++) {
>> +                       sbm->sd[type] = kmalloc_node(size, GFP_KERNEL, node);
>> +                       WARN_ON(!sbm->sd[type]);
>> +                       if (!sbm->sd[type])
>> +                               goto failed;
>> +               }
>> +       }
> 
> You can't readily use kmalloc_node() here, cpu_to_node() might return an
> invalid node for offline cpus.
> 
> Also see: 2ea45800d8e1c3c51c45a233d6bd6289a297a386

Hi Peter,

Thanks for your reply; I hadn't noticed this point. Mike suggested
doing the allocation in a notifier when the cpu comes online; I will
try to use that idea in the formal patch set.
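
A minimal sketch of that direction, for illustration (sbm_alloc_cpu()
is a hypothetical helper wrapping the per-type kmalloc_node() loop
from patch 1/3; this uses the old 3.x hotplug notifier API):

	static int sbm_cpu_notify(struct notifier_block *nfb,
				  unsigned long action, void *hcpu)
	{
		int cpu = (long)hcpu;

		switch (action & ~CPU_TASKS_FROZEN) {
		case CPU_UP_PREPARE:
			/* cpu_to_node() is valid here, the cpu is coming up */
			if (sbm_alloc_cpu(cpu))
				return notifier_from_errno(-ENOMEM);
			break;
		}
		return NOTIFY_OK;
	}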

Regards,
Michael Wang



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 1/3] sched: schedule balance map foundation
  2013-02-20 13:25   ` Peter Zijlstra
@ 2013-02-21  4:58     ` Michael Wang
  2013-02-21 11:37       ` Peter Zijlstra
  0 siblings, 1 reply; 55+ messages in thread
From: Michael Wang @ 2013-02-21  4:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Paul Turner, Mike Galbraith, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On 02/20/2013 09:25 PM, Peter Zijlstra wrote:
> On Tue, 2013-01-29 at 17:09 +0800, Michael Wang wrote:
>> +struct sched_balance_map {
>> +       struct sched_domain **sd[SBM_MAX_TYPE];
>> +       int top_level[SBM_MAX_TYPE];
>> +       struct sched_domain *affine_map[NR_CPUS];
>> +};
> 
> Argh.. affine_map is O(n^2) in nr_cpus, that's not cool.

You are right, it costs space in order to accelerate the system. I've
calculated the cost once before (I'm really not good at this, please
let me know if I make any silly calculations...); the size of the
struct is:

SBM_MAX_TYPE * size of pointer * domain level
SBM_MAX_TYPE * size of int
NR_CPUS * size of pointer
padding

So for my 64-bit box, which has 12 cpus and 3 domain levels, the
struct size is:

3 * size of pointer * 3 = 9 pointer
3 * size of int		= 3 int
12 * size of pointer	= 12 pointer
padding

			= 3 int + 21 pointer + padding

And the final cost, over all 12 cpus, is 36 ints and 252 pointers;
add some padding and it won't be over 5K, not a big deal.

Now suppose a big 64-bit system with 1000 cpus and 10 levels (I have
no idea how to calculate levels from nodes; 10 is big in my mind...);
the struct size is:

3 * size of pointer * 10 = 30 pointer
3 * size of int		= 3 int
1000 * size of pointer	= 1000 pointer
padding

			= 3 int + 1030 pointer + padding

And the final cost, over all 1000 cpus, is 3000 ints and 1030000
pointers, plus some padding, but it won't be bigger than 10M; not a
big deal for a system with 1000 cpus either.
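
To make the arithmetic concrete, the per-cpu size in C terms would be
roughly (a sketch only; sbm_size() is a made-up helper, not part of
the patches):

	static size_t sbm_size(int levels, int nr_cpus)
	{
		return SBM_MAX_TYPE * sizeof(struct sched_domain *) * levels
		     + SBM_MAX_TYPE * sizeof(int)		/* top_level[] */
		     + nr_cpus * sizeof(struct sched_domain *);	/* affine_map[] */
	}

and the total is that multiplied by the number of cpus.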

Regards,
Michael Wang



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-20 13:32   ` Peter Zijlstra
  2013-02-20 14:05     ` Mike Galbraith
@ 2013-02-21  5:14     ` Michael Wang
  1 sibling, 0 replies; 55+ messages in thread
From: Michael Wang @ 2013-02-21  5:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Paul Turner, Mike Galbraith, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On 02/20/2013 09:32 PM, Peter Zijlstra wrote:
> On Wed, 2013-02-20 at 11:49 +0100, Ingo Molnar wrote:
> 
>> The changes look clean and reasonable, 
> 
> I don't necessarily agree, note the O(n^2) storage requirement that
> Michael failed to highlight ;-)

Forgive me for not explaining this point in the cover letter, but
it's really not a big deal in my opinion...

And I'm going to apply Mike's suggestion of doing the allocation when
the cpu becomes active; that will save some space :)

Regards,
Michael Wang

> 
>> any ideas exactly *why* it speeds up?
> 
> That is indeed the most interesting part.. There's two parts to
> select_task_rq_fair(), the 'regular' affine wakeup path, and the
> fork/exec find_idlest_goo() path. At the very least we need to quantify
> which of these two parts contributes most to the speedup.
> 
> In the power balancing discussion we already noted that the
> find_idlest_goo() is in need of attention.
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-20 14:05     ` Mike Galbraith
@ 2013-02-21  5:21       ` Michael Wang
  0 siblings, 0 replies; 55+ messages in thread
From: Michael Wang @ 2013-02-21  5:21 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On 02/20/2013 10:05 PM, Mike Galbraith wrote:
> On Wed, 2013-02-20 at 14:32 +0100, Peter Zijlstra wrote: 
>> On Wed, 2013-02-20 at 11:49 +0100, Ingo Molnar wrote:
>>
>>> The changes look clean and reasonable, 
>>
>> I don't necessarily agree, note the O(n^2) storage requirement that
>> Michael failed to highlight ;-)
> 
> (yeah, I mentioned that needs to shrink.. a lot)

Exactly, and I'm going to apply the suggestion now :)

> 
>>> any ideas exactly *why* it speeds up?
>>
>> That is indeed the most interesting part.. There's two parts to
>> select_task_rq_fair(), the 'regular' affine wakeup path, and the
>> fork/exec find_idlest_goo() path. At the very least we need to quantify
>> which of these two parts contributes most to the speedup.
>>
>> In the power balancing discussion we already noted that the
>> find_idlest_goo() is in need of attention.
> 
> Yup, even little stuff like break off the search when load is zero..

Agreed, searching in a bunch of idle cpus and their subsets doesn't
make sense...

Regards,
Michael Wang

> unless someone is planning on implementing anti-idle 'course ;-)
> 
> -Mike
> 
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-21  4:51   ` Michael Wang
@ 2013-02-21  6:11     ` Mike Galbraith
  2013-02-21  7:00       ` Michael Wang
  2013-02-21 10:20     ` Peter Zijlstra
  1 sibling, 1 reply; 55+ messages in thread
From: Mike Galbraith @ 2013-02-21  6:11 UTC (permalink / raw)
  To: Michael Wang
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote: 
> [snip]
> 
> The new logic is the same as the old one if prev_cpu == curr_cpu, so
> let's simplify the old logic to:
> 
> 	if wake_affine()
> 		new_cpu = select_idle_sibling(curr_cpu)
> 	else
> 		new_cpu = select_idle_sibling(prev_cpu)
> 
> 	return new_cpu
> 
> Actually that doesn't make sense.
> 
> I think wake_affine() is trying to check whether moving a task from
> prev_cpu to curr_cpu will break the balance in affine_sd or not, but
> why does not breaking the balance mean that curr_cpu is better than
> prev_cpu for searching for an idle cpu?

You could argue that it's impossible to break balance by moving any task
to any idle cpu, but that would mean bouncing tasks cross node on every
wakeup is fine, which it isn't.

> So the new logic in this patch set is:
> 
> 	new_cpu = select_idle_sibling(prev_cpu)
> 	if idle_cpu(new_cpu)
> 		return new_cpu

So you tilted the scales in favor of leaving tasks in their current
package, which should benefit large footprint tasks, but should also
penalize light communicating tasks.

I suspect that much of the pgbench improvement comes from the preemption
mitigation from keeping 1:N load maximally spread, which is the perfect
thing to do with such loads.  In all the testing I ever did with it in
1:N mode, preemption dominated performance numbers.  Keep server away
from clients, it has fewer fair competition worries, can consume more
CPU preemption free, pushing the load collapse point strongly upward.

-Mike


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-21  6:11     ` Mike Galbraith
@ 2013-02-21  7:00       ` Michael Wang
  2013-02-21  8:10         ` Mike Galbraith
  0 siblings, 1 reply; 55+ messages in thread
From: Michael Wang @ 2013-02-21  7:00 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On 02/21/2013 02:11 PM, Mike Galbraith wrote:
> On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote: 
>> On 02/20/2013 06:49 PM, Ingo Molnar wrote:
>> [snip]
[snip]
>>
>> 	if wake_affine()
>> 		new_cpu = select_idle_sibling(curr_cpu)
>> 	else
>> 		new_cpu = select_idle_sibling(prev_cpu)
>>
>> 	return new_cpu
>>
>> Actually that doesn't make sense.
>>
>> I think wake_affine() is trying to check whether moving a task from
>> prev_cpu to curr_cpu will break the balance in affine_sd or not, but
>> why does not breaking the balance mean that curr_cpu is better than
>> prev_cpu for searching for an idle cpu?
> 
> You could argue that it's impossible to break balance by moving any task
> to any idle cpu, but that would mean bouncing tasks cross node on every
> wakeup is fine, which it isn't.

I don't get it... could you please give me more detail on how
wake_affine() is related to bouncing?

> 
>> So the new logic in this patch set is:
>>
>> 	new_cpu = select_idle_sibling(prev_cpu)
>> 	if idle_cpu(new_cpu)
>> 		return new_cpu
> 
> So you tilted the scales in favor of leaving tasks in their current
> package, which should benefit large footprint tasks, but should also
> penalize light communicating tasks.

Yes, I'd prefer to wake up the task on a cpu which is:
1. idle
2. close to prev_cpu

So if both curr_cpu and prev_cpu have an idle cpu in their topology,
which one is better? That depends on how the task benefits from the
cache and on the balance situation; whatever the answer, I don't
think the benefit is worth the high cost of wake_affine() in most
cases...

Regards,
Michael Wang

> [snip]


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-21  7:00       ` Michael Wang
@ 2013-02-21  8:10         ` Mike Galbraith
  2013-02-21  9:08           ` Michael Wang
  2013-02-21  9:20           ` Michael Wang
  0 siblings, 2 replies; 55+ messages in thread
From: Mike Galbraith @ 2013-02-21  8:10 UTC (permalink / raw)
  To: Michael Wang
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On Thu, 2013-02-21 at 15:00 +0800, Michael Wang wrote: 
> On 02/21/2013 02:11 PM, Mike Galbraith wrote:
> > On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote: 
> >> On 02/20/2013 06:49 PM, Ingo Molnar wrote:
> >> [snip]
> [snip]
> >> [snip]
> > 
> > You could argue that it's impossible to break balance by moving any task
> > to any idle cpu, but that would mean bouncing tasks cross node on every
> > wakeup is fine, which it isn't.
> 
> I don't get it... could you please give me more detail on how
> wake_affine() is related to bouncing?

If we didn't ever ask if it's ok, we'd always pull, and stack load up on
one package if there's the tiniest of holes to stuff a task into,
periodic balance forcibly rips buddies back apart, repeat.  At least
with wake_affine() in the loop, there's somewhat of a damper. 

> >> So the new logic in this patch set is:
> >>
> >> 	new_cpu = select_idle_sibling(prev_cpu)
> >> 	if idle_cpu(new_cpu)
> >> 		return new_cpu
> > 
> > So you tilted the scales in favor of leaving tasks in their current
> > package, which should benefit large footprint tasks, but should also
> > penalize light communicating tasks.
> 
> Yes, I'd prefer to wake up the task on a cpu which is:
> 1. idle
> 2. close to prev_cpu
> 
> So if both curr_cpu and prev_cpu have an idle cpu in their topology,
> which one is better? That depends on how the task benefits from the
> cache and on the balance situation; whatever the answer, I don't
> think the benefit is worth the high cost of wake_affine() in most
> cases...

We've always used wake_affine() before, yet been able to schedule at
high frequency, so I don't see that it can be _that_ expensive.  I
haven't actually measured lately (loooong time) though.

WRT cost/benefit of migration, yeah, it depends entirely on the tasks,
some will gain, some will lose.  On a modern single processor box, it
just doesn't matter, there's only one llc (two s_i_s() calls = oopsie),
but on my beloved old Q6600 or a big box, it'll matter a lot to
something.  NUMA balance will deal with big boxen, my trusty old Q6600
will likely get all upset with some localhost stuff.

-Mike


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-21  8:10         ` Mike Galbraith
@ 2013-02-21  9:08           ` Michael Wang
  2013-02-21  9:43             ` Mike Galbraith
  2013-02-21  9:20           ` Michael Wang
  1 sibling, 1 reply; 55+ messages in thread
From: Michael Wang @ 2013-02-21  9:08 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On 02/21/2013 04:10 PM, Mike Galbraith wrote:
> On Thu, 2013-02-21 at 15:00 +0800, Michael Wang wrote: 
>> On 02/21/2013 02:11 PM, Mike Galbraith wrote:
>>> On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote: 
>>>> On 02/20/2013 06:49 PM, Ingo Molnar wrote:
>>>> [snip]
>> [snip]
>>>
>>> You could argue that it's impossible to break balance by moving any task
>>> to any idle cpu, but that would mean bouncing tasks cross node on every
>>> wakeup is fine, which it isn't.
>>
>> I don't get it... could you please give me more detail on how
>> wake_affine() is related to bouncing?
> 
> If we didn't ever ask if it's ok, we'd always pull, and stack load up on
> one package if there's the tiniest of holes to stuff a task into,
> periodic balance forcibly rips buddies back apart, repeat.  At least
> with wake_affine() in the loop, there's somewhat of a damper. 

I think I got your point: it's a question about the possibility of
locating an idle cpu which belongs to a different package from
prev_cpu, and how wake_affine() helps with that.

The old logic requires prev_cpu and curr_cpu to be affine, which
means they share caches on some level, usually meaning they are in
the same package, and select_idle_sibling() starts searching from
the top cache sharing level, which usually means it searches all the
smp siblings in one package; so, usually, both of the cases will
only locate an idle cpu in their own package.

I could not figure out a case where wake_affine() could benefit a
lot, but it does harm a lot according to the testing results.

> 
>>>> So the new logic in this patch set is:
>>>>
>>>> 	new_cpu = select_idle_sibling(prev_cpu)
>>>> 	if idle_cpu(new_cpu)
>>>> 		return new_cpu
>>>
>>> So you tilted the scales in favor of leaving tasks in their current
>>> package, which should benefit large footprint tasks, but should also
>>> penalize light communicating tasks.
>>
>> Yes, I'd prefer to wake up the task on a cpu which is:
>> 1. idle
>> 2. close to prev_cpu
>>
>> So if both curr_cpu and prev_cpu have an idle cpu in their topology,
>> which one is better? That depends on how the task benefits from the
>> cache and on the balance situation; whatever the answer, I don't
>> think the benefit is worth the high cost of wake_affine() in most
>> cases...
> 
> We've always used wake_affine() before, yet been able to schedule at
> high frequency, so I don't see that it can be _that_ expensive.  I
> haven't actually measured lately (loooong time) though.
> 
> WRT cost/benefit of migration, yeah, it depends entirely on the tasks,
> some will gain, some will lose.  On a modern single processor box, it
> just doesn't matter, there's only one llc (two s_i_s() calls = oopsie),
> but on my beloved old Q6600 or a big box, it'll matter a lot to
> something.  NUMA balance will deal with big boxen, my trusty old Q6600
> will likely get all upset with some localhost stuff.

But does this patch set really cause a regression on your Q6600? It
may have sacrificed some things, but I still think it will benefit
far more, especially on huge systems.

Regards,
Michael Wang




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-21  8:10         ` Mike Galbraith
  2013-02-21  9:08           ` Michael Wang
@ 2013-02-21  9:20           ` Michael Wang
  1 sibling, 0 replies; 55+ messages in thread
From: Michael Wang @ 2013-02-21  9:20 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On 02/21/2013 04:10 PM, Mike Galbraith wrote:
> On Thu, 2013-02-21 at 15:00 +0800, Michael Wang wrote: 
>> On 02/21/2013 02:11 PM, Mike Galbraith wrote:
>>> On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote: 
>>>> On 02/20/2013 06:49 PM, Ingo Molnar wrote:
>>>> [snip]
>> [snip]
> 
> If we didn't ever ask if it's ok, we'd always pull, and stack load up on
> one package if there's the tiniest of holes to stuff a task into,
> periodic balance forcibly rips buddies back apart, repeat.  At least
> with wake_affine() in the loop, there's somewhat of a damper. 

Oh, I think I see the reason why the old logic checks affine_sd
first: when prev_cpu and curr_cpu belong to different packages,
there is still a chance to enter the balance path. That seems like a
solution for this problem, amazing ;-)

You are right, without this logic, chances to balance load between
packages would be missed; I will apply it in the next version.
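
For illustration, a rough sketch of the recovered flow, in the pseudocode
style used earlier in the thread (cache_affine() and wake_balance() are
made-up names here, not the patch's actual code):

	if (!cache_affine(prev_cpu, curr_cpu))
		/* cross-package wakeup: give the balance path a chance */
		return wake_balance(prev_cpu, curr_cpu);

	/* same package: prefer an idle cpu close to prev_cpu */
	new_cpu = select_idle_sibling(prev_cpu);
	if (idle_cpu(new_cpu))
		return new_cpu;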

Regards,
Michael Wang

> 
>>>> So the new logical in this patch set is:
>>>>
>>>> 	new_cpu = select_idle_sibling(prev_cpu)
>>>> 	if idle_cpu(new_cpu)
>>>> 		return new_cpu
>>>
>>> So you tilted the scales in favor of leaving tasks in their current
>>> package, which should benefit large footprint tasks, but should also
>>> penalize light communicating tasks.
>>
>> Yes, I'd prefer to wake up the task on a cpu which is:
>> 1. idle
>> 2. close to prev_cpu
>>
>> So if both curr_cpu and prev_cpu have an idle cpu in their topology, which
>> one is better? That depends on how the task benefits from cache and on the
>> balance situation; either way, I don't think the benefit is worth the high
>> cost of wake_affine() in most cases...
> 
> We've always used wake_affine() before, yet been able to schedule at
> high frequency, so I don't see that it can be _that_ expensive.  I
> haven't actually measured lately (loooong time) though.
> 
> WRT cost/benefit of migration, yeah, it depends entirely on the tasks,
> some will gain, some will lose.  On a modern single processor box, it
> just doesn't matter, there's only one llc (two s_i_s() calls = oopsie),
> but on my beloved old Q6600 or a big box, it'll matter a lot to
> something.  NUMA balance will deal with big boxen, my trusty old Q6600
> will likely get all upset with some localhost stuff.
> 
> -Mike
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-21  9:08           ` Michael Wang
@ 2013-02-21  9:43             ` Mike Galbraith
  2013-02-22  2:36               ` Michael Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Mike Galbraith @ 2013-02-21  9:43 UTC (permalink / raw)
  To: Michael Wang
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On Thu, 2013-02-21 at 17:08 +0800, Michael Wang wrote:

> But does this patch set really cause a regression on your Q6600? It may
> sacrifice some things, but I still think it will benefit far more,
> especially on huge systems.

We spread on FORK/EXEC, and will no longer pull communicating tasks
back to a shared cache with the new logic preferring to leave the wakee
remote, so while no, I haven't tested it (will try to find a round tuit),
it seems it _must_ hurt.  Dragging data from one llc to the other on Q6600
hurts a LOT.  Every time a client and server are cross llc, it's a huge
hit.  The previous logic pulled communicating tasks together right when
it matters the most, intermittent load... or interactive use.

-Mike


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-21  4:51   ` Michael Wang
  2013-02-21  6:11     ` Mike Galbraith
@ 2013-02-21 10:20     ` Peter Zijlstra
  2013-02-22  2:37       ` Michael Wang
  1 sibling, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2013-02-21 10:20 UTC (permalink / raw)
  To: Michael Wang
  Cc: Ingo Molnar, LKML, Paul Turner, Mike Galbraith, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote:
> The old logic when locating affine_sd is:
> 
>         if prev_cpu != curr_cpu
>                 if wake_affine()
>                         prev_cpu = curr_cpu
>         new_cpu = select_idle_sibling(prev_cpu)
>         return new_cpu
> 
> The new logic is the same as the old one if prev_cpu == curr_cpu, so
> let's simplify the old logic like:
> 
>         if wake_affine()
>                 new_cpu = select_idle_sibling(curr_cpu)
>         else
>                 new_cpu = select_idle_sibling(prev_cpu)
> 
>         return new_cpu
> 
> Actually that doesn't make sense.

It does :-)

> I think wake_affine() is trying to check whether moving a task from
> prev_cpu to curr_cpu will break the balance in affine_sd or not, but why
> does not breaking the balance mean curr_cpu is better than prev_cpu for
> searching for an idle cpu?

It doesn't, the whole affine wakeup stuff is meant to pull waking tasks
towards the cpu that does the wakeup; we limit this by putting bounds on
the imbalance this may create.

The reason we want to run tasks on the cpu that does the wakeup is
because that cpu 'obviously' is running something related and it seems
like a good idea to run related tasks close together.

So look at affine wakeups as a force that groups related tasks.
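
(For illustration, a greatly simplified sketch of that bound; the real
wake_affine() also weighs cgroup effective load and cpu power, so the
variable names and arithmetic here are an approximation only:)

	/* pull the wakee to the waking cpu only if that cpu, with the
	 * wakee's load added, stays within imbalance_pct of prev_cpu */
	this_eff_load = this_load + wakee_load;
	prev_eff_load = prev_load * sd->imbalance_pct / 100;

	pull_to_waker = (this_eff_load <= prev_eff_load);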

> So the new logical in this patch set is:
> 
>         new_cpu = select_idle_sibling(prev_cpu)
>         if idle_cpu(new_cpu)
>                 return new_cpu
> 
>         new_cpu = select_idle_sibling(curr_cpu)
>         if idle_cpu(new_cpu) {
>                 if wake_affine()
>                         return new_cpu
>         }
> 
>         return prev_cpu
> 
> And now, unless we are really going to move load from prev_cpu to
> curr_cpu, we won't use wake_affine() any more.

That completely breaks stuff, not cool.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 1/3] sched: schedule balance map foundation
  2013-02-21  4:58     ` Michael Wang
@ 2013-02-21 11:37       ` Peter Zijlstra
  2013-02-22  2:53         ` Michael Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2013-02-21 11:37 UTC (permalink / raw)
  To: Michael Wang
  Cc: LKML, Ingo Molnar, Paul Turner, Mike Galbraith, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On Thu, 2013-02-21 at 12:58 +0800, Michael Wang wrote:
> 
> You are right, it costs space in order to accelerate the system; I've
> calculated the cost once before (I'm really not good at this, please
> let me know if I made any silly calculation...),

The exact size isn't that important, but it's trivial to see it's quadratic.
You have a NR_CPUS array per-cpu, thus O(n^2).

( side note; invest in getting good at complexity analysis -- or at
least competent, it's the single most important aspect of programming. )

...

> And the final cost is 3000 ints and 1030000 pointers, plus some padding,
> but won't be bigger than 10M; not a big deal for a system with 1000 cpus
> either.

Maybe, but quadratic stuff should be frowned upon at all times, these
things tend to explode when you least expect it.

For instance, IIRC the biggest single image system SGI booted had 16k
cpus in there; that ends up at something like 14+14+3=31 bits (2^14 cpus,
each with a 2^14-entry array of 2^3-byte pointers), aka 2G of storage
just for your lookup -- that seems somewhat preposterous.

The domain levels are roughly O(log n) related to the total cpus, so
what you're doing is replacing an O(log n) lookup with an O(1) lookup,
but at the cost of O(n^2) storage. 
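
(A sketch of that trade-off, for illustration only; sbm_of(), sd_flag_idx
and level are invented names, not the patch's actual layout:)

	/* today: an O(log n) walk up the domain tree on every wakeup */
	for_each_domain(cpu, tmp) {
		if (tmp->flags & sd_flag)
			sd = tmp;
	}

	/* with a precomputed per-cpu map the walk becomes one load; it
	 * is the storage for all cpus' maps that grows quadratically */
	sd = sbm_of(cpu)->sd[sd_flag_idx][level];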




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-21  9:43             ` Mike Galbraith
@ 2013-02-22  2:36               ` Michael Wang
  2013-02-22  5:02                 ` Mike Galbraith
  2013-02-22  8:21                 ` Peter Zijlstra
  0 siblings, 2 replies; 55+ messages in thread
From: Michael Wang @ 2013-02-22  2:36 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On 02/21/2013 05:43 PM, Mike Galbraith wrote:
> On Thu, 2013-02-21 at 17:08 +0800, Michael Wang wrote:
> 
>> But does this patch set really cause a regression on your Q6600? It may
>> sacrifice some things, but I still think it will benefit far more,
>> especially on huge systems.
> 
> We spread on FORK/EXEC, and will no longer pull communicating tasks
> back to a shared cache with the new logic preferring to leave the wakee
> remote, so while no, I haven't tested it (will try to find a round tuit),
> it seems it _must_ hurt.  Dragging data from one llc to the other on Q6600
> hurts a LOT.  Every time a client and server are cross llc, it's a huge
> hit.  The previous logic pulled communicating tasks together right when
> it matters the most, intermittent load... or interactive use.

I agree that this is a problem that needs to be solved, but I don't agree
that wake_affine() is the solution.

According to my understanding, in the old world, wake_affine() will only
be used if curr_cpu and prev_cpu share cache, which means they are in
one package; whether we search the llc sd of curr_cpu or of prev_cpu, we
won't have a chance to spread the task out of that package.

I'm going to restore the logic that only does select_idle_sibling() when
prev_cpu and curr_cpu are affine, so the new logic will only prefer
leaving the task in its old package if both prev_cpu and curr_cpu are in
that package. I think this could solve the problem, couldn't it?

Regards,
Michael Wang



> 
> -Mike
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-21 10:20     ` Peter Zijlstra
@ 2013-02-22  2:37       ` Michael Wang
  2013-02-22  5:08         ` Mike Galbraith
  2013-02-22  8:36         ` Peter Zijlstra
  0 siblings, 2 replies; 55+ messages in thread
From: Michael Wang @ 2013-02-22  2:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Paul Turner, Mike Galbraith, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On 02/21/2013 06:20 PM, Peter Zijlstra wrote:
> On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote:
>> The old logic when locating affine_sd is:
>>
>>         if prev_cpu != curr_cpu
>>                 if wake_affine()
>>                         prev_cpu = curr_cpu
>>         new_cpu = select_idle_sibling(prev_cpu)
>>         return new_cpu
>>
>> The new logic is the same as the old one if prev_cpu == curr_cpu, so
>> let's simplify the old logic like:
>>
>>         if wake_affine()
>>                 new_cpu = select_idle_sibling(curr_cpu)
>>         else
>>                 new_cpu = select_idle_sibling(prev_cpu)
>>
>>         return new_cpu
>>
>> Actually that doesn't make sense.
> 
> It does :-)
> 
>> I think wake_affine() is trying to check whether moving a task from
>> prev_cpu to curr_cpu will break the balance in affine_sd or not, but why
>> does not breaking the balance mean curr_cpu is better than prev_cpu for
>> searching for an idle cpu?
> 
> It doesn't, the whole affine wakeup stuff is meant to pull waking tasks
> towards the cpu that does the wakeup; we limit this by putting bounds on
> the imbalance this may create.
> 
> The reason we want to run tasks on the cpu that does the wakeup is
> because that cpu 'obviously' is running something related and it seems
> like a good idea to run related tasks close together.
> 
> So look at affine wakeups as a force that groups related tasks.

That's right, and it's a point I missed when judging
wake_affine()...

But that's really a benefit that is hard to estimate, and especially
when the workload is heavy, the cost of wake_affine() calculating the
se's one by one is very high. Is that worth paying for a benefit we
cannot promise?

According to the testing result, I cannot agree that this purpose of
wake_affine() benefits us, but I'm sure that wake_affine() is a terrible
performance killer when the system is busy.

> 
>> So the new logical in this patch set is:
>>
>>         new_cpu = select_idle_sibling(prev_cpu)
>>         if idle_cpu(new_cpu)
>>                 return new_cpu
>>
>>         new_cpu = select_idle_sibling(curr_cpu)
>>         if idle_cpu(new_cpu) {
>>                 if wake_affine()
>>                         return new_cpu
>>         }
>>
>>         return prev_cpu
>>
>> And now, unless we are really going to move load from prev_cpu to
>> curr_cpu, we won't use wake_affine() any more.
> 
> That completely breaks stuff, not cool.

Could you please give more details on which point you think is bad?

Regards,
Michael Wang

> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 1/3] sched: schedule balance map foundation
  2013-02-21 11:37       ` Peter Zijlstra
@ 2013-02-22  2:53         ` Michael Wang
  2013-02-22  3:33           ` Alex Shi
  0 siblings, 1 reply; 55+ messages in thread
From: Michael Wang @ 2013-02-22  2:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Paul Turner, Mike Galbraith, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On 02/21/2013 07:37 PM, Peter Zijlstra wrote:
> On Thu, 2013-02-21 at 12:58 +0800, Michael Wang wrote:
>>
>> You are right, it costs space in order to accelerate the system; I've
>> calculated the cost once before (I'm really not good at this, please
>> let me know if I made any silly calculation...),
> 
> The exact size isn't that important, but it's trivial to see it's
> quadratic. You have a NR_CPUS array per-cpu, thus O(n^2).
> 
> ( side note; invest in getting good at complexity analysis -- or at
> least competent, it's the single most important aspect of programming. )
> 
> ...
> 
>> And the final cost is 3000 ints and 1030000 pointers, plus some padding,
>> but won't be bigger than 10M; not a big deal for a system with 1000 cpus
>> either.
> 
> Maybe, but quadratic stuff should be frowned upon at all times, these
> things tend to explode when you least expect it.
> 
> For instance, IIRC the biggest single image system SGI booted had 16k
> cpus in there; that ends up at something like 14+14+3=31 bits (2^14 cpus,
> each with a 2^14-entry array of 2^3-byte pointers), aka 2G of storage
> just for your lookup -- that seems somewhat preposterous.

Honestly, if I were an admin who owned a 16k-cpu system (I cannot even
imagine how much memory it would have...), I would really prefer to
exchange 2G of memory for some performance.

I see your point here: the cost in space grows quadratically, but the
memory of such systems also grows, and according to my understanding,
it grows faster.

Regards,
Michael Wang

> 
> The domain levels are roughly O(log n) related to the total cpus, so
> what you're doing is replacing an O(log n) lookup with an O(1) lookup,
> but at the cost of O(n^2) storage. 
> 
> 
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 1/3] sched: schedule balance map foundation
  2013-02-22  2:53         ` Michael Wang
@ 2013-02-22  3:33           ` Alex Shi
  2013-02-22  4:19             ` Michael Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Alex Shi @ 2013-02-22  3:33 UTC (permalink / raw)
  To: Michael Wang
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Paul Turner, Mike Galbraith,
	Andrew Morton, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On 02/22/2013 10:53 AM, Michael Wang wrote:
>> > 
>>> >> And the final cost is 3000 ints and 1030000 pointers, plus some padding,
>>> >> but won't be bigger than 10M; not a big deal for a system with 1000 cpus
>>> >> either.
>> > 
>> > Maybe, but quadratic stuff should be frowned upon at all times, these
>> > things tend to explode when you least expect it.
>> > 
>> > For instance, IIRC the biggest single image system SGI booted had 16k
>> > cpus in there; that ends up at something like 14+14+3=31 bits (2^14 cpus,
>> > each with a 2^14-entry array of 2^3-byte pointers), aka 2G of storage
>> > just for your lookup -- that seems somewhat preposterous.
> Honestly, if I were an admin who owned a 16k-cpu system (I cannot even
> imagine how much memory it would have...), I would really prefer to
> exchange 2G of memory for some performance.
> 
> I see your point here: the cost in space grows quadratically, but the
> memory of such systems also grows, and according to my understanding,
> it grows faster.

Why not seek another way to change O(n^2) to O(n)?

Accessing 2G of memory is an unbelievable performance cost.

There are too many jokes about short-sightedness on compute scalability,
like the apocryphal Gates line that 640K of memory ought to be enough.

-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 1/3] sched: schedule balance map foundation
  2013-02-22  3:33           ` Alex Shi
@ 2013-02-22  4:19             ` Michael Wang
  2013-02-22  4:46               ` Alex Shi
  0 siblings, 1 reply; 55+ messages in thread
From: Michael Wang @ 2013-02-22  4:19 UTC (permalink / raw)
  To: Alex Shi
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Paul Turner, Mike Galbraith,
	Andrew Morton, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On 02/22/2013 11:33 AM, Alex Shi wrote:
> On 02/22/2013 10:53 AM, Michael Wang wrote:
>>>>
>>>>>> And the final cost is 3000 ints and 1030000 pointers, plus some padding,
>>>>>> but won't be bigger than 10M; not a big deal for a system with 1000 cpus
>>>>>> either.
>>>>
>>>> Maybe, but quadratic stuff should be frowned upon at all times, these
>>>> things tend to explode when you least expect it.
>>>>
>>>> For instance, IIRC the biggest single image system SGI booted had 16k
>>>> cpus in there; that ends up at something like 14+14+3=31 bits (2^14 cpus,
>>>> each with a 2^14-entry array of 2^3-byte pointers), aka 2G of storage
>>>> just for your lookup -- that seems somewhat preposterous.
>> Honestly, if I were an admin who owned a 16k-cpu system (I cannot even
>> imagine how much memory it would have...), I would really prefer to
>> exchange 2G of memory for some performance.
>>
>> I see your point here: the cost in space grows quadratically, but the
>> memory of such systems also grows, and according to my understanding,
>> it grows faster.
> 

Hi, Alex

Thanks for your reply.

> Why not seek another way to change O(n^2) to O(n)?
> 
> Accessing 2G of memory is an unbelievable performance cost.

Not accessing 2G of memory, but (2G / 16K = 128K) per cpu; the sbm size
per cpu is O(N).

And please notice that on a 16k-cpu system the topology will be deep if
NUMA is enabled (O(log N) levels, as Peter said), and that's really a
good stage for this idea to perform on; we could save lots of recursive
'for' cycles.

> 
> There are too many jokes about short-sightedness on compute scalability,
> like the apocryphal Gates line that 640K of memory ought to be enough.

Please do believe me that I won't give up any chance to solve or lighten
this issue (like applying Mike's suggestion), and please let me know if
you have any suggestions to reduce the memory cost.

Maybe I could make this idea an option: override select_task_rq_fair()
when people want the new logic, and if they don't want to trade memory
for it, just !CONFIG it.

Regards,
Michael Wang

> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 1/3] sched: schedule balance map foundation
  2013-02-22  4:19             ` Michael Wang
@ 2013-02-22  4:46               ` Alex Shi
  2013-02-22  5:05                 ` Michael Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Alex Shi @ 2013-02-22  4:46 UTC (permalink / raw)
  To: Michael Wang
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Paul Turner, Mike Galbraith,
	Andrew Morton, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On 02/22/2013 12:19 PM, Michael Wang wrote:
> 
>> > Why not seek another way to change O(n^2) to O(n)?
>> > 
>> > Accessing 2G of memory is an unbelievable performance cost.
> Not accessing 2G of memory, but (2G / 16K = 128K) per cpu; the sbm size
> per cpu is O(N).
> 
> And please notice that on a 16k-cpu system the topology will be deep if
> NUMA is enabled (O(log N) levels, as Peter said), and that's really a
> good stage for this idea to perform on; we could save lots of recursive
> 'for' cycles.
> 

The CPU execution part is very, very fast compared to memory access; the
cost of the 'for' cycles is mostly in the memory accesses for the many
domain/group data structures, not in instruction execution.

In a hot path, accessing several KB of memory will cause clear cpu cache
pollution and slow the kernel down.

-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-22  2:36               ` Michael Wang
@ 2013-02-22  5:02                 ` Mike Galbraith
  2013-02-22  5:26                   ` Michael Wang
  2013-02-22  6:42                   ` Michael Wang
  2013-02-22  8:21                 ` Peter Zijlstra
  1 sibling, 2 replies; 55+ messages in thread
From: Mike Galbraith @ 2013-02-22  5:02 UTC (permalink / raw)
  To: Michael Wang
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote: 
> On 02/21/2013 05:43 PM, Mike Galbraith wrote:
> > On Thu, 2013-02-21 at 17:08 +0800, Michael Wang wrote:
> > 
> >> But does this patch set really cause a regression on your Q6600? It may
> >> sacrifice some things, but I still think it will benefit far more,
> >> especially on huge systems.
> > 
> > We spread on FORK/EXEC, and will no longer pull communicating tasks
> > back to a shared cache with the new logic preferring to leave the wakee
> > remote, so while no, I haven't tested it (will try to find a round tuit),
> > it seems it _must_ hurt.  Dragging data from one llc to the other on Q6600
> > hurts a LOT.  Every time a client and server are cross llc, it's a huge
> > hit.  The previous logic pulled communicating tasks together right when
> > it matters the most, intermittent load... or interactive use.
> 
> I agree that this is a problem that needs to be solved, but I don't agree
> that wake_affine() is the solution.

It's not perfect, but it's better than no countering force at all.  It's
a relic of the dark ages, when affine meant L2, ie this cpu.  Nowadays,
affine has a whole new meaning, L3, so it could be done differently, but
_some_ kind of opposing force is required.

> According to my understanding, in the old world, wake_affine() will only
> be used if curr_cpu and prev_cpu share cache, which means they are in
> one package; whether we search the llc sd of curr_cpu or of prev_cpu, we
> won't have a chance to spread the task out of that package.

? affine_sd is the first domain spanning both cpus, that may be NODE.
True we won't ever spread in the wakeup path unless SD_WAKE_BALANCE is
set that is.  Would be nice to be able to do that without shredding
performance.

Off the top of my pointy head, I can think of a way to _maybe_ improve
the "affine" wakeup criteria:  Add a small (package size? and very fast)
FIFO queue to task struct, record waker/wakee relationship.  If
relationship exists in that queue (rbtree), try to wake local, if not,
wake remote.  The thought is to identify situations ala 1:N pgbench
where you really need to keep the load spread.  That need arises when
the sum wakees + waker won't fit in one cache.  True buddies would
always hit (hm, hit rate), always try to become affine where they
thrive.  1:N stuff starts missing when client count exceeds package
size, starts expanding its horizons. 'Course you would still need to
NAK if imbalanced too badly, and let NUMA stuff NAK touching lard-balls
and whatnot.  With a little more smarts, we could have happy 1:N, and
buddies don't have to chat through 2m thick walls to make 1:N scale as
well as it can before it dies of stupidity.
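
(Purely illustrative: one way such a per-task waker/wakee cache might
look. The type, size and helpers are invented for this sketch -- a plain
linear scan stands in for the rbtree mentioned above:)

	#define BUDDY_FIFO	8	/* ~package size */

	struct task_buddies {
		int pid[BUDDY_FIFO];	/* recent waker pids */
		int head;
	};

	static bool is_buddy(struct task_buddies *tb, int waker)
	{
		int i;

		for (i = 0; i < BUDDY_FIFO; i++)
			if (tb->pid[i] == waker)
				return true;	/* known buddy: try local */
		return false;			/* stranger: wake remote */
	}

	static void record_waker(struct task_buddies *tb, int waker)
	{
		tb->pid[tb->head] = waker;	/* overwrite oldest entry */
		tb->head = (tb->head + 1) % BUDDY_FIFO;
	}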

-Mike


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 1/3] sched: schedule balance map foundation
  2013-02-22  4:46               ` Alex Shi
@ 2013-02-22  5:05                 ` Michael Wang
  0 siblings, 0 replies; 55+ messages in thread
From: Michael Wang @ 2013-02-22  5:05 UTC (permalink / raw)
  To: Alex Shi
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Paul Turner, Mike Galbraith,
	Andrew Morton, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On 02/22/2013 12:46 PM, Alex Shi wrote:
> On 02/22/2013 12:19 PM, Michael Wang wrote:
>>
>>>> Why not seek another way to change O(n^2) to O(n)?
>>>>
>>>> Accessing 2G of memory is an unbelievable performance cost.
>> Not accessing 2G of memory, but (2G / 16K = 128K) per cpu; the sbm size
>> per cpu is O(N).
>>
>> And please notice that on a 16k-cpu system the topology will be deep if
>> NUMA is enabled (O(log N) levels, as Peter said), and that's really a
>> good stage for this idea to perform on; we could save lots of recursive
>> 'for' cycles.
>>
> 
> The CPU execution part is very, very fast compared to memory access; the
> cost of the 'for' cycles is mostly in the memory accesses for the many
> domain/group data structures, not in instruction execution.
> 
> In a hot path, accessing several KB of memory will cause clear cpu cache
> pollution and slow the kernel down.

Hmm... that's a good catch.

Comparing memory access and cpu execution, no doubt the latter wins;
you are right.

But it was the same in the old world when accessing struct sched_domain,
wasn't it?

                for_each_domain(cpu, tmp) {
                        /* walk up the domain tree from cpu */
                        if (weight <= tmp->span_weight)
                                break;  /* stop at a wide enough level */
                        if (tmp->flags & sd_flag)
                                sd = tmp;       /* highest level with the flag */
                }

Both the old and the new code may access data across nodes, but the old
one accesses it several times more, doesn't it?

Regards,
Michael Wang

> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-22  2:37       ` Michael Wang
@ 2013-02-22  5:08         ` Mike Galbraith
  2013-02-22  6:06           ` Michael Wang
  2013-02-22  8:36         ` Peter Zijlstra
  1 sibling, 1 reply; 55+ messages in thread
From: Mike Galbraith @ 2013-02-22  5:08 UTC (permalink / raw)
  To: Michael Wang
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:

> According to the testing result, I cannot agree that this purpose of
> wake_affine() benefits us, but I'm sure that wake_affine() is a terrible
> performance killer when the system is busy.

(hm, result is singular.. pgbench in 1:N mode only?)


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-22  5:02                 ` Mike Galbraith
@ 2013-02-22  5:26                   ` Michael Wang
  2013-02-22  6:13                     ` Mike Galbraith
  2013-02-22  6:42                   ` Michael Wang
  1 sibling, 1 reply; 55+ messages in thread
From: Michael Wang @ 2013-02-22  5:26 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On 02/22/2013 01:02 PM, Mike Galbraith wrote:
> On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote: 
>> On 02/21/2013 05:43 PM, Mike Galbraith wrote:
>>> On Thu, 2013-02-21 at 17:08 +0800, Michael Wang wrote:
>>>
>>>> But does this patch set really cause a regression on your Q6600? It may
>>>> sacrifice some things, but I still think it will benefit far more,
>>>> especially on huge systems.
>>>
>>> We spread on FORK/EXEC, and will no longer pull communicating tasks
>>> back to a shared cache with the new logic preferring to leave the wakee
>>> remote, so while no, I haven't tested it (will try to find a round tuit),
>>> it seems it _must_ hurt.  Dragging data from one llc to the other on Q6600
>>> hurts a LOT.  Every time a client and server are cross llc, it's a huge
>>> hit.  The previous logic pulled communicating tasks together right when
>>> it matters the most, intermittent load... or interactive use.
>>
>> I agree that this is a problem that needs to be solved, but I don't agree
>> that wake_affine() is the solution.
> 
> It's not perfect, but it's better than no countering force at all.  It's
> a relic of the dark ages, when affine meant L2, ie this cpu.  Nowadays,
> affine has a whole new meaning, L3, so it could be done differently, but
> _some_ kind of opposing force is required.
> 
>> According to my understanding, in the old world, wake_affine() will only
>> be used if curr_cpu and prev_cpu share cache, which means they are in
>> one package; whether we search the llc sd of curr_cpu or of prev_cpu, we
>> won't have a chance to spread the task out of that package.
> 
> ? affine_sd is the first domain spanning both cpus, that may be NODE.
> True we won't ever spread in the wakeup path unless SD_WAKE_BALANCE is
> set that is.  Would be nice to be able to do that without shredding
> performance.
> 
> Off the top of my pointy head, I can think of a way to _maybe_ improve
> the "affine" wakeup criteria:  Add a small (package size? and very fast)
> FIFO queue to task struct, record waker/wakee relationship.  If
> relationship exists in that queue (rbtree), try to wake local, if not,
> wake remote.  The thought is to identify situations ala 1:N pgbench
> where you really need to keep the load spread.  That need arises when
> the sum wakees + waker won't fit in one cache.  True buddies would
> always hit (hm, hit rate), always try to become affine where they
> thrive.  1:N stuff starts missing when client count exceeds package
> size, starts expanding its horizons. 'Course you would still need to
> NAK if imbalanced too badly, and let NUMA stuff NAK touching lard-balls
> and whatnot.  With a little more smarts, we could have happy 1:N, and
> buddies don't have to chat through 2m thick walls to make 1:N scale as
> well as it can before it dies of stupidity.

Just to confirm that I'm not on the wrong track: does the 1:N mode here
mean 1 task forks N threads, and the children always talk with the parent?

Regards,
Michael Wang

> 
> -Mike
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-22  5:08         ` Mike Galbraith
@ 2013-02-22  6:06           ` Michael Wang
  2013-02-22  6:19             ` Mike Galbraith
  0 siblings, 1 reply; 55+ messages in thread
From: Michael Wang @ 2013-02-22  6:06 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On 02/22/2013 01:08 PM, Mike Galbraith wrote:
> On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
> 
>> According to the testing result, I cannot agree that this purpose of
>> wake_affine() benefits us, but I'm sure that wake_affine() is a terrible
>> performance killer when the system is busy.
> 
> (hm, result is singular.. pgbench in 1:N mode only?)

I'm not sure how pgbench is implemented; all I know is that it creates
several instances that access the database, which I suppose is no
different from several threads accessing a database (1 server and N
clients?).

There is an improvement since, when the system is busy, wake_affine()
will be skipped.

And in the old world, when the system was busy, wake_affine() would only
be skipped if prev_cpu and curr_cpu belonged to different nodes.

Regards,
Michael Wang

> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-22  5:26                   ` Michael Wang
@ 2013-02-22  6:13                     ` Mike Galbraith
  0 siblings, 0 replies; 55+ messages in thread
From: Mike Galbraith @ 2013-02-22  6:13 UTC (permalink / raw)
  To: Michael Wang
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On Fri, 2013-02-22 at 13:26 +0800, Michael Wang wrote:

> Just to confirm that I'm not on the wrong track: does the 1:N mode here
> mean 1 task forks N threads, and the children always talk with the parent?

Yes, one server, many clients.

-Mike


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-22  6:06           ` Michael Wang
@ 2013-02-22  6:19             ` Mike Galbraith
  0 siblings, 0 replies; 55+ messages in thread
From: Mike Galbraith @ 2013-02-22  6:19 UTC (permalink / raw)
  To: Michael Wang
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On Fri, 2013-02-22 at 14:06 +0800, Michael Wang wrote: 
> On 02/22/2013 01:08 PM, Mike Galbraith wrote:
> > On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
> > 
> >> According to the testing result, I cannot agree that this purpose of
> >> wake_affine() benefits us, but I'm sure that wake_affine() is a terrible
> >> performance killer when the system is busy.
> > 
> > (hm, result is singular.. pgbench in 1:N mode only?)
> 
> I'm not sure how pgbench is implemented; all I know is that it creates
> several instances that access the database, which I suppose is no
> different from several threads accessing a database (1 server and N
> clients?).

It's user switchable.

-Mike


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-22  5:02                 ` Mike Galbraith
  2013-02-22  5:26                   ` Michael Wang
@ 2013-02-22  6:42                   ` Michael Wang
  2013-02-22  8:17                     ` Mike Galbraith
  1 sibling, 1 reply; 55+ messages in thread
From: Michael Wang @ 2013-02-22  6:42 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On 02/22/2013 01:02 PM, Mike Galbraith wrote:
> On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote: 
>> On 02/21/2013 05:43 PM, Mike Galbraith wrote:
>>> On Thu, 2013-02-21 at 17:08 +0800, Michael Wang wrote:
>>>
>>>> But does this patch set really cause a regression on your Q6600? It may
>>>> sacrifice some things, but I still think it will benefit far more,
>>>> especially on huge systems.
>>>
>>> We spread on FORK/EXEC, and will no longer pull communicating tasks
>>> back to a shared cache with the new logic preferring to leave the wakee
>>> remote, so while no, I haven't tested it (will try to find a round tuit),
>>> it seems it _must_ hurt.  Dragging data from one llc to the other on Q6600
>>> hurts a LOT.  Every time a client and server are cross llc, it's a huge
>>> hit.  The previous logic pulled communicating tasks together right when
>>> it matters the most, intermittent load... or interactive use.
>>
>> I agree that this is a problem that needs to be solved, but I don't agree
>> that wake_affine() is the solution.
> 
> It's not perfect, but it's better than no countering force at all.  It's
> a relic of the dark ages, when affine meant L2, ie this cpu.  Nowadays,
> affine has a whole new meaning, L3, so it could be done differently, but
> _some_ kind of opposing force is required.
> 
>> According to my understanding, in the old world, wake_affine() will only
>> be used if curr_cpu and prev_cpu share cache, which means they are in
>> one package; whether we search the llc sd of curr_cpu or of prev_cpu, we
>> won't have a chance to spread the task out of that package.
> 
> ? affine_sd is the first domain spanning both cpus, that may be NODE.
> True we won't ever spread in the wakeup path unless SD_WAKE_BALANCE is
> set that is.  Would be nice to be able to do that without shredding
> performance.

That's right, we need two conditions in each select instance:
1. prev_cpu and curr_cpu are not affine
2. SD_WAKE_BALANCE

> 
> Off the top of my pointy head, I can think of a way to _maybe_ improve
> the "affine" wakeup criteria:  Add a small (package size? and very fast)
> FIFO queue to task struct, record waker/wakee relationship.  If
> relationship exists in that queue (rbtree), try to wake local, if not,
> wake remote.  The thought is to identify situations ala 1:N pgbench
> where you really need to keep the load spread.  That need arises when
> the sum wakees + waker won't fit in one cache.  True buddies would
> always hit (hm, hit rate), always try to become affine where they
> thrive.  1:N stuff starts missing when client count exceeds package
> size, starts expanding its horizons. 'Course you would still need to
> NAK if imbalanced too badly, and let NUMA stuff NAK touching lard-balls
> and whatnot.  With a little more smarts, we could have happy 1:N, and
> buddies don't have to chat through 2m thick walls to make 1:N scale as
> well as it can before it dies of stupidity.

So this is trying to take care of the case where curr_cpu (local) and
prev_cpu (remote) are on different nodes, for which, in the old world,
wake_affine() wouldn't be invoked, correct?

Hmm... I think this may be a good additional check before entering the
balance path, but I cannot estimate the cost of recording the
relationship at this moment...

Anyway, after applying the affine logic to the new world, it will gain
the ability to spread tasks across nodes just like the old world. Your
idea may be an optimization, but that logic is outside the changes in
this patch set, which means that if it benefits, the beneficiary will be
not only the new world but also the old one.

Regards,
Michael Wang

> 
> -Mike
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-22  6:42                   ` Michael Wang
@ 2013-02-22  8:17                     ` Mike Galbraith
  2013-02-22  8:35                       ` Michael Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Mike Galbraith @ 2013-02-22  8:17 UTC (permalink / raw)
  To: Michael Wang
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On Fri, 2013-02-22 at 14:42 +0800, Michael Wang wrote:

> So this is trying to take care of the case where curr_cpu (local) and
> prev_cpu (remote) are on different nodes, for which, in the old world,
> wake_affine() wouldn't be invoked, correct?

It'll be called any time this_cpu and prev_cpu aren't one and the same.
It'd be pretty silly to ask whether to pull_here or leave_there when
here and there are identical.

> Hmm... I think this may be a good additional check before entering the
> balance path, but I cannot estimate the cost of recording the
> relationship at this moment...

It'd be pretty cheap, but I'd hate adding any cycles to the fast path
unless those cycles have one hell of a good payoff, so the caching would
have to show most excellent cold hard numbers (talk crazy ideas walk;).

-Mike


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-22  2:36               ` Michael Wang
  2013-02-22  5:02                 ` Mike Galbraith
@ 2013-02-22  8:21                 ` Peter Zijlstra
  2013-02-22  9:10                   ` Michael Wang
  1 sibling, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2013-02-22  8:21 UTC (permalink / raw)
  To: Michael Wang
  Cc: Mike Galbraith, Ingo Molnar, LKML, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote:
> According to my understanding, in the old world, wake_affine() will only
> be used if curr_cpu and prev_cpu share cache, which means they are in
> one package; whether we search the llc sd of curr_cpu or of prev_cpu, we
> won't have a chance to spread the task out of that package.

Nah, look at where SD_WAKE_AFFINE is set. Only 'remote/big' NUMA domains
don't have it set, but 'small' NUMA systems will have it set over the
entire domain tree.




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-22  8:17                     ` Mike Galbraith
@ 2013-02-22  8:35                       ` Michael Wang
  0 siblings, 0 replies; 55+ messages in thread
From: Michael Wang @ 2013-02-22  8:35 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On 02/22/2013 04:17 PM, Mike Galbraith wrote:
> On Fri, 2013-02-22 at 14:42 +0800, Michael Wang wrote:
> 
>> So this is trying to take care of the case where curr_cpu (local) and
>> prev_cpu (remote) are on different nodes, for which, in the old world,
>> wake_affine() wouldn't be invoked, correct?
> 
> It'll be called any time this_cpu and prev_cpu aren't one and the same.
> It'd be pretty silly to ask whether to pull_here or leave_there when
> here and there are identical.

Agree :)

> 
>> Hmm... I think this may be a good additional check before entering the
>> balance path, but I cannot estimate the cost of recording the
>> relationship at this moment...
> 
> It'd be pretty cheap, but I'd hate adding any cycles to the fast path
> unless those cycles have one hell of a good payoff, so the caching would
> have to show most excellent cold hard numbers (talk crazy ideas walk;).

It sounds like a good idea; I'm not sure whether it's cheap or how much
benefit we could gain, but it's worth some research.

I will think more about it after finishing the sbm work.

Regards,
Michael Wang

> 
> -Mike
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-22  2:37       ` Michael Wang
  2013-02-22  5:08         ` Mike Galbraith
@ 2013-02-22  8:36         ` Peter Zijlstra
  2013-02-22  9:11           ` Michael Wang
  2013-02-22  9:40           ` Mike Galbraith
  1 sibling, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2013-02-22  8:36 UTC (permalink / raw)
  To: Michael Wang
  Cc: Ingo Molnar, LKML, Paul Turner, Mike Galbraith, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
> But that's really a benefit that is hard to estimate, and especially
> when the workload is heavy, the cost of wake_affine() calculating the
> se's one by one is very high. Is that worth paying for a benefit we
> cannot promise?

Look at something like pipe-test.. wake_affine() used to ensure both
client/server ran on the same cpu, but then I think we added
select_idle_sibling() and wrecked it again :/

$ taskset 1 perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks

     Total time: 3.761 [sec]

       3.761725 usecs/op
         265835 ops/sec

$ perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks

     Total time: 29.809 [sec]

      29.809720 usecs/op
          33546 ops/sec


Now as far as I can see there's two options: either we find there's
absolutely no benefit in wake_affine() as it stands today and we simply
disable/remove it, or we go fix it. What we don't do is completely
wreck it at atrocious cost.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-22  8:21                 ` Peter Zijlstra
@ 2013-02-22  9:10                   ` Michael Wang
  2013-02-22  9:39                     ` Peter Zijlstra
  0 siblings, 1 reply; 55+ messages in thread
From: Michael Wang @ 2013-02-22  9:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Ingo Molnar, LKML, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On 02/22/2013 04:21 PM, Peter Zijlstra wrote:
> On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote:
> >> According to my understanding, in the old world, wake_affine() will only
> >> be used if curr_cpu and prev_cpu share cache, which means they are in
> >> one package; whether we search the llc sd of curr_cpu or of prev_cpu, we
> >> won't have a chance to spread the task out of that package.
> 
> Nah, look at where SD_WAKE_AFFINE is set. Only 'remote/big' NUMA domains
> don't have it set, but 'small' NUMA systems will have it set over the
> entire domain tree.

Oh, I missed that point...

But I don't get the reason to make the NUMA level affine; do cpus in
different nodes share cache? That doesn't make sense...

Regards,
Michael Wang

> 
> 
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-22  8:36         ` Peter Zijlstra
@ 2013-02-22  9:11           ` Michael Wang
  2013-02-22  9:57             ` Peter Zijlstra
  2013-02-22  9:40           ` Mike Galbraith
  1 sibling, 1 reply; 55+ messages in thread
From: Michael Wang @ 2013-02-22  9:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Paul Turner, Mike Galbraith, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On 02/22/2013 04:36 PM, Peter Zijlstra wrote:
> On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
>> But that's really a benefit that is hard to estimate, and especially
>> when the workload is heavy, the cost of wake_affine() calculating the
>> se's one by one is very high. Is that worth paying for a benefit we
>> cannot promise?
> 
> Look at something like pipe-test.. wake_affine() used to ensure both
> client/server ran on the same cpu, but then I think we added
> select_idle_sibling() and wrecked it again :/
> 
> $ taskset 1 perf bench sched pipe
> # Running sched/pipe benchmark...
> # Executed 1000000 pipe operations between two tasks
> 
>      Total time: 3.761 [sec]
> 
>        3.761725 usecs/op
>          265835 ops/sec
> 
> $ perf bench sched pipe
> # Running sched/pipe benchmark...
> # Executed 1000000 pipe operations between two tasks
> 
>      Total time: 29.809 [sec]
> 
>       29.809720 usecs/op
>           33546 ops/sec
> 

Ok, it does look like wake_affine() lost its value...

> 
> Now as far as I can see there's two options: either we find there's
> absolutely no benefit in wake_affine() as it stands today and we simply
> disable/remove it, or we go fix it. What we don't do is completely
> wreck it at atrocious cost.

I get your point: we should replace wake_affine() with some feature
which could really achieve the goal of putting client and server on the
same cpu.

But does the logic that the waker/wakee are server/client (or the
reverse) still work now? That sounds a little arbitrary to me...

Regards,
Michael Wang

> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-22  9:10                   ` Michael Wang
@ 2013-02-22  9:39                     ` Peter Zijlstra
  2013-02-22  9:58                       ` Michael Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2013-02-22  9:39 UTC (permalink / raw)
  To: Michael Wang
  Cc: Mike Galbraith, Ingo Molnar, LKML, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On Fri, 2013-02-22 at 17:10 +0800, Michael Wang wrote:
> On 02/22/2013 04:21 PM, Peter Zijlstra wrote:
> > On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote:
> >> According to my understanding, in the old world, wake_affine() will
> >> only
> >> be used if curr_cpu and prev_cpu share cache, which means they are in
> >> one package, whatever search in llc sd of curr_cpu or prev_cpu, we
> >> won't
> >> have the chance to spread the task out of that package.
> > 
> > Nah, look at where SD_WAKE_AFFINE is set. Only 'remote/big' NUMA domains
> > don't have it set, but 'small' NUMA systems will have it set over the
> > entire domain tree.
> 
> Oh, I missed that point...
> 
> But I don't get the reason to make the NUMA level affine; do cpus in
> different nodes share cache? That doesn't make sense...

On the contrary, it makes more sense: the more expensive it is to run
'remote', the better it is to pull 'related' tasks together.




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-22  8:36         ` Peter Zijlstra
  2013-02-22  9:11           ` Michael Wang
@ 2013-02-22  9:40           ` Mike Galbraith
  2013-02-22  9:54             ` Ingo Molnar
  1 sibling, 1 reply; 55+ messages in thread
From: Mike Galbraith @ 2013-02-22  9:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michael Wang, Ingo Molnar, LKML, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On Fri, 2013-02-22 at 09:36 +0100, Peter Zijlstra wrote: 
> On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
> > But that's really a benefit that is hard to estimate, and especially
> > when the workload is heavy, the cost of wake_affine() calculating the
> > se's one by one is very high. Is that worth paying for a benefit we
> > cannot promise?
> 
> Look at something like pipe-test.. wake_affine() used to ensure both
> client/server ran on the same cpu, but then I think we added
> select_idle_sibling() and wrecked it again :/

Yeah, that's the absolute worst case for select_idle_sibling(), 100%
synchronous, absolutely nothing to be gained by cross cpu scheduling.
Fortunately, most tasks do more than that, but nonetheless,
select_idle_sibling() definitely is a two faced little b*tch.  I'd like
to see the evil b*tch die, but something needs to replace its pretty
face.  One thing that you can do is simply not call it when the
context switch rate is incredible.. its job is to recover overlap, if
you're scheduling near your max, there's no win worth the cost.

> $ taskset 1 perf bench sched pipe
> # Running sched/pipe benchmark...
> # Executed 1000000 pipe operations between two tasks
> 
>      Total time: 3.761 [sec]
> 
>        3.761725 usecs/op
>          265835 ops/sec
> 
> $ perf bench sched pipe
> # Running sched/pipe benchmark...
> # Executed 1000000 pipe operations between two tasks
> 
>      Total time: 29.809 [sec]
> 
>       29.809720 usecs/op
>           33546 ops/sec

Gak!  Hm, are you running a kernel without the thinko fix?  It's not
good for this extreme testcase, but it doesn't suck _that_ bad ;-)

nohz isn't exactly your friend with ultra switchers either.

Q6600:
marge:~ # taskset -c 3 perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks

     Total time: 3.395 [sec]

       3.395357 usecs/op
         294519 ops/sec
marge:~ # perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks

     Total time: 4.212 [sec]

       4.212413 usecs/op
         237393 ops/sec

E5620:
rtbox:~ # taskset -c 0 perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks

     Total time: 2.558 [sec]

       2.558237 usecs/op
         390894 ops/sec
rtbox:~ # perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks

     Total time: 4.588 [sec]

       4.588702 usecs/op
         217926 ops/sec


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-22  9:40           ` Mike Galbraith
@ 2013-02-22  9:54             ` Ingo Molnar
  2013-02-22 10:01               ` Mike Galbraith
  0 siblings, 1 reply; 55+ messages in thread
From: Ingo Molnar @ 2013-02-22  9:54 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Michael Wang, LKML, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim


* Mike Galbraith <efault@gmx.de> wrote:

> On Fri, 2013-02-22 at 09:36 +0100, Peter Zijlstra wrote: 
> > On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
> > > But that's really a benefit that is hard to estimate, and especially
> > > when the workload is heavy, the cost of wake_affine() calculating the
> > > se's one by one is very high. Is that worth paying for a benefit we
> > > cannot promise?
> > 
> > Look at something like pipe-test.. wake_affine() used to 
> > ensure both client/server ran on the same cpu, but then I 
> > think we added select_idle_sibling() and wrecked it again :/
> 
> Yeah, that's the absolute worst case for 
> select_idle_sibling(), 100% synchronous, absolutely nothing to 
> be gained by cross cpu scheduling. Fortunately, most tasks do 
> more than that, but nonetheless, select_idle_sibling() 
> definitely is a two faced little b*tch.  I'd like to see the 
> evil b*tch die, but something needs to replace its pretty 
> face.  One thing that you can do is simply not call it when 
> the context switch rate is incredible.. its job is to recover 
> overlap, if you're scheduling near your max, there's no win 
> worth the cost.

Couldn't we make the cutoff dependent on sched_migration_cost? 
If the wakeup comes in faster than that then don't spread.
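
(For illustration, a sketch of such a cutoff in the wakeup path;
sysctl_sched_migration_cost exists, but the per-task last_wakeup
timestamp used here is an assumption made for the sketch:)

	u64 now = sched_clock_cpu(smp_processor_id());
	bool too_fast = now - p->last_wakeup < sysctl_sched_migration_cost;

	p->last_wakeup = now;

	/* waking faster than the migration breakeven: don't spread */
	if (too_fast)
		return prev_cpu;

	return select_idle_sibling(p, prev_cpu);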

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-22  9:11           ` Michael Wang
@ 2013-02-22  9:57             ` Peter Zijlstra
  2013-02-22 10:08               ` Michael Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2013-02-22  9:57 UTC (permalink / raw)
  To: Michael Wang
  Cc: Ingo Molnar, LKML, Paul Turner, Mike Galbraith, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On Fri, 2013-02-22 at 17:11 +0800, Michael Wang wrote:

> Ok, it does look like wake_affine() lost its value...

I'm not sure we can say that based on this one benchmark; there's a
preemption advantage to running on a single cpu for pipe-test as well.
We'd need to create a better benchmark to test this, one that has some
actual data payload and control over the initial spread of the tasks or
so.

> > Now as far as I can see there's two options: either we find there's
> > absolutely no benefit in wake_affine() as it stands today and we simply
> > disable/remove it, or we go fix it. What we don't do is completely
> > wreck it at atrocious cost.
> 
> I get your point: we should replace wake_affine() with some feature
> which could really achieve the goal of putting client and server on the
> same cpu.
> 
> But does the logic that the waker/wakee are server/client (or the
> reverse) still work now? That sounds a little arbitrary to me...

Ah, it's never really been about server/client per se. It's just a
specific example -- one that breaks down with the 1:n pgbench
situation.

Wakeups in general can be considered a relation; suppose a hardware
interrupt receives some data from a device and issues a wakeup to a
task to consume this data. What CPU would be better suited to process
this data than the one where it's already cache hot?


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-22  9:39                     ` Peter Zijlstra
@ 2013-02-22  9:58                       ` Michael Wang
  0 siblings, 0 replies; 55+ messages in thread
From: Michael Wang @ 2013-02-22  9:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Ingo Molnar, LKML, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On 02/22/2013 05:39 PM, Peter Zijlstra wrote:
> On Fri, 2013-02-22 at 17:10 +0800, Michael Wang wrote:
>> On 02/22/2013 04:21 PM, Peter Zijlstra wrote:
>>> On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote:
>>>> According to my understanding, in the old world, wake_affine() will
>>>> only be used if curr_cpu and prev_cpu share cache, which means they
>>>> are in one package; whether we search the llc sd of curr_cpu or
>>>> prev_cpu, we won't have a chance to spread the task out of that
>>>> package.
>>>
>>> Nah, look at where SD_WAKE_AFFINE is set. Only 'remote/big' NUMA domains
>>> don't have it set, but 'small' NUMA systems will have it set over the
>>> entire domain tree.
>>
>> Oh, I missed that point...
>>
>> But I don't get the reason to make the NUMA level affine; do cpus in
>> different nodes share cache? That doesn't make sense...
> 
> On the contrary, it makes more sense: the more expensive it is to run
> 'remote', the better it is to pull 'related' tasks together.

It increases the range that bounds the task, from one node to several,
but it also increases the range of target cpus, from one node's to
several's. I still can't estimate the benefit, but I think I get the
purpose: trying to keep related tasks as close as possible, is that
right?

Let me think about this point; I do believe there will be a better way
to take care of this purpose.
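
For reference, the affine_sd selection we're talking about looks 
roughly like this in the current select_task_rq_fair() - simplified:

	for_each_domain(cpu, tmp) {
		/*
		 * If both cpu and prev_cpu are part of this domain
		 * and it has SD_WAKE_AFFINE set, cpu is a valid
		 * affine target - and on 'small' NUMA boxes that can
		 * hold all the way up to the NUMA level.
		 */
		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
			affine_sd = tmp;
			break;
		}
	}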

Regards,
Michael Wang





* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-22  9:54             ` Ingo Molnar
@ 2013-02-22 10:01               ` Mike Galbraith
  2013-02-22 12:11                 ` Ingo Molnar
  0 siblings, 1 reply; 55+ messages in thread
From: Mike Galbraith @ 2013-02-22 10:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Michael Wang, LKML, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On Fri, 2013-02-22 at 10:54 +0100, Ingo Molnar wrote: 
> * Mike Galbraith <efault@gmx.de> wrote:
> 
> > On Fri, 2013-02-22 at 09:36 +0100, Peter Zijlstra wrote: 
> > > On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
> > > > But that benefit is really hard to estimate, especially when
> > > > the workload is heavy; the cost of wake_affine() calculating
> > > > each se one by one is very high. Is that worth it for a
> > > > benefit we cannot promise?
> > > 
> > > Look at something like pipe-test.. wake_affine() used to 
> > > ensure both client/server ran on the same cpu, but then I 
> > > think we added select_idle_sibling() and wrecked it again :/
> > 
> > Yeah, that's the absolute worst case for 
> > select_idle_sibling(): 100% synchronous, absolutely nothing to 
> > be gained by cross cpu scheduling. Fortunately, most tasks do 
> > more than that, but nonetheless, select_idle_sibling() 
> > definitely is a two faced little b*tch.  I'd like to see the 
> > evil b*tch die, but something needs to replace its pretty 
> > face.  One thing that you can do is simply not call it when 
> > the context switch rate is extremely high.. its job is to 
> > recover overlap; if you're scheduling near your max, there's 
> > no win worth the cost.
> 
> Couldn't we make the cutoff dependent on sched_migration_cost? 
> If the wakeup comes in faster than that then don't spread.

No, that's too high, you lose too much of the pretty face.  It's a real
problem.  On AMD, the breakeven seems to be much higher than on Intel
as well.  My E5620 can turn in a win on both tbench and even netperf
TCP_RR!! iff nohz is throttled.  For the Opterons I've played with,
it's a loser even at tbench context switch rates; it needs to be cut
off earlier.

-Mike



* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-22  9:57             ` Peter Zijlstra
@ 2013-02-22 10:08               ` Michael Wang
  0 siblings, 0 replies; 55+ messages in thread
From: Michael Wang @ 2013-02-22 10:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Paul Turner, Mike Galbraith, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On 02/22/2013 05:57 PM, Peter Zijlstra wrote:
> On Fri, 2013-02-22 at 17:11 +0800, Michael Wang wrote:
> 
>> Ok, it does look like wake_affine() has lost its value...
> 
> I'm not sure we can say that from this one benchmark; there's a
> preemption advantage to running on a single cpu for pipe-test as well.
> We'd need to create a better benchmark to test this, one that has some
> actual data payload and control over the initial spread of the tasks or
> so.
> 
>>> Now as far as I can see there are two options: either we find there's
>>> absolutely no benefit in wake_affine() as it stands today and we simply
>>> disable/remove it, or we go fix it. What we don't do is completely
>>> wreck it at atrocious cost.
>>
>> I get your point, we should replace wake_affine() with some feature
>> which could really achieve the goal of keeping client and server on
>> the same cpu.
>>
>> But does the logic that the waker/wakee are server/client (or
>> reversed) still hold now? That sounds a little arbitrary to me...
> 
> Ah, it's never really been about server/client per se. It's just a
> specific example -- one that breaks down with the 1:n pgbench
> situation.
> 
> Wakeups in general can be considered to be a relation; suppose a
> hardware interrupt that received some data from a device issues a
> wakeup to a task to consume this data. What CPU would be better suited
> to process this data than the one where it's already cache hot?

I see; honestly, I realized that I had underestimated the benefit we
gain from it when I saw your testing results...

We do need some better approach to replace wake_affine(); hmm... I need
a drawing board now...

Regards,
Michael Wang




* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-22 10:01               ` Mike Galbraith
@ 2013-02-22 12:11                 ` Ingo Molnar
  2013-02-22 12:35                   ` Mike Galbraith
  0 siblings, 1 reply; 55+ messages in thread
From: Ingo Molnar @ 2013-02-22 12:11 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Michael Wang, LKML, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim


* Mike Galbraith <efault@gmx.de> wrote:

> On Fri, 2013-02-22 at 10:54 +0100, Ingo Molnar wrote: 
> > * Mike Galbraith <efault@gmx.de> wrote:
> > 
> > > On Fri, 2013-02-22 at 09:36 +0100, Peter Zijlstra wrote: 
> > > > On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
> > > > > But that benefit is really hard to estimate, especially when
> > > > > the workload is heavy; the cost of wake_affine() calculating
> > > > > each se one by one is very high. Is that worth it for a
> > > > > benefit we cannot promise?
> > > > 
> > > > Look at something like pipe-test.. wake_affine() used to 
> > > > ensure both client/server ran on the same cpu, but then I 
> > > > think we added select_idle_sibling() and wrecked it again :/
> > > 
> > > Yeah, that's the absolute worst case for 
> > > select_idle_sibling(): 100% synchronous, absolutely nothing to 
> > > be gained by cross cpu scheduling. Fortunately, most tasks do 
> > > more than that, but nonetheless, select_idle_sibling() 
> > > definitely is a two faced little b*tch.  I'd like to see the 
> > > evil b*tch die, but something needs to replace its pretty 
> > > face.  One thing that you can do is simply not call it when 
> > > the context switch rate is extremely high.. its job is to 
> > > recover overlap; if you're scheduling near your max, there's 
> > > no win worth the cost.
> > 
> > Couldn't we make the cutoff dependent on sched_migration_cost? 
> > If the wakeup comes in faster than that then don't spread.
> 
> No, that's too high, you lose too much of the pretty face. 
> [...]

Then a logical proportion of it - such as half of it?

Thanks,

	Ingo


* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-22 12:11                 ` Ingo Molnar
@ 2013-02-22 12:35                   ` Mike Galbraith
  2013-02-22 13:06                     ` Ingo Molnar
  0 siblings, 1 reply; 55+ messages in thread
From: Mike Galbraith @ 2013-02-22 12:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Michael Wang, LKML, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On Fri, 2013-02-22 at 13:11 +0100, Ingo Molnar wrote: 
> * Mike Galbraith <efault@gmx.de> wrote:
> 
> > On Fri, 2013-02-22 at 10:54 +0100, Ingo Molnar wrote: 
> > > * Mike Galbraith <efault@gmx.de> wrote:
> > > 
> > > > On Fri, 2013-02-22 at 09:36 +0100, Peter Zijlstra wrote: 
> > > > > On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
> > > > > > But that benefit is really hard to estimate, especially when
> > > > > > the workload is heavy; the cost of wake_affine() calculating
> > > > > > each se one by one is very high. Is that worth it for a
> > > > > > benefit we cannot promise?
> > > > > 
> > > > > Look at something like pipe-test.. wake_affine() used to 
> > > > > ensure both client/server ran on the same cpu, but then I 
> > > > > think we added select_idle_sibling() and wrecked it again :/
> > > > 
> > > > Yeah, that's the absolute worst case for 
> > > > select_idle_sibling(): 100% synchronous, absolutely nothing to 
> > > > be gained by cross cpu scheduling. Fortunately, most tasks do 
> > > > more than that, but nonetheless, select_idle_sibling() 
> > > > definitely is a two faced little b*tch.  I'd like to see the 
> > > > evil b*tch die, but something needs to replace its pretty 
> > > > face.  One thing that you can do is simply not call it when 
> > > > the context switch rate is extremely high.. its job is to 
> > > > recover overlap; if you're scheduling near your max, there's 
> > > > no win worth the cost.
> > > 
> > > Couldn't we make the cutoff dependent on sched_migration_cost? 
> > > If the wakeup comes in faster than that then don't spread.
> > 
> > No, that's too high, you lose too much of the pretty face. 
> > [...]
> 
> Then a logical proportion of it - such as half of it?

Hm.  Maybe better would be a quick boot-time benchmark, using some
multiple of your cross-core pipe ping-pong time?  That we know is a
complete waste of cycles, because almost all cycles are scheduler
cycles with no other work to be done, making firing up another
scheduler rather pointless.  If we're approaching that rate, we're
approaching a bad idea.
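
Userspace-wise the measurement itself is trivial, something like the 
below rough sketch - cpus 0/1 are arbitrary, error handling mostly 
omitted:

/* Bounce one byte between two pinned processes through a pair of
 * pipes, report the average cross-core round trip time. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static void pin(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	sched_setaffinity(0, sizeof(set), &set);
}

int main(void)
{
	int ping[2], pong[2], iters = 100000, i;
	struct timespec t0, t1;
	char c = 0;

	if (pipe(ping) || pipe(pong))
		return 1;

	if (fork() == 0) {
		pin(1);			/* child on cpu 1 */
		for (i = 0; i < iters; i++) {
			read(ping[0], &c, 1);
			write(pong[1], &c, 1);
		}
		_exit(0);
	}

	pin(0);				/* parent on cpu 0 */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++) {
		write(ping[1], &c, 1);
		read(pong[0], &c, 1);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("avg round trip: %.0f ns\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e9 +
		(t1.tv_nsec - t0.tv_nsec)) / iters);
	return 0;
}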

-Mike



* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-22 12:35                   ` Mike Galbraith
@ 2013-02-22 13:06                     ` Ingo Molnar
  2013-02-22 14:30                       ` Mike Galbraith
  0 siblings, 1 reply; 55+ messages in thread
From: Ingo Molnar @ 2013-02-22 13:06 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Michael Wang, LKML, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim


* Mike Galbraith <efault@gmx.de> wrote:

> > > No, that's too high, you lose too much of the pretty 
> > > face. [...]
> > 
> > Then a logical proportion of it - such as half of it?
> 
> Hm.  Maybe better would be a quick boot-time benchmark, using 
> some multiple of your cross-core pipe ping-pong time?  That we 
> know is a complete waste of cycles, because almost all cycles 
> are scheduler cycles with no other work to be done, making 
> firing up another scheduler rather pointless.  If we're 
> approaching that rate, we're approaching a bad idea.

Well, one problem with such dynamic boot time measurements is 
that they introduce a certain amount of uncertainty that persists 
for the whole lifetime of the booted up box - and they also suck 
in any sort of non-deterministic execution environment, such as 
virtualized systems.

I think it might be better to measure the scheduling rate all 
the time, and save the _shortest_ cross-cpu-wakeup and 
same-cpu-wakeup latencies (since bootup) as a reference number. 

We might be able to pull this off pretty cheaply as the 
scheduler clock is running all the time and we have all the 
timestamps needed.

Pretty quickly after bootup this 'shortest latency' would settle 
down to a very system specific (and pretty accurate) value.
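
As a sketch only - the names are made up, and the queue-time 
timestamp would have to be recorded in the wakeup path:

	static u64 min_wakeup_lat[2];	/* [0] same-cpu, [1] cross-cpu */

	/* cross_cpu: 0 or 1; queued/now: scheduler clock timestamps */
	static void record_wakeup_latency(int cross_cpu, u64 queued, u64 now)
	{
		u64 delta = now - queued;

		/* remember the shortest latency seen since bootup */
		if (!min_wakeup_lat[cross_cpu] || delta < min_wakeup_lat[cross_cpu])
			min_wakeup_lat[cross_cpu] = delta;
	}

..with 'now' taken when the task actually gets on a cpu.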

[ One downside would be an increased sensitivity to the accuracy
  and monotonicity of the scheduler clock - but that's something 
  we want to improve on anyway - and 'worst case' we get too 
  short latencies and we are where we are today. So it can only 
  improve the situation IMO. ]

Would you be interested in trying to hack on an auto-tuning 
feature like this?

Thanks,

	Ingo


* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-22 13:06                     ` Ingo Molnar
@ 2013-02-22 14:30                       ` Mike Galbraith
  2013-02-22 14:42                         ` Mike Galbraith
  0 siblings, 1 reply; 55+ messages in thread
From: Mike Galbraith @ 2013-02-22 14:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Michael Wang, LKML, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On Fri, 2013-02-22 at 14:06 +0100, Ingo Molnar wrote: 
> * Mike Galbraith <efault@gmx.de> wrote:
> 
> > > > No, that's too high, you lose too much of the pretty 
> > > > face. [...]
> > > 
> > > Then a logical proportion of it - such as half of it?
> > 
> > Hm.  Maybe better would be a quick boot-time benchmark, using 
> > some multiple of your cross-core pipe ping-pong time?  That we 
> > know is a complete waste of cycles, because almost all cycles 
> > are scheduler cycles with no other work to be done, making 
> > firing up another scheduler rather pointless.  If we're 
> > approaching that rate, we're approaching a bad idea.
> 
> Well, one problem with such dynamic boot time measurements is 
> that they introduce a certain amount of uncertainty that persists 
> for the whole lifetime of the booted up box - and they also suck 
> in any sort of non-deterministic execution environment, such as 
> virtualized systems.

Ok, bad idea.

> I think it might be better to measure the scheduling rate all 
> the time, and save the _shortest_ cross-cpu-wakeup and 
> same-cpu-wakeup latencies (since bootup) as a reference number. 
> 
> We might be able to pull this off pretty cheaply as the 
> scheduler clock is running all the time and we have all the 
> timestamps needed.

Yeah, that might work.  We have some quick kthreads, so saving ctx
distance may get close enough to scheduler cost to be good enough.

> Pretty quickly after bootup this 'shortest latency' would settle 
> down to a very system specific (and pretty accurate) value.
> 
> [ One downside would be an increased sensitivity to the accuracy
>   and monotonicity of the scheduler clock - but that's something 
>   we want to improve on anyway - and 'worst case' we get too 
>   short latencies and we are where we are today. So it can only 
>   improve the situation IMO. ]
> 
> Would you be interested in trying to hack on an auto-tuning 
> feature like this?

Yeah, should be easy, but a rainy day has to happen so I have time to
measure twiddle measure measure <curse> tweak.. ;-)

-Mike



* Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
  2013-02-22 14:30                       ` Mike Galbraith
@ 2013-02-22 14:42                         ` Mike Galbraith
  0 siblings, 0 replies; 55+ messages in thread
From: Mike Galbraith @ 2013-02-22 14:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Michael Wang, LKML, Paul Turner, Andrew Morton,
	alex.shi, Ram Pai, Nikunj A. Dadhania, Namhyung Kim

On Fri, 2013-02-22 at 15:30 +0100, Mike Galbraith wrote: 
> On Fri, 2013-02-22 at 14:06 +0100, Ingo Molnar wrote: 

> > I think it might be better to measure the scheduling rate all 
> > the time, and save the _shortest_ cross-cpu-wakeup and 
> > same-cpu-wakeup latencies (since bootup) as a reference number. 
> > 
> > We might be able to pull this off pretty cheaply as the 
> > scheduler clock is running all the time and we have all the 
> > timestamps needed.
> 
> Yeah, that might work.  We have some quick kthreads, so saving ctx
> distance may get close enough to scheduler cost to be good enough.

Or better, shortest idle to idle, that would include the current (bad)
nohz cost, and automatically shrink away when that cost shrinks.

-Mike



end of thread

Thread overview: 55+ messages
2013-01-29  9:08 [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair() Michael Wang
2013-01-29  9:09 ` [RFC PATCH v3 1/3] sched: schedule balance map foundation Michael Wang
2013-02-20 13:21   ` Peter Zijlstra
2013-02-21  4:52     ` Michael Wang
2013-02-20 13:25   ` Peter Zijlstra
2013-02-21  4:58     ` Michael Wang
2013-02-21 11:37       ` Peter Zijlstra
2013-02-22  2:53         ` Michael Wang
2013-02-22  3:33           ` Alex Shi
2013-02-22  4:19             ` Michael Wang
2013-02-22  4:46               ` Alex Shi
2013-02-22  5:05                 ` Michael Wang
2013-01-29  9:09 ` [RFC PATCH v3 2/3] sched: build schedule balance map Michael Wang
2013-01-29  9:10 ` [RFC PATCH v3 3/3] sched: simplify select_task_rq_fair() with " Michael Wang
2013-02-18  5:52 ` [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair() Michael Wang
2013-02-20 10:49 ` Ingo Molnar
2013-02-20 13:32   ` Peter Zijlstra
2013-02-20 14:05     ` Mike Galbraith
2013-02-21  5:21       ` Michael Wang
2013-02-21  5:14     ` Michael Wang
2013-02-21  4:51   ` Michael Wang
2013-02-21  6:11     ` Mike Galbraith
2013-02-21  7:00       ` Michael Wang
2013-02-21  8:10         ` Mike Galbraith
2013-02-21  9:08           ` Michael Wang
2013-02-21  9:43             ` Mike Galbraith
2013-02-22  2:36               ` Michael Wang
2013-02-22  5:02                 ` Mike Galbraith
2013-02-22  5:26                   ` Michael Wang
2013-02-22  6:13                     ` Mike Galbraith
2013-02-22  6:42                   ` Michael Wang
2013-02-22  8:17                     ` Mike Galbraith
2013-02-22  8:35                       ` Michael Wang
2013-02-22  8:21                 ` Peter Zijlstra
2013-02-22  9:10                   ` Michael Wang
2013-02-22  9:39                     ` Peter Zijlstra
2013-02-22  9:58                       ` Michael Wang
2013-02-21  9:20           ` Michael Wang
2013-02-21 10:20     ` Peter Zijlstra
2013-02-22  2:37       ` Michael Wang
2013-02-22  5:08         ` Mike Galbraith
2013-02-22  6:06           ` Michael Wang
2013-02-22  6:19             ` Mike Galbraith
2013-02-22  8:36         ` Peter Zijlstra
2013-02-22  9:11           ` Michael Wang
2013-02-22  9:57             ` Peter Zijlstra
2013-02-22 10:08               ` Michael Wang
2013-02-22  9:40           ` Mike Galbraith
2013-02-22  9:54             ` Ingo Molnar
2013-02-22 10:01               ` Mike Galbraith
2013-02-22 12:11                 ` Ingo Molnar
2013-02-22 12:35                   ` Mike Galbraith
2013-02-22 13:06                     ` Ingo Molnar
2013-02-22 14:30                       ` Mike Galbraith
2013-02-22 14:42                         ` Mike Galbraith
