* [PATCH v2 0/4] Improve numa load balancing
@ 2015-06-16 11:55 Srikar Dronamraju
  2015-06-16 11:55 ` [PATCH v2 1/4] sched/tip:Prefer numa hotness over cache hotness Srikar Dronamraju
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Srikar Dronamraju @ 2015-06-16 11:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, srikar, Rik van Riel, Mel Gorman

Here is a patchset that improves NUMA load balancing. It prefers NUMA
hotness over cache hotness during regular load balancing, takes
imbalance_pct into consideration in numa_has_capacity(), and provides a
few fixes to task_numa_migrate().

Here are results from 5 runs of autonuma-benchmark
(https://github.com/pholasek/autonuma-benchmark)

KernelVersion: 4.1.0-rc7-tip
	Testcase:         Min         Max         Avg      StdDev
  elapsed_numa01:      858.85      949.18      915.64       33.06
  elapsed_numa02:       23.09       29.89       26.43        2.18
	Testcase:         Min         Max         Avg      StdDev
   system_numa01:     1516.72     1855.08     1686.24      113.95
   system_numa02:       63.69       79.06       70.35        5.87
	Testcase:         Min         Max         Avg      StdDev
     user_numa01:    73284.76    80818.21    78060.88     2773.60
     user_numa02:     1690.18     2071.07     1821.64      140.25
	Testcase:         Min         Max         Avg      StdDev
    total_numa01:    74801.50    82572.60    79747.12     2875.61
    total_numa02:     1753.87     2142.77     1891.99      143.59

KernelVersion: 4.1.0-rc7-tip + patchset

	Testcase:         Min         Max         Avg      StdDev     %Change
  elapsed_numa01:      718.47      916.32      867.63       75.47       5.24%
  elapsed_numa02:       19.12       28.79       24.78        3.46       5.73%
	Testcase:         Min         Max         Avg      StdDev     %Change
   system_numa01:      870.48     1493.04     1208.64      202.34      31.99%
   system_numa02:       41.25       64.56       55.12        7.96      23.59%
	Testcase:         Min         Max         Avg      StdDev     %Change
     user_numa01:    57736.89    74336.14    70092.39     6213.45      10.72%
     user_numa02:     1478.86     2122.50     1768.01      221.06       2.53%
	Testcase:         Min         Max         Avg      StdDev     %Change
    total_numa01:    58607.40    75628.90    71301.04     6380.18      11.17%
    total_numa02:     1520.11     2183.44     1823.13      227.16       3.15%




 Performance counter stats for 'system wide':

numa01
KernelVersion: 4.1.0-rc7-tip
          7,58,804      cs                                                           [100.00%]
          1,29,961      migrations                                                   [100.00%]
          7,08,643      faults
    3,97,97,92,613      cache-misses

     949.188997751 seconds time elapsed

KernelVersion: 4.1.0-rc7-tip + patchset
         11,38,993      cs                                                           [100.00%]
          2,05,490      migrations                                                   [100.00%]
          8,72,140      faults
    3,97,61,99,522      cache-misses

     907.583509759 seconds time elapsed


numa02
KernelVersion: 4.1.0-rc7-tip
            29,573      cs                                                           [100.00%]
             4,671      migrations                                                   [100.00%]
            33,236      faults
      12,68,88,021      cache-misses

      29.897733416 seconds time elapsed

KernelVersion: 4.1.0-rc7-tip + patchset
            23,422      cs                                                           [100.00%]
             4,306      migrations                                                   [100.00%]
            23,858      faults
       9,98,87,326      cache-misses

      22.776749102 seconds time elapsed

# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 0 size: 32425 MB
node 0 free: 31352 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 31711 MB
node 1 free: 30147 MB
node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 2 size: 30431 MB
node 2 free: 29636 MB
node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 3 size: 32219 MB
node 3 free: 31053 MB
node distances:
node   0   1   2   3
  0:  10  20  40  40
  1:  20  10  40  40
  2:  40  40  10  20
  3:  40  40  20  10

Srikar Dronamraju (4):
  sched/tip:Prefer numa hotness over cache hotness
  sched: Consider imbalance_pct when comparing loads in numa_has_capacity
  sched: Fix task_numa_migrate to always update preferred node
  sched: Use correct nid while evaluating task weights

 kernel/sched/fair.c     | 109 ++++++++++++++----------------------------------
 kernel/sched/features.h |  18 +++-----
 2 files changed, 37 insertions(+), 90 deletions(-)

--
1.8.3.1



* [PATCH v2 1/4] sched/tip:Prefer numa hotness over cache hotness
  2015-06-16 11:55 [PATCH v2 0/4] Improve numa load balancing Srikar Dronamraju
@ 2015-06-16 11:55 ` Srikar Dronamraju
  2015-07-06 15:50   ` [tip:sched/core] sched/numa: Prefer NUMA " tip-bot for Srikar Dronamraju
  2015-07-07  6:49   ` tip-bot for Srikar Dronamraju
  2015-06-16 11:56 ` [PATCH v2 2/4] sched:Consider imbalance_pct when comparing loads in numa_has_capacity Srikar Dronamraju
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 23+ messages in thread
From: Srikar Dronamraju @ 2015-06-16 11:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, srikar, Rik van Riel, Mel Gorman

The current load balancer may not try to prevent a task from moving out
of a preferred node to a less preferred node. The reasons for this are:

- Since the sched features NUMA and NUMA_RESIST_LOWER are disabled by
  default, migrate_degrades_locality() always returns false.

- Even if NUMA_RESIST_LOWER were to be enabled, migrate_degrades_locality()
  never gets called if the task is cache hot.

The above behaviour can mean that tasks move out of their preferred
node, only to be eventually brought back to it by the NUMA balancer
(due to higher NUMA faults).

To avoid the above, this commit merges migrate_degrades_locality() and
migrate_improves_locality(). It also replaces the three sched features
NUMA, NUMA_FAVOUR_HIGHER and NUMA_RESIST_LOWER with a single sched
feature, NUMA.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changes from previous version:
- Rebased to tip.

Added Rik's Ack based on
http://lkml.kernel.org/r/557845D5.6060800@redhat.com.
Rik, please let me know if that is not okay.

 kernel/sched/fair.c     | 88 ++++++++++++++-----------------------------------
 kernel/sched/features.h | 18 +++-------
 2 files changed, 30 insertions(+), 76 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7210ae8..5ecc43da7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5662,72 +5662,40 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 
 #ifdef CONFIG_NUMA_BALANCING
 /*
- * Returns true if the destination node is the preferred node.
- * Needs to match fbq_classify_rq(): if there is a runnable task
- * that is not on its preferred node, we should identify it.
+ * Returns 1, if task migration degrades locality
+ * Returns 0, if task migration improves locality i.e migration preferred.
+ * Returns -1, if task migration is not affected by locality.
  */
-static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
-{
-	struct numa_group *numa_group = rcu_dereference(p->numa_group);
-	unsigned long src_faults, dst_faults;
-	int src_nid, dst_nid;
-
-	if (!sched_feat(NUMA_FAVOUR_HIGHER) || !p->numa_faults ||
-	    !(env->sd->flags & SD_NUMA)) {
-		return false;
-	}
-
-	src_nid = cpu_to_node(env->src_cpu);
-	dst_nid = cpu_to_node(env->dst_cpu);
-
-	if (src_nid == dst_nid)
-		return false;
-
-	/* Encourage migration to the preferred node. */
-	if (dst_nid == p->numa_preferred_nid)
-		return true;
-
-	/* Migrating away from the preferred node is bad. */
-	if (src_nid == p->numa_preferred_nid)
-		return false;
-
-	if (numa_group) {
-		src_faults = group_faults(p, src_nid);
-		dst_faults = group_faults(p, dst_nid);
-	} else {
-		src_faults = task_faults(p, src_nid);
-		dst_faults = task_faults(p, dst_nid);
-	}
-
-	return dst_faults > src_faults;
-}
 
-
-static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
+static int migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 {
 	struct numa_group *numa_group = rcu_dereference(p->numa_group);
 	unsigned long src_faults, dst_faults;
 	int src_nid, dst_nid;
 
-	if (!sched_feat(NUMA) || !sched_feat(NUMA_RESIST_LOWER))
-		return false;
-
 	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
-		return false;
+		return -1;
+
+	if (!sched_feat(NUMA))
+		return -1;
 
 	src_nid = cpu_to_node(env->src_cpu);
 	dst_nid = cpu_to_node(env->dst_cpu);
 
 	if (src_nid == dst_nid)
-		return false;
+		return -1;
 
-	/* Migrating away from the preferred node is bad. */
-	if (src_nid == p->numa_preferred_nid)
-		return true;
+	/* Migrating away from the preferred node is always bad. */
+	if (src_nid == p->numa_preferred_nid) {
+		if (env->src_rq->nr_running > env->src_rq->nr_preferred_running)
+			return 1;
+		else
+			return -1;
+	}
 
 	/* Encourage migration to the preferred node. */
 	if (dst_nid == p->numa_preferred_nid)
-		return false;
+		return 0;
 
 	if (numa_group) {
 		src_faults = group_faults(p, src_nid);
@@ -5741,16 +5709,10 @@ static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 }
 
 #else
-static inline bool migrate_improves_locality(struct task_struct *p,
+static inline int migrate_degrades_locality(struct task_struct *p,
 					     struct lb_env *env)
 {
-	return false;
-}
-
-static inline bool migrate_degrades_locality(struct task_struct *p,
-					     struct lb_env *env)
-{
-	return false;
+	return -1;
 }
 #endif
 
@@ -5760,7 +5722,7 @@ static inline bool migrate_degrades_locality(struct task_struct *p,
 static
 int can_migrate_task(struct task_struct *p, struct lb_env *env)
 {
-	int tsk_cache_hot = 0;
+	int tsk_cache_hot;
 
 	lockdep_assert_held(&env->src_rq->lock);
 
@@ -5818,13 +5780,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * 2) task is cache cold, or
 	 * 3) too many balance attempts have failed.
 	 */
-	tsk_cache_hot = task_hot(p, env);
-	if (!tsk_cache_hot)
-		tsk_cache_hot = migrate_degrades_locality(p, env);
+	tsk_cache_hot = migrate_degrades_locality(p, env);
+	if (tsk_cache_hot == -1)
+		tsk_cache_hot = task_hot(p, env);
 
-	if (migrate_improves_locality(p, env) || !tsk_cache_hot ||
+	if (tsk_cache_hot <= 0 ||
 	    env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
-		if (tsk_cache_hot) {
+		if (tsk_cache_hot == 1) {
 			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
 			schedstat_inc(p, se.statistics.nr_forced_migrations);
 		}
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 91e33cd..83a50e7 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -79,20 +79,12 @@ SCHED_FEAT(LB_MIN, false)
  * numa_balancing=
  */
 #ifdef CONFIG_NUMA_BALANCING
-SCHED_FEAT(NUMA,	false)
 
 /*
- * NUMA_FAVOUR_HIGHER will favor moving tasks towards nodes where a
- * higher number of hinting faults are recorded during active load
- * balancing.
+ * NUMA will favor moving tasks towards nodes where a higher number of
+ * hinting faults are recorded during active load balancing. It will
+ * resist moving tasks towards nodes where a lower number of hinting
+ * faults have been recorded.
  */
-SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
-
-/*
- * NUMA_RESIST_LOWER will resist moving tasks towards nodes where a
- * lower number of hinting faults have been recorded. As this has
- * the potential to prevent a task ever migrating to a new node
- * due to CPU overload it is disabled by default.
- */
-SCHED_FEAT(NUMA_RESIST_LOWER, false)
+SCHED_FEAT(NUMA,	true)
 #endif
-- 
1.8.3.1



* [PATCH v2 2/4] sched:Consider imbalance_pct when comparing loads in numa_has_capacity
  2015-06-16 11:55 [PATCH v2 0/4] Improve numa load balancing Srikar Dronamraju
  2015-06-16 11:55 ` [PATCH v2 1/4] sched/tip:Prefer numa hotness over cache hotness Srikar Dronamraju
@ 2015-06-16 11:56 ` Srikar Dronamraju
  2015-06-16 14:39   ` Rik van Riel
                     ` (2 more replies)
  2015-06-16 11:56 ` [PATCH v2 3/4] sched:Fix task_numa_migrate to always update preferred node Srikar Dronamraju
  2015-06-16 11:56 ` [PATCH v2 4/4] sched:Use correct nid while evaluating task weights Srikar Dronamraju
  3 siblings, 3 replies; 23+ messages in thread
From: Srikar Dronamraju @ 2015-06-16 11:56 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, srikar, Rik van Riel, Mel Gorman

This is consistent with all other load balancing instances, where we
absorb unfairness up to env->imbalance_pct. Absorbing unfairness up to
env->imbalance_pct allows tasks to be pulled to, and retained on, their
preferred nodes.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5ecc43da7..7b23efa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1415,8 +1415,9 @@ static bool numa_has_capacity(struct task_numa_env *env)
 	 * --------------------- vs ---------------------
 	 * src->compute_capacity    dst->compute_capacity
 	 */
-	if (src->load * dst->compute_capacity >
-	    dst->load * src->compute_capacity)
+	if (src->load * dst->compute_capacity * env->imbalance_pct >
+
+	    dst->load * src->compute_capacity * 100)
 		return true;
 
 	return false;
-- 
1.8.3.1



* [PATCH v2 3/4] sched:Fix task_numa_migrate to always update preferred node
  2015-06-16 11:55 [PATCH v2 0/4] Improve numa load balancing Srikar Dronamraju
  2015-06-16 11:55 ` [PATCH v2 1/4] sched/tip:Prefer numa hotness over cache hotness Srikar Dronamraju
  2015-06-16 11:56 ` [PATCH v2 2/4] sched:Consider imbalance_pct when comparing loads in numa_has_capacity Srikar Dronamraju
@ 2015-06-16 11:56 ` Srikar Dronamraju
  2015-06-16 14:54   ` Rik van Riel
  2015-06-16 17:18   ` Rik van Riel
  2015-06-16 11:56 ` [PATCH v2 4/4] sched:Use correct nid while evaluating task weights Srikar Dronamraju
  3 siblings, 2 replies; 23+ messages in thread
From: Srikar Dronamraju @ 2015-06-16 11:56 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, srikar, Rik van Riel, Mel Gorman

In task_numa_migrate(), env.dst_nid points either to the preferred node
or to a node that has free capacity and a higher task weight than the
current node.

Currently, in such a scenario, there are checks to see whether tasks in
the numa_group have previously run on the node that has free capacity,
before the preferred node is updated. Commit c1ceac62 ("sched/numa:
Reduce conflict between fbq_classify_rq() and migration") gives
preference to the preferred node during load balancing. Hence, if
setting the preferred node after this evaluation is skipped, the task
might miss the opportunity to move to its preferred node at load
balancing time.

In such a scenario, it makes sense to unconditionally set env.dst_nid as
the preferred node, unless that node is already the preferred node.

While here, update env.dst_nid only when both task and group benefit,
as per the comment in the code.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7b23efa..d1aa374 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1503,7 +1503,7 @@ static int task_numa_migrate(struct task_struct *p)
 			/* Only consider nodes where both task and groups benefit */
 			taskimp = task_weight(p, nid, dist) - taskweight;
 			groupimp = group_weight(p, nid, dist) - groupweight;
-			if (taskimp < 0 && groupimp < 0)
+			if (taskimp < 0 || groupimp < 0)
 				continue;
 
 			env.dist = dist;
@@ -1519,16 +1519,9 @@ static int task_numa_migrate(struct task_struct *p)
 	 * and is migrating into one of the workload's active nodes, remember
 	 * this node as the task's preferred numa node, so the workload can
 	 * settle down.
-	 * A task that migrated to a second choice node will be better off
-	 * trying for a better one later. Do not set the preferred node here.
 	 */
 	if (p->numa_group) {
-		if (env.best_cpu == -1)
-			nid = env.src_nid;
-		else
-			nid = env.dst_nid;
-
-		if (node_isset(nid, p->numa_group->active_nodes))
+		if (env.dst_nid != p->numa_preferred_nid)
 			sched_setnuma(p, env.dst_nid);
 	}
 
-- 
1.8.3.1



* [PATCH v2 4/4] sched:Use correct nid while evaluating task weights
  2015-06-16 11:55 [PATCH v2 0/4] Improve numa load balancing Srikar Dronamraju
                   ` (2 preceding siblings ...)
  2015-06-16 11:56 ` [PATCH v2 3/4] sched:Fix task_numa_migrate to always update preferred node Srikar Dronamraju
@ 2015-06-16 11:56 ` Srikar Dronamraju
  2015-06-16 15:00   ` Rik van Riel
  3 siblings, 1 reply; 23+ messages in thread
From: Srikar Dronamraju @ 2015-06-16 11:56 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, srikar, Rik van Riel, Mel Gorman

In task_numa_migrate(), while evaluating other nodes for group
consolidation, env.dst_nid is used instead of the iterator nid.
Using env.dst_nid means dist is always the same; in fact, the same
dist was already calculated while evaluating the preferred node.

Fix the above to use the iterator nid.

Also, the task/group weights from src_nid should be calculated
irrespective of the NUMA topology type.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d1aa374..e1b3393 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1493,9 +1493,8 @@ static int task_numa_migrate(struct task_struct *p)
 			if (nid == env.src_nid || nid == p->numa_preferred_nid)
 				continue;
 
-			dist = node_distance(env.src_nid, env.dst_nid);
-			if (sched_numa_topology_type == NUMA_BACKPLANE &&
-						dist != env.dist) {
+			dist = node_distance(env.src_nid, nid);
+			if (dist != env.dist) {
 				taskweight = task_weight(p, env.src_nid, dist);
 				groupweight = group_weight(p, env.src_nid, dist);
 			}
-- 
1.8.3.1



* Re: [PATCH v2 2/4] sched:Consider imbalance_pct when comparing loads in numa_has_capacity
  2015-06-16 11:56 ` [PATCH v2 2/4] sched:Consider imbalance_pct when comparing loads in numa_has_capacity Srikar Dronamraju
@ 2015-06-16 14:39   ` Rik van Riel
  2015-06-22 16:29     ` Srikar Dronamraju
  2015-07-06 15:50   ` [tip:sched/core] sched/numa: Consider 'imbalance_pct' when comparing loads in numa_has_capacity() tip-bot for Srikar Dronamraju
  2015-07-07  6:49   ` tip-bot for Srikar Dronamraju
  2 siblings, 1 reply; 23+ messages in thread
From: Rik van Riel @ 2015-06-16 14:39 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra; +Cc: linux-kernel, Mel Gorman

On 06/16/2015 07:56 AM, Srikar Dronamraju wrote:
> This is consistent with all other load balancing instances where we
> absorb unfairness upto env->imbalance_pct. Absorbing unfairness upto
> env->imbalance_pct allows to pull and retain task to their preferred
> nodes.
> 
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

How does this work with other workloads, e.g.
single instance SPECjbb2005, or two SPECjbb2005
instances on a four node system?

Is the load still balanced evenly between nodes
with this patch?


* Re: [PATCH v2 3/4] sched:Fix task_numa_migrate to always update preferred node
  2015-06-16 11:56 ` [PATCH v2 3/4] sched:Fix task_numa_migrate to always update preferred node Srikar Dronamraju
@ 2015-06-16 14:54   ` Rik van Riel
  2015-06-16 17:19     ` Srikar Dronamraju
  2015-06-16 17:18   ` Rik van Riel
  1 sibling, 1 reply; 23+ messages in thread
From: Rik van Riel @ 2015-06-16 14:54 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra; +Cc: linux-kernel, Mel Gorman

On 06/16/2015 07:56 AM, Srikar Dronamraju wrote:

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7b23efa..d1aa374 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1503,7 +1503,7 @@ static int task_numa_migrate(struct task_struct *p)
>  			/* Only consider nodes where both task and groups benefit */
>  			taskimp = task_weight(p, nid, dist) - taskweight;
>  			groupimp = group_weight(p, nid, dist) - groupweight;
> -			if (taskimp < 0 && groupimp < 0)
> +			if (taskimp < 0 || groupimp < 0)
>  				continue;

NAK

Here is the simplest possible example of a workload where
the above change breaks things.

1) A two node system
2) Two processes, with two threads each
3) Two thirds of memory accesses in private faults,
   the other third in shared faults (shared memory
   with the other thread)

Within workload A, we have threads T1 and T2, each
on different NUMA nodes, N1 and N2. The same is true
inside workload B.

They each access most of their memory on their own node,
but also have a lot of shared memory access with each
other.

Things would run faster if each process had both of
its threads on the same NUMA node.

This involves swapping around the threads of the two
workloads, moving a thread of each process to the node
where the whole process has the highest group_weight.

However, because a majority of memory accesses for
each thread are local, your above change will result
in the kernel not evaluating the best node for the
process (due to a negative taskimp).
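
To make the taskimp/groupimp arithmetic concrete, here is a tiny sketch
with entirely made-up fault weights (illustrative only, not kernel code):

  #include <stdio.h>

  int main(void)
  {
      /* made-up scores for thread T1 of workload A evaluating node N2 */
      long taskweight  = 700, task_on_n2  = 500;  /* T1 alone: worse on N2 */
      long groupweight = 400, group_on_n2 = 900;  /* group A: better on N2 */

      long taskimp  = task_on_n2  - taskweight;   /* -200 */
      long groupimp = group_on_n2 - groupweight;  /* +500 */

      printf("skip with (taskimp < 0 && groupimp < 0): %d\n",
             taskimp < 0 && groupimp < 0);        /* 0 -> N2 still evaluated */
      printf("skip with (taskimp < 0 || groupimp < 0): %d\n",
             taskimp < 0 || groupimp < 0);        /* 1 -> N2 skipped */
      return 0;
  }

With the current '&&' check, N2 is still evaluated and the group can
converge; with the proposed '||' check, it is skipped.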

Your patch works fine with the autonuma benchmark
tests, because they do essentially all shared
faults, not a more realistic mix of shared and private
faults like you would see in eg. SPECjbb2005, or in a
database or virtual machine workload.

> @@ -1519,16 +1519,9 @@ static int task_numa_migrate(struct task_struct *p)
>  	 * and is migrating into one of the workload's active nodes, remember
>  	 * this node as the task's preferred numa node, so the workload can
>  	 * settle down.
> -	 * A task that migrated to a second choice node will be better off
> -	 * trying for a better one later. Do not set the preferred node here.
>  	 */
>  	if (p->numa_group) {
> -		if (env.best_cpu == -1)
> -			nid = env.src_nid;
> -		else
> -			nid = env.dst_nid;
> -
> -		if (node_isset(nid, p->numa_group->active_nodes))
> +		if (env.dst_nid != p->numa_preferred_nid)
>  			sched_setnuma(p, env.dst_nid);
>  	}

NAK

This can also break group convergence, by setting the
task's preferred nid to the node of a CPU it may not
even be migrating to.

Setting the preferred nid to a node the task should
not be migrating to (but may, because the better ones
are currently overloaded) prevents the task from moving
to its preferred nid at a later time (when the good
nodes are no longer overloaded).

Have you tested this patch with any workload that does
not consist of tasks that are running at 100% cpu time
for the duration of the test?


* Re: [PATCH v2 4/4] sched:Use correct nid while evaluating task weights
  2015-06-16 11:56 ` [PATCH v2 4/4] sched:Use correct nid while evaluating task weights Srikar Dronamraju
@ 2015-06-16 15:00   ` Rik van Riel
  2015-06-16 17:26     ` Srikar Dronamraju
  0 siblings, 1 reply; 23+ messages in thread
From: Rik van Riel @ 2015-06-16 15:00 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra; +Cc: linux-kernel, Mel Gorman

On 06/16/2015 07:56 AM, Srikar Dronamraju wrote:
> In task_numa_migrate(), while evaluating other nodes for group
> consolidation, env.dst_nid is used instead of using the iterator nid.
> Using env.dst_nid would mean dist is always the same. Infact the same
> dist was calculated above while evaluating the preferred node.
> 
> Fix the above to use the iterator nid.

Good catch.

> Also the task/group weights from the src_nid should be calculated
> irrespective of numa topology type.

If you look at score_nearby_nodes(), you will see that
maxdist is only used when the topology is NUMA_BACKPLANE.

The source score never changes on a directly connected or
NUMA_GLUELESS_MESH type system, so the source score does not need to be
recalculated unless we are actually dealing with a NUMA_BACKPLANE
topology.

Looking forward to a v2 with just the first fix.

> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
>  kernel/sched/fair.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d1aa374..e1b3393 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1493,9 +1493,8 @@ static int task_numa_migrate(struct task_struct *p)
>  			if (nid == env.src_nid || nid == p->numa_preferred_nid)
>  				continue;
>  
> -			dist = node_distance(env.src_nid, env.dst_nid);
> -			if (sched_numa_topology_type == NUMA_BACKPLANE &&
> -						dist != env.dist) {
> +			dist = node_distance(env.src_nid, nid);
> +			if (dist != env.dist) {
>  				taskweight = task_weight(p, env.src_nid, dist);
>  				groupweight = group_weight(p, env.src_nid, dist);
>  			}
> 



* Re: [PATCH v2 3/4] sched:Fix task_numa_migrate to always update preferred node
  2015-06-16 11:56 ` [PATCH v2 3/4] sched:Fix task_numa_migrate to always update preferred node Srikar Dronamraju
  2015-06-16 14:54   ` Rik van Riel
@ 2015-06-16 17:18   ` Rik van Riel
  1 sibling, 0 replies; 23+ messages in thread
From: Rik van Riel @ 2015-06-16 17:18 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra; +Cc: linux-kernel, Mel Gorman

On 06/16/2015 07:56 AM, Srikar Dronamraju wrote:

> @@ -1519,16 +1519,9 @@ static int task_numa_migrate(struct task_struct *p)
>  	 * and is migrating into one of the workload's active nodes, remember
>  	 * this node as the task's preferred numa node, so the workload can
>  	 * settle down.
> -	 * A task that migrated to a second choice node will be better off
> -	 * trying for a better one later. Do not set the preferred node here.
>  	 */
>  	if (p->numa_group) {
> -		if (env.best_cpu == -1)
> -			nid = env.src_nid;
> -		else
> -			nid = env.dst_nid;
> -
> -		if (node_isset(nid, p->numa_group->active_nodes))
> +		if (env.dst_nid != p->numa_preferred_nid)
>  			sched_setnuma(p, env.dst_nid);
>  	}

Looking at the original code again, it looks like my code
has a potential bug (or at least downside), too.

We set p->numa_group->active_nodes depending on which nodes
a group triggers many NUMA faults happen from (the CPUs the
tasks in the group were running on when they had NUMA faults).

This means if a workload is not yet converged, the active_nodes
mask may be much larger than desired, and we can end up setting
p->numa_preferred_nid to a node that is currently in the active_nodes
mask, but really shouldn't be...

I have no ideas on how to improve that situation, though :)



* Re: [PATCH v2 3/4] sched:Fix task_numa_migrate to always update preferred node
  2015-06-16 14:54   ` Rik van Riel
@ 2015-06-16 17:19     ` Srikar Dronamraju
  2015-06-16 18:25       ` Rik van Riel
  0 siblings, 1 reply; 23+ messages in thread
From: Srikar Dronamraju @ 2015-06-16 17:19 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Mel Gorman

> > @@ -1503,7 +1503,7 @@ static int task_numa_migrate(struct task_struct *p)
> >  			/* Only consider nodes where both task and groups benefit */
> >  			taskimp = task_weight(p, nid, dist) - taskweight;
> >  			groupimp = group_weight(p, nid, dist) - groupweight;
> > -			if (taskimp < 0 && groupimp < 0)
> > +			if (taskimp < 0 || groupimp < 0)
> >  				continue;
>
> NAK
>
> Here is the simplest possible example of a workload where
> the above change breaks things.

In which case, shouldn't we be updating the comment above?
The comment says "/* Only consider nodes where both task and groups
benefit */". From that comment, I felt taskimp and groupimp should both
be positive.

The current code is a bit confusing. There are 3 possible cases where
the current code allows us to do further updates:
1. task and group faults improve, which is a good case.
2. task faults improve, group faults decrease.
3. task faults decrease, group faults increase (mostly a good case).

Case 2 and case 3 are somewhat contradictory, but our actions seem to be
the same for both. Shouldn't we have given preference to groupimp here,
i.e. shouldn't we skip the update in case 2?

Would the below check be okay?
if (groupimp < 0 || (taskimp < 0 && !groupimp))
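
Purely as an illustration of the three conditions being discussed (the
current '&&', the proposed '||', and the alternative above; none of this
is proposed kernel code), enumerating the sign combinations shows where
they differ:

  #include <stdio.h>

  int main(void)
  {
      static const long vals[] = { -1, 0, 1 };

      printf("taskimp groupimp  &&-skip  ||-skip  alt-skip\n");
      for (int i = 0; i < 3; i++) {
          for (int j = 0; j < 3; j++) {
              long taskimp = vals[i], groupimp = vals[j];

              int skip_and = (taskimp < 0 && groupimp < 0);
              int skip_or  = (taskimp < 0 || groupimp < 0);
              int skip_alt = (groupimp < 0 ||
                              (taskimp < 0 && !groupimp));

              printf("%7ld %8ld %8d %8d %9d\n",
                     taskimp, groupimp, skip_and, skip_or, skip_alt);
          }
      }
      return 0;
  }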

<snipped>
>
> However, because a majority of memory accesses for
> each thread are local, your above change will result
> in the kernel not evaluating the best node for the
> process (due to a negative taskimp).
>
> Your patch works fine with the autonuma benchmark
> tests, because they do essentially all shared
> faults, not a more realistic mix of shared and private
> faults like you would see in eg. SPECjbb2005, or in a
> database or virtual machine workload.

IIUC numa02 has only private accesses while numa01 has mostly shared
faults.  I plan to run SPECjbb2005 and see how they vary with the
changes.

>
> > @@ -1519,16 +1519,9 @@ static int task_numa_migrate(struct task_struct *p)
> >  	 * and is migrating into one of the workload's active nodes, remember
> >  	 * this node as the task's preferred numa node, so the workload can
> >  	 * settle down.
> > -	 * A task that migrated to a second choice node will be better off
> > -	 * trying for a better one later. Do not set the preferred node here.
> >  	 */
> >  	if (p->numa_group) {
> > -		if (env.best_cpu == -1)
> > -			nid = env.src_nid;
> > -		else
> > -			nid = env.dst_nid;
> > -
> > -		if (node_isset(nid, p->numa_group->active_nodes))
> > +		if (env.dst_nid != p->numa_preferred_nid)
> >  			sched_setnuma(p, env.dst_nid);
> >  	}
>
> NAK
>
> This can also break group convergence, by setting the
> task's preferred nid to the node of a CPU it may not
> even be migrating to.

I haven't changed the parameters to sched_setnuma(). Previously, too,
sched_setnuma() was being called with env.dst_nid. What has changed is
the evaluation of nid based on env.best_cpu, and the use of that nid to
decide whether env.dst_nid should be set as the preferred node.

With this change, we are setting the preferred node more often, but we
are not changing the logic of which node gets set as the preferred node.

I did wonder if the intent was to set nid as the preferred node, i.e.:

-			sched_setnuma(p, env.dst_nid);
+			sched_setnuma(p, nid);


>
> Setting the preferred nid to a node the task should
> not be migrating to (but may, because the better ones
> are currently overloaded) prevents the task from moving
> to its preferred nid at a later time (when the good
> nodes are no longer overloaded).
>
> Have you tested this patch with any workload that does
> not consist of tasks that are running at 100% cpu time
> for the duration of the test?
>

Okay, I will try to see if I can run some workloads and get back to you.

--
Thanks and Regards
Srikar Dronamraju



* Re: [PATCH v2 4/4] sched:Use correct nid while evaluating task weights
  2015-06-16 15:00   ` Rik van Riel
@ 2015-06-16 17:26     ` Srikar Dronamraju
  0 siblings, 0 replies; 23+ messages in thread
From: Srikar Dronamraju @ 2015-06-16 17:26 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Mel Gorman

* Rik van Riel <riel@redhat.com> [2015-06-16 11:00:18]:

> On 06/16/2015 07:56 AM, Srikar Dronamraju wrote:
> > In task_numa_migrate(), while evaluating other nodes for group
> > consolidation, env.dst_nid is used instead of using the iterator nid.
> > Using env.dst_nid would mean dist is always the same. Infact the same
> > dist was calculated above while evaluating the preferred node.
> > 
> > Fix the above to use the iterator nid.
> 
> Good catch.
> 
> > Also the task/group weights from the src_nid should be calculated
> > irrespective of numa topology type.
> 
> If you look at score_nearby_nodes(), you will see that
> maxdist is only used when the topology is NUMA_BACKPLANE.
> 
> The source score never changes when having a directly
> connected or NUMA_GLUELESS_MESH type system, so the source
> score does not need to be recalculated unless we are actually
> dealing with a NUMA_BACKPLANE topology.
> 
> Looking forward to a v2 with just the first fix.

Okay, will fix accordingly.

-- 
Thanks and Regards
Srikar Dronamraju



* Re: [PATCH v2 3/4] sched:Fix task_numa_migrate to always update preferred node
  2015-06-16 17:19     ` Srikar Dronamraju
@ 2015-06-16 18:25       ` Rik van Riel
  0 siblings, 0 replies; 23+ messages in thread
From: Rik van Riel @ 2015-06-16 18:25 UTC (permalink / raw)
  To: Srikar Dronamraju; +Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Mel Gorman

On 06/16/2015 01:19 PM, Srikar Dronamraju wrote:
>>> @@ -1503,7 +1503,7 @@ static int task_numa_migrate(struct task_struct *p)
>>>  			/* Only consider nodes where both task and groups benefit */
>>>  			taskimp = task_weight(p, nid, dist) - taskweight;
>>>  			groupimp = group_weight(p, nid, dist) - groupweight;
>>> -			if (taskimp < 0 && groupimp < 0)
>>> +			if (taskimp < 0 || groupimp < 0)
>>>  				continue;
>>
>> NAK
>>
>> Here is the simplest possible example of a workload where
>> the above change breaks things.
> 
> In which case, shouldnt we be updating the comment above.
> The comment says. "/* Only consider nodes where both task and groups
> benefit */". From the comment, I felt taskimp and groupimp should be
> positive.

You are correct. That comment does need fixing.

> The current code is a bit confusing. There are 3 possible cases where
> the current code allows us to do further updates.
> 1. task and group faults improving which is a good case.
> 2. task faults improve, group faults decrease.
> 3. task faults decrease, group faults increase. (mostly a good case)
> 
> case 2 and case 3 are somewhat contradictory. But our actions seem to be
> the same for both. Shouldnt we have given preference to groupimp here ..
> i.e shouldnt we be not updating in case 2.

If you look at task_numa_compare(), you will find that
cases (2) and (3) are handled in different situations.

If the task faults improve, but group faults decrease,
then a task swap is only done if we are considering
the task score for the task, which is done when comparing
it against another task in the same group.

In other words, if task A and task B are part of the
same numa_group NG, then we will look at the task score
for the tasks.

Case (3), where the numa group placement improves, is
important to help tasks in a numa_group move towards
each other in the system.

If we have a 2-node system with 2 large active numa groups,
and multiple tasks in each group, we want each group to
end up on its own numa node.

This can only happen if we sometimes move tasks to a place
where they have a lower task score, but a higher group score.


> Would a below check be okay?
> if (groupimp < 0 || (taskimp < 0 && !groupimp))

No, for reasons described above.

We want to retain the ability to place tasks correctly within
a numa_group.

> <snipped>
>>
>> However, because a majority of memory accesses for
>> each thread are local, your above change will result
>> in the kernel not evaluating the best node for the
>> process (due to a negative taskimp).
>>
>> Your patch works fine with the autonuma benchmark
>> tests, because they do essentially all shared
>> faults, not a more realistic mix of shared and private
>> faults like you would see in eg. SPECjbb2005, or in a
>> database or virtual machine workload.
> 
> IIUC numa02 has only private accesses while numa01 has mostly shared
> faults.  I plan to run SPECjbb2005 and see how they vary with the
> changes.

In this context, "private" and "shared" refer to the NUMA faults
experienced by the kernel.

"private" means the same thread is accessing the same pages over
and over again, without (many) other accesses to those pages.

"shared" means the same pages are accessed by various threads (in
the numa group).

>>> @@ -1519,16 +1519,9 @@ static int task_numa_migrate(struct task_struct *p)
>>>  	 * and is migrating into one of the workload's active nodes, remember
>>>  	 * this node as the task's preferred numa node, so the workload can
>>>  	 * settle down.
>>> -	 * A task that migrated to a second choice node will be better off
>>> -	 * trying for a better one later. Do not set the preferred node here.
>>>  	 */
>>>  	if (p->numa_group) {
>>> -		if (env.best_cpu == -1)
>>> -			nid = env.src_nid;
>>> -		else
>>> -			nid = env.dst_nid;
>>> -
>>> -		if (node_isset(nid, p->numa_group->active_nodes))
>>> +		if (env.dst_nid != p->numa_preferred_nid)
>>>  			sched_setnuma(p, env.dst_nid);
>>>  	}
>>
>> NAK
>>
>> This can also break group convergence, by setting the
>> task's preferred nid to the node of a CPU it may not
>> even be migrating to.
> 
> I havent changed the parameters to sched_setnuma. Previously also
> sched_setnuma was getting set to env.dst_nid. Whats changed is
> evaluating nid based on env.best_cpu and using the nid to see if we
> should be setting env.dst_nid as preferred node.
> 
> With this change, we are setting the preferred node more often, 

That is exactly the problem.

Setting p->numa_preferred_nid means the task will not try to move
to a better numa node frequently, with this code from task_numa_fault():

        /*
         * Retry task to preferred node migration periodically, in case it
         * case it previously failed, or the scheduler moved us.
         */
        if (time_after(jiffies, p->numa_migrate_retry))
                numa_migrate_preferred(p);

Your change will cause numa_migrate_preferred() to do nothing, despite
the fact that task_numa_migrate placed the task in a location that is
known to be sub-optimal.

The only situation in which we do want to settle for a sub-optimal
location, is if that sub-optimal node is part of the numa_group's
active_nodes set.

The normal place where p->numa_preferred_nid gets set should be
task_numa_placement(), not task_numa_migrate(), which only sets
it in the above exceptional circumstance.

> I did wonder if the intent was to set nid as the preferred node. i.e
> 
> -			sched_setnuma(p, env.dst_nid);
> +			sched_setnuma(p, nid);

Yes, that change would be correct.  You did identify a real
bug in task_numa_migrate().

Thank you for going over the code with a fine-tooth comb, trying to
find issues.


* Re: [PATCH v2 2/4] sched:Consider imbalance_pct when comparing loads in numa_has_capacity
  2015-06-16 14:39   ` Rik van Riel
@ 2015-06-22 16:29     ` Srikar Dronamraju
  2015-06-23  1:18       ` Rik van Riel
  2015-06-23  8:10       ` Ingo Molnar
  0 siblings, 2 replies; 23+ messages in thread
From: Srikar Dronamraju @ 2015-06-22 16:29 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Mel Gorman

* Rik van Riel <riel@redhat.com> [2015-06-16 10:39:13]:

> On 06/16/2015 07:56 AM, Srikar Dronamraju wrote:
> > This is consistent with all other load balancing instances where we
> > absorb unfairness upto env->imbalance_pct. Absorbing unfairness upto
> > env->imbalance_pct allows to pull and retain task to their preferred
> > nodes.
> >
> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>
> How does this work with other workloads, eg.
> single instance SPECjbb2005, or two SPECjbb2005
> instances on a four node system?
>
> Is the load still balanced evenly between nodes
> with this patch?
>

Yes, I have looked at mpstat logs while running SPECjbb2005 with 1 JVM
per system, 2 JVMs per system, and 4 JVMs per system, and observed that
the load spreading was similar with and without this patch.

I have also visualized, using htop, 0.5X CPU stress workloads (i.e. 48
threads on a 96-CPU system) and seen that the spread is similar before
and after the patch.

Please let me know if there are any better ways to observe the spread.
In a slightly loaded or less loaded system, the chance of migrating
threads to their home node by way of calling migrate_task_to() and
migrate_swap() might be curtailed without this patch, i.e. 2 processes,
each having N/2 threads, may converge more slowly without this change.

--
Thanks and Regards
Srikar Dronamraju



* Re: [PATCH v2 2/4] sched:Consider imbalance_pct when comparing loads in numa_has_capacity
  2015-06-22 16:29     ` Srikar Dronamraju
@ 2015-06-23  1:18       ` Rik van Riel
  2015-06-23  8:10       ` Ingo Molnar
  1 sibling, 0 replies; 23+ messages in thread
From: Rik van Riel @ 2015-06-23  1:18 UTC (permalink / raw)
  To: Srikar Dronamraju; +Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Mel Gorman

On 06/22/2015 12:29 PM, Srikar Dronamraju wrote:
> * Rik van Riel <riel@redhat.com> [2015-06-16 10:39:13]:
> 
>> On 06/16/2015 07:56 AM, Srikar Dronamraju wrote:
>>> This is consistent with all other load balancing instances where we
>>> absorb unfairness upto env->imbalance_pct. Absorbing unfairness upto
>>> env->imbalance_pct allows to pull and retain task to their preferred
>>> nodes.
>>>
>>> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>>
>> How does this work with other workloads, eg.
>> single instance SPECjbb2005, or two SPECjbb2005
>> instances on a four node system?
>>
>> Is the load still balanced evenly between nodes
>> with this patch?
>>
> 
> Yes, I have looked at mpstat logs while running SPECjbb2005 for 1JVMper
> System, 2 JVMs per System and 4 JVMs per System and observed that the
> load spreading was similar with and without this patch.
> 
> Also I have visualized using htop when running 0.5X (i.e 48 threads on
> 96 cpu system) cpu stress workloads to see that the spread is similar
> before and after the patch.
> 
> Please let me know if there are any better ways to observe the
> spread. In a slightly loaded or less loaded system, the chance of
> migrating threads to their home node by way of calling migrate_task_to
> and migrate_swap might be curtailed without this patch. i.e 2 process
> each having N/2 threads may converge slower without this change.

Awesome.  Feel free to put my Acked-by: on this patch.

Acked-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed


* Re: [PATCH v2 2/4] sched:Consider imbalance_pct when comparing loads in numa_has_capacity
  2015-06-22 16:29     ` Srikar Dronamraju
  2015-06-23  1:18       ` Rik van Riel
@ 2015-06-23  8:10       ` Ingo Molnar
  2015-06-23 13:01         ` Srikar Dronamraju
  1 sibling, 1 reply; 23+ messages in thread
From: Ingo Molnar @ 2015-06-23  8:10 UTC (permalink / raw)
  To: Srikar Dronamraju; +Cc: Rik van Riel, Peter Zijlstra, linux-kernel, Mel Gorman


* Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:

> * Rik van Riel <riel@redhat.com> [2015-06-16 10:39:13]:
> 
> > On 06/16/2015 07:56 AM, Srikar Dronamraju wrote:
> > > This is consistent with all other load balancing instances where we
> > > absorb unfairness upto env->imbalance_pct. Absorbing unfairness upto
> > > env->imbalance_pct allows to pull and retain task to their preferred
> > > nodes.
> > >
> > > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> >
> > How does this work with other workloads, eg.
> > single instance SPECjbb2005, or two SPECjbb2005
> > instances on a four node system?
> >
> > Is the load still balanced evenly between nodes
> > with this patch?
> >
> 
> Yes, I have looked at mpstat logs while running SPECjbb2005 for 1JVMper
> System, 2 JVMs per System and 4 JVMs per System and observed that the
> load spreading was similar with and without this patch.
> 
> Also I have visualized using htop when running 0.5X (i.e 48 threads on
> 96 cpu system) cpu stress workloads to see that the spread is similar
> before and after the patch.
> 
> Please let me know if there are any better ways to observe the
> spread. [...]

There are. I see you are using prehistoric tooling, but see the various NUMA 
convergence latency measurement utilities in 'perf bench numa':

triton:~/tip> perf bench numa mem -h
# Running 'numa/mem' benchmark:

 # Running main, "perf bench numa numa-mem -h"

 usage: perf bench numa <options>

    -p, --nr_proc <n>     number of processes
    -t, --nr_threads <n>  number of threads per process
    -G, --mb_global <MB>  global  memory (MBs)
    -P, --mb_proc <MB>    process memory (MBs)
    -L, --mb_proc_locked <MB>
                          process serialized/locked memory access (MBs), <= process_memory
    -T, --mb_thread <MB>  thread  memory (MBs)
    -l, --nr_loops <n>    max number of loops to run
    -s, --nr_secs <n>     max number of seconds to run
    -u, --usleep <n>      usecs to sleep per loop iteration
    -R, --data_reads      access the data via writes (can be mixed with -W)
    -W, --data_writes     access the data via writes (can be mixed with -R)
    -B, --data_backwards  access the data backwards as well
    -Z, --data_zero_memset
                          access the data via glibc bzero only
    -r, --data_rand_walk  access the data with random (32bit LFSR) walk
    -z, --init_zero       bzero the initial allocations
    -I, --init_random     randomize the contents of the initial allocations
    -0, --init_cpu0       do the initial allocations on CPU#0
    -x, --perturb_secs <n>
                          perturb thread 0/0 every X secs, to test convergence stability
    -d, --show_details    Show details
    -a, --all             Run all tests in the suite
    -H, --thp <n>         MADV_NOHUGEPAGE < 0 < MADV_HUGEPAGE
    -c, --show_convergence
                          show convergence details
    -m, --measure_convergence
                          measure convergence latency
    -q, --quiet           quiet mode
    -S, --serialize-startup
                          serialize thread startup
    -C, --cpus <cpu[,cpu2,...cpuN]>
                          bind the first N tasks to these specific cpus (the rest is unbound)
    -M, --memnodes <node[,node2,...nodeN]>
                          bind the first N tasks to these specific memory nodes (the rest is unbound)

'-m' will measure convergence.
'-c' will visualize it.
'--thp' can be used to turn hugepages on/off

For example you can create a 'numa02' work-alike by doing:

  vega:~> cat numa02
  #!/bin/bash

  perf bench numa mem --no-data_rand_walk -p 1 -t 32 -G 0 -P 0 -T 32 -l 800 -zZ0c $@

this perf bench numa command mimics numa02 pretty exactly on a 32 CPU system.

This will run it in a loop:

  vega:~> cat numa02-loop 

  while :; do
    ./numa02 2>&1 | grep runtime-max/thread
    sleep 1
  done

Or here are various numa01 work-alikes:

  vega:~> cat numa01
  perf bench numa mem --no-data_rand_walk -p 2 -t 16 -G 0 -P 3072 -T 0 -l 50 -zZ0c $@

  vega:~> cat numa01-hard-bind
  ./numa01 --cpus=0-16_16x16#16 --memnodes=0x16,2x16

or numa01-thread-alloc:

  vega:~> cat numa01-THREAD_ALLOC

  perf bench numa mem --no-data_rand_walk -p 2 -t 16 -G 0 -P 0 -T 192 -l 1000 -zZ0c $@

You can generate very flexible setups of NUMA access patterns, and measure their 
behavior accurately.

It's all so much more capable and more flexible than autonumabench ...

Also, when you are trying to report numbers for multiple runs, please use 
something like:

   perf stat --null --repeat 3 ...

This will run the workload 3 times (doing only time measurement) and report the 
stddev in a human readable form.

Thanks,

	Ingo


* Re: [PATCH v2 2/4] sched:Consider imbalance_pct when comparing loads in numa_has_capacity
  2015-06-23  8:10       ` Ingo Molnar
@ 2015-06-23 13:01         ` Srikar Dronamraju
  2015-06-23 14:36           ` Ingo Molnar
  0 siblings, 1 reply; 23+ messages in thread
From: Srikar Dronamraju @ 2015-06-23 13:01 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Rik van Riel, Peter Zijlstra, linux-kernel, Mel Gorman

* Ingo Molnar <mingo@kernel.org> [2015-06-23 10:10:39]:
> > Please let me know if there are any better ways to observe the
> > spread. [...]
> 
> There are. I see you are using prehistoric tooling, but see the various NUMA 
> convergence latency measurement utilities in 'perf bench numa':
> 
>   vega:~> cat numa01-THREAD_ALLOC
> 
>   perf bench numa mem --no-data_rand_walk -p 2 -t 16 -G 0 -P 0 -T 192 -l 1000 -zZ0c $@
> 
> You can generate very flexible setups of NUMA access patterns, and measure their 
> behavior accurately.
> 
> It's all so much more capable and more flexible than autonumabench ...

Okay, thanks for the hint, I will try this out in future.

> 
> Also, when you are trying to report numbers for multiple runs, please use 
> something like:
> 
>    perf stat --null --repeat 3 ...
> 
> This will run the workload 3 times (doing only time measurement) and report the 
> stddev in a human readable form.
> 

Thanks again for this hint. Wouldn't system time / user time also matter?
I think Mel once pointed out that it is important to make sure that
system time and user time don't increase when elapsed time decreases, but
I can't find that email now.

-- 
Thanks and Regards
Srikar Dronamraju



* Re: [PATCH v2 2/4] sched:Consider imbalance_pct when comparing loads in numa_has_capacity
  2015-06-23 13:01         ` Srikar Dronamraju
@ 2015-06-23 14:36           ` Ingo Molnar
  0 siblings, 0 replies; 23+ messages in thread
From: Ingo Molnar @ 2015-06-23 14:36 UTC (permalink / raw)
  To: Srikar Dronamraju, Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Rik van Riel, Peter Zijlstra, linux-kernel, Mel Gorman


* Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:

> * Ingo Molnar <mingo@kernel.org> [2015-06-23 10:10:39]:
> > > Please let me know if there are any better ways to observe the
> > > spread. [...]
> > 
> > There are. I see you are using prehistoric tooling, but see the various NUMA 
> > convergence latency measurement utilities in 'perf bench numa':
> > 
> >   vega:~> cat numa01-THREAD_ALLOC
> > 
> >   perf bench numa mem --no-data_rand_walk -p 2 -t 16 -G 0 -P 0 -T 192 -l 1000 -zZ0c $@
> > 
> > You can generate very flexible setups of NUMA access patterns, and measure their 
> > behavior accurately.
> > 
> > It's all so much more capable and more flexible than autonumabench ...
> 
> Okay, thanks for the hint, I will try this out in future.
> 
> > 
> > Also, when you are trying to report numbers for multiple runs, please use 
> > something like:
> > 
> >    perf stat --null --repeat 3 ...
> > 
> > This will run the workload 3 times (doing only time measurement) and report the 
> > stddev in a human readable form.
> > 
> 
> Thanks again for this hint. Wouldnt system time/ user time also matter?

Yeah, would be nice to add stime/utime output to 'perf stat', so that it's an easy 
replacement for /usr/bin/time.

I've Cc:-ed perf folks who might be able to help out.

Thanks,

	Ingo


* [tip:sched/core] sched/numa: Prefer NUMA hotness over cache hotness
  2015-06-16 11:55 ` [PATCH v2 1/4] sched/tip:Prefer numa hotness over cache hotness Srikar Dronamraju
@ 2015-07-06 15:50   ` tip-bot for Srikar Dronamraju
  2015-07-07  0:19     ` Srikar Dronamraju
  2015-07-07  6:49   ` tip-bot for Srikar Dronamraju
  1 sibling, 1 reply; 23+ messages in thread
From: tip-bot for Srikar Dronamraju @ 2015-07-06 15:50 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, riel, linux-kernel, srikar, hpa, tglx, mgorman, peterz,
	efault, mingo

Commit-ID:  8a9e62a238a3033158e0084d8df42ea116d69ce1
Gitweb:     http://git.kernel.org/tip/8a9e62a238a3033158e0084d8df42ea116d69ce1
Author:     Srikar Dronamraju <srikar@linux.vnet.ibm.com>
AuthorDate: Tue, 16 Jun 2015 17:25:59 +0530
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 6 Jul 2015 15:29:55 +0200

sched/numa: Prefer NUMA hotness over cache hotness

The current load balancer may not try to prevent a task from moving
out of a preferred node to a less preferred node. The reasons for this
are:

 - Since sched features NUMA and NUMA_RESIST_LOWER are disabled by
   default, migrate_degrades_locality() always returns false.

 - Even if NUMA_RESIST_LOWER were to be enabled, migrate_degrades_locality()
   never gets called if the task is cache hot.

The above behaviour can mean that tasks move out of their preferred
node, only to be eventually brought back to it by the NUMA balancer
(due to higher NUMA faults).

To avoid the above, this commit merges migrate_degrades_locality() and
migrate_improves_locality(). It also replaces the three sched features
NUMA, NUMA_FAVOUR_HIGHER and NUMA_RESIST_LOWER with a single sched
feature, NUMA.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <efault@gmx.de>
Link: http://lkml.kernel.org/r/1434455762-30857-2-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c     | 89 ++++++++++++++-----------------------------------
 kernel/sched/features.h | 18 +++-------
 2 files changed, 30 insertions(+), 77 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 98b2b96..43ee84f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5670,72 +5670,39 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 
 #ifdef CONFIG_NUMA_BALANCING
 /*
- * Returns true if the destination node is the preferred node.
- * Needs to match fbq_classify_rq(): if there is a runnable task
- * that is not on its preferred node, we should identify it.
+ * Returns 1, if task migration degrades locality
+ * Returns 0, if task migration improves locality i.e migration preferred.
+ * Returns -1, if task migration is not affected by locality.
  */
-static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
+static int migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 {
 	struct numa_group *numa_group = rcu_dereference(p->numa_group);
 	unsigned long src_faults, dst_faults;
 	int src_nid, dst_nid;
 
-	if (!sched_feat(NUMA) || !sched_feat(NUMA_FAVOUR_HIGHER) ||
-	    !p->numa_faults || !(env->sd->flags & SD_NUMA)) {
-		return false;
-	}
-
-	src_nid = cpu_to_node(env->src_cpu);
-	dst_nid = cpu_to_node(env->dst_cpu);
-
-	if (src_nid == dst_nid)
-		return false;
-
-	/* Encourage migration to the preferred node. */
-	if (dst_nid == p->numa_preferred_nid)
-		return true;
-
-	/* Migrating away from the preferred node is bad. */
-	if (src_nid == p->numa_preferred_nid)
-		return false;
-
-	if (numa_group) {
-		src_faults = group_faults(p, src_nid);
-		dst_faults = group_faults(p, dst_nid);
-	} else {
-		src_faults = task_faults(p, src_nid);
-		dst_faults = task_faults(p, dst_nid);
-	}
-
-	return dst_faults > src_faults;
-}
-
-
-static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
-{
-	struct numa_group *numa_group = rcu_dereference(p->numa_group);
-	unsigned long src_faults, dst_faults;
-	int src_nid, dst_nid;
-
-	if (!sched_feat(NUMA) || !sched_feat(NUMA_RESIST_LOWER))
-		return false;
-
 	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
-		return false;
+		return -1;
+
+	if (!sched_feat(NUMA))
+		return -1;
 
 	src_nid = cpu_to_node(env->src_cpu);
 	dst_nid = cpu_to_node(env->dst_cpu);
 
 	if (src_nid == dst_nid)
-		return false;
+		return -1;
 
-	/* Migrating away from the preferred node is bad. */
-	if (src_nid == p->numa_preferred_nid)
-		return true;
+	/* Migrating away from the preferred node is always bad. */
+	if (src_nid == p->numa_preferred_nid) {
+		if (env->src_rq->nr_running > env->src_rq->nr_preferred_running)
+			return 1;
+		else
+			return -1;
+	}
 
 	/* Encourage migration to the preferred node. */
 	if (dst_nid == p->numa_preferred_nid)
-		return false;
+		return 0;
 
 	if (numa_group) {
 		src_faults = group_faults(p, src_nid);
@@ -5749,16 +5716,10 @@ static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 }
 
 #else
-static inline bool migrate_improves_locality(struct task_struct *p,
+static inline int migrate_degrades_locality(struct task_struct *p,
 					     struct lb_env *env)
 {
-	return false;
-}
-
-static inline bool migrate_degrades_locality(struct task_struct *p,
-					     struct lb_env *env)
-{
-	return false;
+	return -1;
 }
 #endif
 
@@ -5768,7 +5729,7 @@ static inline bool migrate_degrades_locality(struct task_struct *p,
 static
 int can_migrate_task(struct task_struct *p, struct lb_env *env)
 {
-	int tsk_cache_hot = 0;
+	int tsk_cache_hot;
 
 	lockdep_assert_held(&env->src_rq->lock);
 
@@ -5826,13 +5787,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * 2) task is cache cold, or
 	 * 3) too many balance attempts have failed.
 	 */
-	tsk_cache_hot = task_hot(p, env);
-	if (!tsk_cache_hot)
-		tsk_cache_hot = migrate_degrades_locality(p, env);
+	tsk_cache_hot = migrate_degrades_locality(p, env);
+	if (tsk_cache_hot == -1)
+		tsk_cache_hot = task_hot(p, env);
 
-	if (migrate_improves_locality(p, env) || !tsk_cache_hot ||
+	if (tsk_cache_hot <= 0 ||
 	    env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
-		if (tsk_cache_hot) {
+		if (tsk_cache_hot == 1) {
 			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
 			schedstat_inc(p, se.statistics.nr_forced_migrations);
 		}
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 91e33cd..83a50e7 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -79,20 +79,12 @@ SCHED_FEAT(LB_MIN, false)
  * numa_balancing=
  */
 #ifdef CONFIG_NUMA_BALANCING
-SCHED_FEAT(NUMA,	false)
 
 /*
- * NUMA_FAVOUR_HIGHER will favor moving tasks towards nodes where a
- * higher number of hinting faults are recorded during active load
- * balancing.
+ * NUMA will favor moving tasks towards nodes where a higher number of
+ * hinting faults are recorded during active load balancing. It will
+ * resist moving tasks towards nodes where a lower number of hinting
+ * faults have been recorded.
  */
-SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
-
-/*
- * NUMA_RESIST_LOWER will resist moving tasks towards nodes where a
- * lower number of hinting faults have been recorded. As this has
- * the potential to prevent a task ever migrating to a new node
- * due to CPU overload it is disabled by default.
- */
-SCHED_FEAT(NUMA_RESIST_LOWER, false)
+SCHED_FEAT(NUMA,	true)
 #endif
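
A stand-alone sketch of the decision flow described above. This is a
user-space model, not kernel code: the helper names, the plain fault
counters and the constants in main() are illustrative assumptions, and
the per-task/per-group fault lookup of the real code is reduced to two
parameters.

#include <stdbool.h>
#include <stdio.h>

/*
 * Model of the merged helper's tri-state contract:
 *   1  -> migration degrades locality
 *   0  -> migration improves locality (migration preferred)
 *  -1  -> locality does not apply; fall back to cache hotness
 */
static int locality_verdict(int src_nid, int dst_nid, int preferred_nid,
                            unsigned int src_nr_running,
                            unsigned int src_nr_preferred_running,
                            unsigned long src_faults, unsigned long dst_faults)
{
        if (src_nid == dst_nid)
                return -1;

        if (src_nid == preferred_nid) {
                /*
                 * Mirror the patch: leaving the preferred node counts as
                 * degrading locality only while the source runqueue has
                 * more runnable tasks than tasks preferring that node.
                 */
                if (src_nr_running > src_nr_preferred_running)
                        return 1;
                return -1;
        }

        if (dst_nid == preferred_nid)
                return 0;

        return dst_faults > src_faults ? 0 : 1;
}

/*
 * How can_migrate_task() consumes the verdict: 0 and -1 allow the move
 * (for -1, cache hotness decides), 1 blocks it until enough balance
 * attempts have failed.
 */
static bool may_migrate(int verdict, bool cache_hot,
                        unsigned int nr_balance_failed,
                        unsigned int cache_nice_tries)
{
        int hot = verdict;

        if (hot == -1)
                hot = cache_hot ? 1 : 0;

        return hot <= 0 || nr_balance_failed > cache_nice_tries;
}

int main(void)
{
        /*
         * A cache-hot task sitting on its preferred node 0, on a runqueue
         * with 4 runnable tasks of which 2 prefer that node, considered
         * for a move to node 1.
         */
        int v = locality_verdict(0, 1, 0, 4, 2, 1000, 200);

        printf("verdict=%d migrate=%d\n", v, may_migrate(v, true, 0, 2));
        return 0;
}

Built with any C compiler this prints "verdict=1 migrate=0": the task
stays on its preferred node until env->sd->nr_balance_failed exceeds
cache_nice_tries.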

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [tip:sched/core] sched/numa: Consider 'imbalance_pct' when comparing loads in numa_has_capacity()
  2015-06-16 11:56 ` [PATCH v2 2/4] sched:Consider imbalance_pct when comparing loads in numa_has_capacity Srikar Dronamraju
  2015-06-16 14:39   ` Rik van Riel
@ 2015-07-06 15:50   ` tip-bot for Srikar Dronamraju
  2015-07-07  6:49   ` tip-bot for Srikar Dronamraju
  2 siblings, 0 replies; 23+ messages in thread
From: tip-bot for Srikar Dronamraju @ 2015-07-06 15:50 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: riel, tglx, srikar, mingo, hpa, torvalds, peterz, linux-kernel, mgorman

Commit-ID:  2d9f7144b84aac8be63e1c45cd248a5f7f67ed24
Gitweb:     http://git.kernel.org/tip/2d9f7144b84aac8be63e1c45cd248a5f7f67ed24
Author:     Srikar Dronamraju <srikar@linux.vnet.ibm.com>
AuthorDate: Tue, 16 Jun 2015 17:26:00 +0530
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 6 Jul 2015 15:29:55 +0200

sched/numa: Consider 'imbalance_pct' when comparing loads in numa_has_capacity()

This is consistent with all other load-balancing instances, where we
absorb unfairness up to env->imbalance_pct. Absorbing unfairness up to
env->imbalance_pct allows tasks to be pulled to, and retained on, their
preferred nodes.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1434455762-30857-3-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 43ee84f..a53a610 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1415,8 +1415,9 @@ static bool numa_has_capacity(struct task_numa_env *env)
 	 * --------------------- vs ---------------------
 	 * src->compute_capacity    dst->compute_capacity
 	 */
-	if (src->load * dst->compute_capacity >
-	    dst->load * src->compute_capacity)
+	if (src->load * dst->compute_capacity * env->imbalance_pct >
+
+	    dst->load * src->compute_capacity * 100)
 		return true;
 
 	return false;
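
A worked example of the relaxed comparison above, as a minimal
user-space model. The value 125 used for imbalance_pct is only an
assumption for illustration; in the kernel the value comes from the
sched_domain.

#include <stdbool.h>
#include <stdio.h>

/* Capacity-normalized load comparison with the imbalance_pct slack. */
static bool has_capacity(unsigned long src_load, unsigned long src_cap,
                         unsigned long dst_load, unsigned long dst_cap,
                         unsigned int imbalance_pct)
{
        return src_load * dst_cap * imbalance_pct >
               dst_load * src_cap * 100;
}

int main(void)
{
        /*
         * Equal capacities; the source carries 900 units of load, the
         * destination 1000.  The strict check (pct == 100) says no, the
         * relaxed check (pct == 125) absorbs the ~10% unfairness.
         */
        printf("strict:  %d\n", has_capacity(900, 1024, 1000, 1024, 100));
        printf("relaxed: %d\n", has_capacity(900, 1024, 1000, 1024, 125));
        return 0;
}

With equal capacities the strict form refuses whenever the source is
even slightly less loaded than the destination, while the relaxed form
tolerates the source being up to about 20% (100/125) less loaded, which
is what lets a preferred node pull and retain a task despite a small
load imbalance.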

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [tip:sched/core] sched/numa:  Prefer NUMA hotness over cache hotness
  2015-07-06 15:50   ` [tip:sched/core] sched/numa: Prefer NUMA " tip-bot for Srikar Dronamraju
@ 2015-07-07  0:19     ` Srikar Dronamraju
  2015-07-08 13:31       ` Srikar Dronamraju
  0 siblings, 1 reply; 23+ messages in thread
From: Srikar Dronamraju @ 2015-07-07  0:19 UTC (permalink / raw)
  To: torvalds, riel, linux-kernel, tglx, hpa, mgorman, peterz, efault, mingo

* tip-bot for Srikar Dronamraju <tipbot@zytor.com> [2015-07-06 08:50:28]:

> Commit-ID:  8a9e62a238a3033158e0084d8df42ea116d69ce1
> Gitweb:     http://git.kernel.org/tip/8a9e62a238a3033158e0084d8df42ea116d69ce1
> Author:     Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> AuthorDate: Tue, 16 Jun 2015 17:25:59 +0530
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Mon, 6 Jul 2015 15:29:55 +0200
>
> sched/numa: Prefer NUMA hotness over cache hotness

In the above commit, I missed the fact that the sched feature NUMA was
used to enable/disable NUMA_BALANCING. The version of the patch below
takes care of this. While I am posting the fixed version, it would
need a revert of the above commit. Please let me know if you would
prefer just the differential patch that applies on top of the above
commit.

---------->8--------------------------------------------------------------8<---------------

>From 6dd3a7253c42665393f900fda4e6b4193e8114a3 Mon Sep 17 00:00:00 2001
From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Date: Wed, 3 Jun 2015 02:57:26 +0530
Subject: [PATCH] sched/tip:Prefer numa hotness over cache hotness

The current load balancer may not try to prevent a task from moving out
of a preferred node to a less preferred node. The reason for this being:

- Since sched features NUMA and NUMA_RESIST_LOWER are disabled by
  default, migrate_degrades_locality() always returns false.

- Even if NUMA_RESIST_LOWER were to be enabled, if the task is cache
  hot, migrate_degrades_locality() never gets called.

The above behaviour can mean that tasks move out of their preferred
node, only to be eventually brought back to their preferred node by the
NUMA balancer (due to higher NUMA faults).

To avoid the above, this commit merges migrate_degrades_locality() and
migrate_improves_locality(). It combines the two sched features
NUMA_FAVOUR_HIGHER and NUMA_RESIST_LOWER into a single sched feature
NUMA_FAVOUR_HIGHER.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changes from previous version:
- Rebased to tip.

Added Ack from Rik based on
http://lkml.kernel.org/r/557845D5.6060800@redhat.com.
Please let me know if that is not OK.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7210ae8..8a8ce95 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5662,72 +5662,40 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 
 #ifdef CONFIG_NUMA_BALANCING
 /*
- * Returns true if the destination node is the preferred node.
- * Needs to match fbq_classify_rq(): if there is a runnable task
- * that is not on its preferred node, we should identify it.
+ * Returns 1, if task migration degrades locality
+ * Returns 0, if task migration improves locality i.e migration preferred.
+ * Returns -1, if task migration is not affected by locality.
  */
-static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
-{
-	struct numa_group *numa_group = rcu_dereference(p->numa_group);
-	unsigned long src_faults, dst_faults;
-	int src_nid, dst_nid;
-
-	if (!sched_feat(NUMA_FAVOUR_HIGHER) || !p->numa_faults ||
-	    !(env->sd->flags & SD_NUMA)) {
-		return false;
-	}
-
-	src_nid = cpu_to_node(env->src_cpu);
-	dst_nid = cpu_to_node(env->dst_cpu);
-
-	if (src_nid == dst_nid)
-		return false;
-
-	/* Encourage migration to the preferred node. */
-	if (dst_nid == p->numa_preferred_nid)
-		return true;
-
-	/* Migrating away from the preferred node is bad. */
-	if (src_nid == p->numa_preferred_nid)
-		return false;
-
-	if (numa_group) {
-		src_faults = group_faults(p, src_nid);
-		dst_faults = group_faults(p, dst_nid);
-	} else {
-		src_faults = task_faults(p, src_nid);
-		dst_faults = task_faults(p, dst_nid);
-	}
-
-	return dst_faults > src_faults;
-}
-
 
-static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
+static int migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 {
 	struct numa_group *numa_group = rcu_dereference(p->numa_group);
 	unsigned long src_faults, dst_faults;
 	int src_nid, dst_nid;
 
-	if (!sched_feat(NUMA) || !sched_feat(NUMA_RESIST_LOWER))
-		return false;
+	if (!sched_feat(NUMA) || !sched_feat(NUMA_FAVOUR_HIGHER))
+		return -1;
 
 	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
-		return false;
+		return -1;
 
 	src_nid = cpu_to_node(env->src_cpu);
 	dst_nid = cpu_to_node(env->dst_cpu);
 
 	if (src_nid == dst_nid)
-		return false;
+		return -1;
 
-	/* Migrating away from the preferred node is bad. */
-	if (src_nid == p->numa_preferred_nid)
-		return true;
+	/* Migrating away from the preferred node is always bad. */
+	if (src_nid == p->numa_preferred_nid) {
+		if (env->src_rq->nr_running > env->src_rq->nr_preferred_running)
+			return 1;
+		else
+			return -1;
+	}
 
 	/* Encourage migration to the preferred node. */
 	if (dst_nid == p->numa_preferred_nid)
-		return false;
+		return 0;
 
 	if (numa_group) {
 		src_faults = group_faults(p, src_nid);
@@ -5741,16 +5709,10 @@ static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 }
 
 #else
-static inline bool migrate_improves_locality(struct task_struct *p,
+static inline int migrate_degrades_locality(struct task_struct *p,
 					     struct lb_env *env)
 {
-	return false;
-}
-
-static inline bool migrate_degrades_locality(struct task_struct *p,
-					     struct lb_env *env)
-{
-	return false;
+	return -1;
 }
 #endif
 
@@ -5760,7 +5722,7 @@ static inline bool migrate_degrades_locality(struct task_struct *p,
 static
 int can_migrate_task(struct task_struct *p, struct lb_env *env)
 {
-	int tsk_cache_hot = 0;
+	int tsk_cache_hot;
 
 	lockdep_assert_held(&env->src_rq->lock);
 
@@ -5818,13 +5780,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * 2) task is cache cold, or
 	 * 3) too many balance attempts have failed.
 	 */
-	tsk_cache_hot = task_hot(p, env);
-	if (!tsk_cache_hot)
-		tsk_cache_hot = migrate_degrades_locality(p, env);
+	tsk_cache_hot = migrate_degrades_locality(p, env);
+	if (tsk_cache_hot == -1)
+		tsk_cache_hot = task_hot(p, env);
 
-	if (migrate_improves_locality(p, env) || !tsk_cache_hot ||
+	if (tsk_cache_hot <= 0 ||
 	    env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
-		if (tsk_cache_hot) {
+		if (tsk_cache_hot == 1) {
 			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
 			schedstat_inc(p, se.statistics.nr_forced_migrations);
 		}
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 91e33cd..d4d4726 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -84,15 +84,8 @@ SCHED_FEAT(NUMA,	false)
 /*
  * NUMA_FAVOUR_HIGHER will favor moving tasks towards nodes where a
  * higher number of hinting faults are recorded during active load
- * balancing.
+ * balancing. It will resist moving tasks towards nodes where a lower
+ * number of hinting faults have been recorded.
  */
 SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
-
-/*
- * NUMA_RESIST_LOWER will resist moving tasks towards nodes where a
- * lower number of hinting faults have been recorded. As this has
- * the potential to prevent a task ever migrating to a new node
- * due to CPU overload it is disabled by default.
- */
-SCHED_FEAT(NUMA_RESIST_LOWER, false)
 #endif


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [tip:sched/core] sched/numa: Prefer NUMA hotness over cache hotness
  2015-06-16 11:55 ` [PATCH v2 1/4] sched/tip:Prefer numa hotness over cache hotness Srikar Dronamraju
  2015-07-06 15:50   ` [tip:sched/core] sched/numa: Prefer NUMA " tip-bot for Srikar Dronamraju
@ 2015-07-07  6:49   ` tip-bot for Srikar Dronamraju
  1 sibling, 0 replies; 23+ messages in thread
From: tip-bot for Srikar Dronamraju @ 2015-07-07  6:49 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, torvalds, mgorman, efault, mingo, tglx, linux-kernel,
	srikar, hpa, riel

Commit-ID:  2a1ed24ce94036d00a7c5d5e99a77a80f0aa556a
Gitweb:     http://git.kernel.org/tip/2a1ed24ce94036d00a7c5d5e99a77a80f0aa556a
Author:     Srikar Dronamraju <srikar@linux.vnet.ibm.com>
AuthorDate: Tue, 16 Jun 2015 17:25:59 +0530
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 7 Jul 2015 08:46:10 +0200

sched/numa: Prefer NUMA hotness over cache hotness

The current load balancer may not try to prevent a task from moving
out of a preferred node to a less preferred node. The reason for this
being:

 - Since sched features NUMA and NUMA_RESIST_LOWER are disabled by
   default, migrate_degrades_locality() always returns false.

 - Even if NUMA_RESIST_LOWER were to be enabled, if the task is cache
   hot, migrate_degrades_locality() never gets called.

The above behaviour can mean that tasks move out of their preferred
node, only to be eventually brought back to their preferred node by
the NUMA balancer (due to higher NUMA faults).

To avoid the above, this commit merges migrate_degrades_locality() and
migrate_improves_locality(). It also replaces the three sched features
NUMA, NUMA_FAVOUR_HIGHER and NUMA_RESIST_LOWER with a single sched
feature NUMA.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <efault@gmx.de>
Link: http://lkml.kernel.org/r/1434455762-30857-2-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c     | 89 ++++++++++++++-----------------------------------
 kernel/sched/features.h | 18 +++-------
 2 files changed, 30 insertions(+), 77 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 98b2b96..43ee84f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5670,72 +5670,39 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 
 #ifdef CONFIG_NUMA_BALANCING
 /*
- * Returns true if the destination node is the preferred node.
- * Needs to match fbq_classify_rq(): if there is a runnable task
- * that is not on its preferred node, we should identify it.
+ * Returns 1, if task migration degrades locality
+ * Returns 0, if task migration improves locality i.e migration preferred.
+ * Returns -1, if task migration is not affected by locality.
  */
-static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
+static int migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 {
 	struct numa_group *numa_group = rcu_dereference(p->numa_group);
 	unsigned long src_faults, dst_faults;
 	int src_nid, dst_nid;
 
-	if (!sched_feat(NUMA) || !sched_feat(NUMA_FAVOUR_HIGHER) ||
-	    !p->numa_faults || !(env->sd->flags & SD_NUMA)) {
-		return false;
-	}
-
-	src_nid = cpu_to_node(env->src_cpu);
-	dst_nid = cpu_to_node(env->dst_cpu);
-
-	if (src_nid == dst_nid)
-		return false;
-
-	/* Encourage migration to the preferred node. */
-	if (dst_nid == p->numa_preferred_nid)
-		return true;
-
-	/* Migrating away from the preferred node is bad. */
-	if (src_nid == p->numa_preferred_nid)
-		return false;
-
-	if (numa_group) {
-		src_faults = group_faults(p, src_nid);
-		dst_faults = group_faults(p, dst_nid);
-	} else {
-		src_faults = task_faults(p, src_nid);
-		dst_faults = task_faults(p, dst_nid);
-	}
-
-	return dst_faults > src_faults;
-}
-
-
-static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
-{
-	struct numa_group *numa_group = rcu_dereference(p->numa_group);
-	unsigned long src_faults, dst_faults;
-	int src_nid, dst_nid;
-
-	if (!sched_feat(NUMA) || !sched_feat(NUMA_RESIST_LOWER))
-		return false;
-
 	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
-		return false;
+		return -1;
+
+	if (!sched_feat(NUMA))
+		return -1;
 
 	src_nid = cpu_to_node(env->src_cpu);
 	dst_nid = cpu_to_node(env->dst_cpu);
 
 	if (src_nid == dst_nid)
-		return false;
+		return -1;
 
-	/* Migrating away from the preferred node is bad. */
-	if (src_nid == p->numa_preferred_nid)
-		return true;
+	/* Migrating away from the preferred node is always bad. */
+	if (src_nid == p->numa_preferred_nid) {
+		if (env->src_rq->nr_running > env->src_rq->nr_preferred_running)
+			return 1;
+		else
+			return -1;
+	}
 
 	/* Encourage migration to the preferred node. */
 	if (dst_nid == p->numa_preferred_nid)
-		return false;
+		return 0;
 
 	if (numa_group) {
 		src_faults = group_faults(p, src_nid);
@@ -5749,16 +5716,10 @@ static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 }
 
 #else
-static inline bool migrate_improves_locality(struct task_struct *p,
+static inline int migrate_degrades_locality(struct task_struct *p,
 					     struct lb_env *env)
 {
-	return false;
-}
-
-static inline bool migrate_degrades_locality(struct task_struct *p,
-					     struct lb_env *env)
-{
-	return false;
+	return -1;
 }
 #endif
 
@@ -5768,7 +5729,7 @@ static inline bool migrate_degrades_locality(struct task_struct *p,
 static
 int can_migrate_task(struct task_struct *p, struct lb_env *env)
 {
-	int tsk_cache_hot = 0;
+	int tsk_cache_hot;
 
 	lockdep_assert_held(&env->src_rq->lock);
 
@@ -5826,13 +5787,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * 2) task is cache cold, or
 	 * 3) too many balance attempts have failed.
 	 */
-	tsk_cache_hot = task_hot(p, env);
-	if (!tsk_cache_hot)
-		tsk_cache_hot = migrate_degrades_locality(p, env);
+	tsk_cache_hot = migrate_degrades_locality(p, env);
+	if (tsk_cache_hot == -1)
+		tsk_cache_hot = task_hot(p, env);
 
-	if (migrate_improves_locality(p, env) || !tsk_cache_hot ||
+	if (tsk_cache_hot <= 0 ||
 	    env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
-		if (tsk_cache_hot) {
+		if (tsk_cache_hot == 1) {
 			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
 			schedstat_inc(p, se.statistics.nr_forced_migrations);
 		}
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 91e33cd..83a50e7 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -79,20 +79,12 @@ SCHED_FEAT(LB_MIN, false)
  * numa_balancing=
  */
 #ifdef CONFIG_NUMA_BALANCING
-SCHED_FEAT(NUMA,	false)
 
 /*
- * NUMA_FAVOUR_HIGHER will favor moving tasks towards nodes where a
- * higher number of hinting faults are recorded during active load
- * balancing.
+ * NUMA will favor moving tasks towards nodes where a higher number of
+ * hinting faults are recorded during active load balancing. It will
+ * resist moving tasks towards nodes where a lower number of hinting
+ * faults have been recorded.
  */
-SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
-
-/*
- * NUMA_RESIST_LOWER will resist moving tasks towards nodes where a
- * lower number of hinting faults have been recorded. As this has
- * the potential to prevent a task ever migrating to a new node
- * due to CPU overload it is disabled by default.
- */
-SCHED_FEAT(NUMA_RESIST_LOWER, false)
+SCHED_FEAT(NUMA,	true)
 #endif

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [tip:sched/core] sched/numa: Consider 'imbalance_pct' when comparing loads in numa_has_capacity()
  2015-06-16 11:56 ` [PATCH v2 2/4] sched:Consider imbalance_pct when comparing loads in numa_has_capacity Srikar Dronamraju
  2015-06-16 14:39   ` Rik van Riel
  2015-07-06 15:50   ` [tip:sched/core] sched/numa: Consider 'imbalance_pct' when comparing loads in numa_has_capacity() tip-bot for Srikar Dronamraju
@ 2015-07-07  6:49   ` tip-bot for Srikar Dronamraju
  2 siblings, 0 replies; 23+ messages in thread
From: tip-bot for Srikar Dronamraju @ 2015-07-07  6:49 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, hpa, linux-kernel, srikar, tglx, mingo, peterz, mgorman, riel

Commit-ID:  44dcb04f0ea8eaac3b9c9d3172416efc5a950214
Gitweb:     http://git.kernel.org/tip/44dcb04f0ea8eaac3b9c9d3172416efc5a950214
Author:     Srikar Dronamraju <srikar@linux.vnet.ibm.com>
AuthorDate: Tue, 16 Jun 2015 17:26:00 +0530
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 7 Jul 2015 08:46:10 +0200

sched/numa: Consider 'imbalance_pct' when comparing loads in numa_has_capacity()

This is consistent with all other load-balancing instances, where we
absorb unfairness up to env->imbalance_pct. Absorbing unfairness up to
env->imbalance_pct allows tasks to be pulled to, and retained on, their
preferred nodes.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1434455762-30857-3-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 43ee84f..a53a610 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1415,8 +1415,9 @@ static bool numa_has_capacity(struct task_numa_env *env)
 	 * --------------------- vs ---------------------
 	 * src->compute_capacity    dst->compute_capacity
 	 */
-	if (src->load * dst->compute_capacity >
-	    dst->load * src->compute_capacity)
+	if (src->load * dst->compute_capacity * env->imbalance_pct >
+
+	    dst->load * src->compute_capacity * 100)
 		return true;
 
 	return false;

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [tip:sched/core] sched/numa:  Prefer NUMA hotness over cache hotness
  2015-07-07  0:19     ` Srikar Dronamraju
@ 2015-07-08 13:31       ` Srikar Dronamraju
  0 siblings, 0 replies; 23+ messages in thread
From: Srikar Dronamraju @ 2015-07-08 13:31 UTC (permalink / raw)
  To: torvalds, riel, linux-kernel, tglx, hpa, mgorman, peterz, efault, mingo

* Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2015-07-07 05:49:31]:

> * tip-bot for Srikar Dronamraju <tipbot@zytor.com> [2015-07-06 08:50:28]:
> 
> > Commit-ID:  8a9e62a238a3033158e0084d8df42ea116d69ce1
> > Gitweb:     http://git.kernel.org/tip/8a9e62a238a3033158e0084d8df42ea116d69ce1
> > Author:     Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > AuthorDate: Tue, 16 Jun 2015 17:25:59 +0530
> > Committer:  Ingo Molnar <mingo@kernel.org>
> > CommitDate: Mon, 6 Jul 2015 15:29:55 +0200
> >
> > sched/numa: Prefer NUMA hotness over cache hotness
> 
> In the above commit, I missed the fact that the sched feature NUMA was
> used to enable/disable NUMA_BALANCING. The version of the patch below
> takes care of this. While I am posting the fixed version, it would
> need a revert of the above commit. Please let me know if you would
> prefer just the differential patch that applies on top of the above
> commit.

Posted the differential patch just in case you are looking for it.
http://mid.gmane.org/1436361633-4970-1-git-send-email-srikar@linux.vnet.ibm.com


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2015-07-08 13:32 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-16 11:55 [PATCH v2 0/4] Improve numa load balancing Srikar Dronamraju
2015-06-16 11:55 ` [PATCH v2 1/4] sched/tip:Prefer numa hotness over cache hotness Srikar Dronamraju
2015-07-06 15:50   ` [tip:sched/core] sched/numa: Prefer NUMA " tip-bot for Srikar Dronamraju
2015-07-07  0:19     ` Srikar Dronamraju
2015-07-08 13:31       ` Srikar Dronamraju
2015-07-07  6:49   ` tip-bot for Srikar Dronamraju
2015-06-16 11:56 ` [PATCH v2 2/4] sched:Consider imbalance_pct when comparing loads in numa_has_capacity Srikar Dronamraju
2015-06-16 14:39   ` Rik van Riel
2015-06-22 16:29     ` Srikar Dronamraju
2015-06-23  1:18       ` Rik van Riel
2015-06-23  8:10       ` Ingo Molnar
2015-06-23 13:01         ` Srikar Dronamraju
2015-06-23 14:36           ` Ingo Molnar
2015-07-06 15:50   ` [tip:sched/core] sched/numa: Consider 'imbalance_pct' when comparing loads in numa_has_capacity() tip-bot for Srikar Dronamraju
2015-07-07  6:49   ` tip-bot for Srikar Dronamraju
2015-06-16 11:56 ` [PATCH v2 3/4] sched:Fix task_numa_migrate to always update preferred node Srikar Dronamraju
2015-06-16 14:54   ` Rik van Riel
2015-06-16 17:19     ` Srikar Dronamraju
2015-06-16 18:25       ` Rik van Riel
2015-06-16 17:18   ` Rik van Riel
2015-06-16 11:56 ` [PATCH v2 4/4] sched:Use correct nid while evaluating task weights Srikar Dronamraju
2015-06-16 15:00   ` Rik van Riel
2015-06-16 17:26     ` Srikar Dronamraju
