* [PATCH] sched,numa: document and fix numa_preferred_nid setting
@ 2015-06-16 19:54 Rik van Riel
  2015-06-18 15:55 ` Srikar Dronamraju
  2015-06-22 16:13 ` Srikar Dronamraju
  0 siblings, 2 replies; 14+ messages in thread
From: Rik van Riel @ 2015-06-16 19:54 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, srikar, mingo, mgorman

There are two places where the numa balancing code sets a task's
numa_preferred_nid.

The primary location is task_numa_placement(), where the kernel
examines the NUMA fault statistics to determine the location where
most of the memory that the task (or numa_group) accesses is.

The second location is only used for large workloads, where a
numa_group has enough tasks that the tasks are spread out over
several NUMA nodes, and multiple nodes are in the numa group's
active_nodes mask.

In order to allow those workloads to settle down, we pretend
that any node inside the numa_group's active_nodes mask is the
task's new preferred node. This dissuades task_numa_fault()
from continuously retrying to migrate the task to the group's
preferred node, and allows a multi-node workload to settle down.
This in turn improves locality of private faults inside a numa
group.

Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/fair.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c2980e8733bc..54bb57f09e75 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1485,7 +1485,12 @@ static int task_numa_migrate(struct task_struct *p)
 				groupweight = group_weight(p, env.src_nid, dist);
 			}
 
-			/* Only consider nodes where both task and groups benefit */
+			/*
+			 * Only consider nodes where placement is better for
+			 * either the group (help large workloads converge),
+			 * or the task (placement of tasks within a numa group,
+			 * and single threaded processes).
+			 */
 			taskimp = task_weight(p, nid, dist) - taskweight;
 			groupimp = group_weight(p, nid, dist) - groupweight;
 			if (taskimp < 0 && groupimp < 0)
@@ -1499,12 +1504,14 @@ static int task_numa_migrate(struct task_struct *p)
 	}
 
 	/*
-	 * If the task is part of a workload that spans multiple NUMA nodes,
-	 * and is migrating into one of the workload's active nodes, remember
-	 * this node as the task's preferred numa node, so the workload can
-	 * settle down.
-	 * A task that migrated to a second choice node will be better off
-	 * trying for a better one later. Do not set the preferred node here.
+	 * The primary place for setting a task's numa_preferred_nid is in
+	 * task_numa_placement(). If a task is moved to a sub-optimal node,
+	 * leave numa_preferred_nid alone, so task_numa_fault() will retry
+	 * migrating the task to where it really belongs.
+	 * The exception is a task that belongs to a large numa_group, which
+	 * spans multiple NUMA nodes. If that task migrates into one of the
+	 * workload's active nodes, remember that node as the task's
+	 * numa_preferred_nid, so the workload can settle down.
 	 */
 	if (p->numa_group) {
 		if (env.best_cpu == -1)
@@ -1513,7 +1520,7 @@ static int task_numa_migrate(struct task_struct *p)
 			nid = env.dst_nid;
 
 		if (node_isset(nid, p->numa_group->active_nodes))
-			sched_setnuma(p, env.dst_nid);
+			sched_setnuma(p, nid);
 	}
 
 	/* No better CPU than the current one was found. */


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] sched,numa: document and fix numa_preferred_nid setting
  2015-06-16 19:54 [PATCH] sched,numa: document and fix numa_preferred_nid setting Rik van Riel
@ 2015-06-18 15:55 ` Srikar Dronamraju
  2015-06-18 16:06   ` Rik van Riel
  2015-06-18 16:12   ` Ingo Molnar
  2015-06-22 16:13 ` Srikar Dronamraju
  1 sibling, 2 replies; 14+ messages in thread
From: Srikar Dronamraju @ 2015-06-18 15:55 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, peterz, mingo, mgorman

>  	if (p->numa_group) {
>  		if (env.best_cpu == -1)
> @@ -1513,7 +1520,7 @@ static int task_numa_migrate(struct task_struct *p)
>  			nid = env.dst_nid;
> 
>  		if (node_isset(nid, p->numa_group->active_nodes))
> -			sched_setnuma(p, env.dst_nid);
> +			sched_setnuma(p, nid);
>  	}
> 
>  	/* No better CPU than the current one was found. */
> 

Overall this patch does seem to produce better results. However numa02
gets affected -vely.

KernelVersion: 4.1.0-rc7-tip
	Testcase:         Min         Max         Avg      StdDev
  elapsed_numa01:      858.85      949.18      915.64       33.06
  elapsed_numa02:       23.09       29.89       26.43        2.18
	Testcase:         Min         Max         Avg      StdDev
   system_numa01:     1516.72     1855.08     1686.24      113.95
   system_numa02:       63.69       79.06       70.35        5.87
	Testcase:         Min         Max         Avg      StdDev
     user_numa01:    73284.76    80818.21    78060.88     2773.60
     user_numa02:     1690.18     2071.07     1821.64      140.25
	Testcase:         Min         Max         Avg      StdDev
    total_numa01:    74801.50    82572.60    79747.12     2875.61
    total_numa02:     1753.87     2142.77     1891.99      143.59

KernelVersion: 4.1.0-rc7-tip + your patch

	Testcase:         Min         Max         Avg      StdDev     %Change
  elapsed_numa01:      665.26      877.47      776.77       79.23      15.83%
  elapsed_numa02:       24.59       31.30       28.17        2.48      -5.56%
	Testcase:         Min         Max         Avg      StdDev     %Change
   system_numa01:      659.57     1220.99      942.36      234.92      60.92%
   system_numa02:       44.62       86.01       64.64       14.24       6.64%
	Testcase:         Min         Max         Avg      StdDev     %Change
     user_numa01:    56280.95    75908.81    64993.57     7764.30      17.21%
     user_numa02:     1790.35     2155.02     1916.12      132.57      -4.38%
	Testcase:         Min         Max         Avg      StdDev     %Change
    total_numa01:    56940.50    77128.20    65935.92     7993.49      17.91%
    total_numa02:     1834.97     2227.03     1980.76      136.51      -3.99%

-- 
Thanks and Regards
Srikar Dronamraju



* Re: [PATCH] sched,numa: document and fix numa_preferred_nid setting
  2015-06-18 15:55 ` Srikar Dronamraju
@ 2015-06-18 16:06   ` Rik van Riel
  2015-06-18 16:41     ` Srikar Dronamraju
  2015-06-18 16:12   ` Ingo Molnar
  1 sibling, 1 reply; 14+ messages in thread
From: Rik van Riel @ 2015-06-18 16:06 UTC (permalink / raw)
  To: Srikar Dronamraju; +Cc: linux-kernel, peterz, mingo, mgorman

On 06/18/2015 11:55 AM, Srikar Dronamraju wrote:
>>  	if (p->numa_group) {
>>  		if (env.best_cpu == -1)
>> @@ -1513,7 +1520,7 @@ static int task_numa_migrate(struct task_struct *p)
>>  			nid = env.dst_nid;
>>
>>  		if (node_isset(nid, p->numa_group->active_nodes))
>> -			sched_setnuma(p, env.dst_nid);
>> +			sched_setnuma(p, nid);
>>  	}
>>
>>  	/* No better CPU than the current one was found. */
>>
> 
> Overall this patch does seem to produce better results. However numa02
> gets affected -vely.

OK, that is kind of expected.

The way numa02 runs means that we are essentially guaranteed
that, on a two node system, both nodes end up in the numa_group's
active_mask.

What the above change does is slow down migration if a task ends
up in a NUMA node in p->numa_group->active_nodes.

This is necessary if a very large workload has already converged
on a set of NUMA nodes, but it does slow down convergence for such
workloads.

I can't think of any obvious way to both slow down movement once
things have converged, yet keep speedy movement of tasks when they
have not yet converged.

It is worth noting that all the numa01 and numa02 benchmarks
measure is the speed at which the workloads converge. It does not
measure the overhead of making things converge, or how fast an
actual workload runs (NUMA locality benefit, minus NUMA placement
overhead).

> KernelVersion: 4.1.0-rc7-tip
> 	Testcase:         Min         Max         Avg      StdDev
>   elapsed_numa01:      858.85      949.18      915.64       33.06
>   elapsed_numa02:       23.09       29.89       26.43        2.18
> 	Testcase:         Min         Max         Avg      StdDev
>    system_numa01:     1516.72     1855.08     1686.24      113.95
>    system_numa02:       63.69       79.06       70.35        5.87
> 	Testcase:         Min         Max         Avg      StdDev
>      user_numa01:    73284.76    80818.21    78060.88     2773.60
>      user_numa02:     1690.18     2071.07     1821.64      140.25
> 	Testcase:         Min         Max         Avg      StdDev
>     total_numa01:    74801.50    82572.60    79747.12     2875.61
>     total_numa02:     1753.87     2142.77     1891.99      143.59
> 
> KernelVersion: 4.1.0-rc7-tip + your patch
> 
> 	Testcase:         Min         Max         Avg      StdDev     %Change
>   elapsed_numa01:      665.26      877.47      776.77       79.23      15.83%
>   elapsed_numa02:       24.59       31.30       28.17        2.48      -5.56%
> 	Testcase:         Min         Max         Avg      StdDev     %Change
>    system_numa01:      659.57     1220.99      942.36      234.92      60.92%
>    system_numa02:       44.62       86.01       64.64       14.24       6.64%
> 	Testcase:         Min         Max         Avg      StdDev     %Change
>      user_numa01:    56280.95    75908.81    64993.57     7764.30      17.21%
>      user_numa02:     1790.35     2155.02     1916.12      132.57      -4.38%
> 	Testcase:         Min         Max         Avg      StdDev     %Change
>     total_numa01:    56940.50    77128.20    65935.92     7993.49      17.91%
>     total_numa02:     1834.97     2227.03     1980.76      136.51      -3.99%
> 



* Re: [PATCH] sched,numa: document and fix numa_preferred_nid setting
  2015-06-18 15:55 ` Srikar Dronamraju
  2015-06-18 16:06   ` Rik van Riel
@ 2015-06-18 16:12   ` Ingo Molnar
  2015-06-18 18:16     ` Rik van Riel
  1 sibling, 1 reply; 14+ messages in thread
From: Ingo Molnar @ 2015-06-18 16:12 UTC (permalink / raw)
  To: Srikar Dronamraju; +Cc: Rik van Riel, linux-kernel, peterz, mgorman


* Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:

> >  	if (p->numa_group) {
> >  		if (env.best_cpu == -1)
> > @@ -1513,7 +1520,7 @@ static int task_numa_migrate(struct task_struct *p)
> >  			nid = env.dst_nid;
> > 
> >  		if (node_isset(nid, p->numa_group->active_nodes))
> > -			sched_setnuma(p, env.dst_nid);
> > +			sched_setnuma(p, nid);
> >  	}
> > 
> >  	/* No better CPU than the current one was found. */
> > 
> 
> Overall this patch does seem to produce better results. However numa02
> gets affected -vely.

Huh?

numa02 is the more important benchmark of the two. 'numa01' is a conflicting 
workload that is a lot more sensitive to balancing details - while 'numa02' is a 
nicely partitioned workload that should converge as fast as possible.

So if numa02 got worse then it's a bad change.

Thanks,

	Ingo


* Re: [PATCH] sched,numa: document and fix numa_preferred_nid setting
  2015-06-18 16:06   ` Rik van Riel
@ 2015-06-18 16:41     ` Srikar Dronamraju
  2015-06-18 17:00       ` Rik van Riel
  0 siblings, 1 reply; 14+ messages in thread
From: Srikar Dronamraju @ 2015-06-18 16:41 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, peterz, mingo, mgorman

* Rik van Riel <riel@redhat.com> [2015-06-18 12:06:49]:

> >>
> > 
> > Overall this patch does seem to produce better results. However numa02
> > gets affected -vely.
> 
> OK, that is kind of expected.
> 
> The way numa02 runs means that we are essentially guaranteed
> that, on a two node system, both nodes end up in the numa_group's
> active_mask.
> 

Just to add this was on a 4 node machine.

-- 
Thanks and Regards
Srikar Dronamraju



* Re: [PATCH] sched,numa: document and fix numa_preferred_nid setting
  2015-06-18 16:41     ` Srikar Dronamraju
@ 2015-06-18 17:00       ` Rik van Riel
  2015-06-18 17:11         ` Srikar Dronamraju
  2015-06-19 17:16         ` Srikar Dronamraju
  0 siblings, 2 replies; 14+ messages in thread
From: Rik van Riel @ 2015-06-18 17:00 UTC (permalink / raw)
  To: Srikar Dronamraju; +Cc: linux-kernel, peterz, mingo, mgorman

On 06/18/2015 12:41 PM, Srikar Dronamraju wrote:
> * Rik van Riel <riel@redhat.com> [2015-06-18 12:06:49]:
> 
>>>>
>>>
>>> Overall this patch does seem to produce better results. However numa02
>>> gets affected -vely.
>>
>> OK, that is kind of expected.
>>
>> The way numa02 runs means that we are essentially guaranteed
>> that, on a two node system, both nodes end up in the numa_group's
>> active_mask.
>>
> 
> Just to add this was on a 4 node machine.

OK, so we are looking at two multi-threaded processes
on a 4 node system, and waiting for them to converge?

It may make sense to add my patch in with your patch
1/4 from last week, as well as the correct part of
your patch 4/4, and see how they all work together.


* Re: [PATCH] sched,numa: document and fix numa_preferred_nid setting
  2015-06-18 17:00       ` Rik van Riel
@ 2015-06-18 17:11         ` Srikar Dronamraju
  2015-06-19 17:16         ` Srikar Dronamraju
  1 sibling, 0 replies; 14+ messages in thread
From: Srikar Dronamraju @ 2015-06-18 17:11 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, peterz, mingo, mgorman

> 
> OK, so we are looking at two multi-threaded processes
> on a 4 node system, and waiting for them to converge?
> 
> It may make sense to add my patch in with your patch
> 1/4 from last week, as well as the correct part of
> your patch 4/4, and see how they all work together.
> --

Okay, I will do the needful and come back to you.

-- 
Thanks and Regards
Srikar Dronamraju



* Re: [PATCH] sched,numa: document and fix numa_preferred_nid setting
  2015-06-18 16:12   ` Ingo Molnar
@ 2015-06-18 18:16     ` Rik van Riel
  0 siblings, 0 replies; 14+ messages in thread
From: Rik van Riel @ 2015-06-18 18:16 UTC (permalink / raw)
  To: Ingo Molnar, Srikar Dronamraju; +Cc: linux-kernel, peterz, mgorman

On 06/18/2015 12:12 PM, Ingo Molnar wrote:
> * Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> 
>>>  	if (p->numa_group) {
>>>  		if (env.best_cpu == -1)
>>> @@ -1513,7 +1520,7 @@ static int task_numa_migrate(struct task_struct *p)
>>>  			nid = env.dst_nid;
>>>
>>>  		if (node_isset(nid, p->numa_group->active_nodes))
>>> -			sched_setnuma(p, env.dst_nid);
>>> +			sched_setnuma(p, nid);
>>>  	}
>>>
>>>  	/* No better CPU than the current one was found. */
>>>
>>
>> Overall this patch does seem to produce better results. However numa02
>> gets affected -vely.
> 
> Huh?
> 
> numa02 is the more important benchmark of the two. 'numa01' is a conflicting 
> workload that is a lot more sensitive to balancing details - while 'numa02' is a 
> nicely partitioned workload that should converge as fast as possible.
> 
> So if numa02 got worse then it's a bad change.

It slows down convergence.

However, for a benchmark that spends much of its time
actually doing something with the memory it accesses,
after things have converged, slowing down task movement
after convergence may be a good thing.

It would be good to get SPECjbb2005 numbers with 1, 2, and 4 instances
on the same 4 node system.

If it turns out that single instance and two instance
workloads do not benefit from trying to slow down task
movement after the workloads have converged, we can
change the code above, and simply retry task migration
much, much more often after the workload has converged
onto several nodes.


* Re: [PATCH] sched,numa: document and fix numa_preferred_nid setting
  2015-06-18 17:00       ` Rik van Riel
  2015-06-18 17:11         ` Srikar Dronamraju
@ 2015-06-19 17:16         ` Srikar Dronamraju
  2015-06-19 17:52           ` Rik van Riel
  2015-06-22 16:48           ` Srikar Dronamraju
  1 sibling, 2 replies; 14+ messages in thread
From: Srikar Dronamraju @ 2015-06-19 17:16 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, peterz, mingo, mgorman

> 
> OK, so we are looking at two multi-threaded processes
> on a 4 node system, and waiting for them to converge?
> 
> It may make sense to add my patch in with your patch
> 1/4 from last week, as well as the correct part of
> your patch 4/4, and see how they all work together.
> 

Tested specjbb and autonumabenchmark on 4 kernels.

Plain 4.1.0-rc7-tip (i)
tip + only Rik's patch (ii)
tip + Rik's ++ (iii)
tip + Srikar's ++ (iv)

(i) = Plain 4.1.0-rc7-tip = tip = 4.1.0-rc7 (b7ca96b)

(ii) =  tip + only Rik's patch =  (i) + Rik's fix numa_preferred_nid setting

(iii)  =  tip + Rik's ++ (iii) = (ii) + Srikar's numa hotness + correct nid for evaluating task weight

(iv) =  tip + Srikar's ++ (iv) = (i) + Srikar's  numa hotness + correct nid for evaluating task weight +
       numa_has_capacity fix +  always update preferred node


Plain 4.1.0-rc7-tip (i)
		Testcase:         Min         Max         Avg      StdDev
	  elapsed_numa01:      858.85      949.18      915.64       33.06
	  elapsed_numa02:       23.09       29.89       26.43        2.18
		Testcase:         Min         Max         Avg      StdDev
	   system_numa01:     1516.72     1855.08     1686.24      113.95
	   system_numa02:       63.69       79.06       70.35        5.87
		Testcase:         Min         Max         Avg      StdDev
	     user_numa01:    73284.76    80818.21    78060.88     2773.60
	     user_numa02:     1690.18     2071.07     1821.64      140.25
		Testcase:         Min         Max         Avg      StdDev
	    total_numa01:    74801.50    82572.60    79747.12     2875.61
	    total_numa02:     1753.87     2142.77     1891.99      143.59

tip + only Rik's patch (ii)
		Testcase:         Min         Max         Avg      StdDev     %Change
	  elapsed_numa01:      665.26      877.47      776.77       79.23      15.83%
	  elapsed_numa02:       24.59       31.30       28.17        2.48      -5.56%
		Testcase:         Min         Max         Avg      StdDev     %Change
	   system_numa01:      659.57     1220.99      942.36      234.92      60.92%
	   system_numa02:       44.62       86.01       64.64       14.24       6.64%
		Testcase:         Min         Max         Avg      StdDev     %Change
	     user_numa01:    56280.95    75908.81    64993.57     7764.30      17.21%
	     user_numa02:     1790.35     2155.02     1916.12      132.57      -4.38%
		Testcase:         Min         Max         Avg      StdDev     %Change
	    total_numa01:    56940.50    77128.20    65935.92     7993.49      17.91%
	    total_numa02:     1834.97     2227.03     1980.76      136.51      -3.99%

tip + Rik's ++ (iii)
		Testcase:         Min         Max         Avg      StdDev     %Change
	  elapsed_numa01:      630.60      860.06      760.07       74.33      18.09%
	  elapsed_numa02:       21.92       34.42       27.72        4.49      -3.75%
		Testcase:         Min         Max         Avg      StdDev     %Change
	   system_numa01:      474.31     1379.49      870.12      296.35      59.16%
	   system_numa02:       63.74      120.25       86.69       20.69     -13.59%
		Testcase:         Min         Max         Avg      StdDev     %Change
	     user_numa01:    53004.03    68125.84    61697.01     5011.38      24.02%
	     user_numa02:     1650.82     2278.71     1941.26      224.59      -5.25%
		Testcase:         Min         Max         Avg      StdDev     %Change
	    total_numa01:    53478.30    69505.30    62567.12     5288.18      24.72%
	    total_numa02:     1714.56     2398.96     2027.95      238.08      -5.67%


tip + Srikar's ++ (iv)
		Testcase:         Min         Max         Avg      StdDev     %Change
	  elapsed_numa01:      690.74      919.49      782.67       78.51      14.46%
	  elapsed_numa02:       21.78       29.57       26.02        2.65       1.39%
		Testcase:         Min         Max         Avg      StdDev     %Change
	   system_numa01:      659.12     1041.19      870.15      143.13      78.38%
	   system_numa02:       52.20       78.73       64.18       11.28       7.84%
		Testcase:         Min         Max         Avg      StdDev     %Change
	     user_numa01:    56410.39    71492.31    62514.78     5444.90      21.75%
	     user_numa02:     1594.27     1934.40     1754.37      126.41       3.48%
		Testcase:         Min         Max         Avg      StdDev     %Change
	    total_numa01:    57069.50    72509.90    63384.94     5567.71      22.57%
	    total_numa02:     1647.85     2010.87     1818.55      136.88       3.65%


5 iterations of SPECjbb on a 4 node, 24 core powerpc machine.
Ran 1 instance per system.

For specjbb (higher bops per JVM is better)

Plain 4.1.0-rc7-tip (i)
	  Metric:         Min         Max         Avg      StdDev
      bopsperJVM:   265519.00   272466.00   269377.80     2391.04

tip + only Rik's patch (ii)
	  Metric:         Min         Max         Avg      StdDev     %Change
      bopsperJVM:   263393.00   269660.00   266920.20     2792.07      -0.91%

tip + Rik's ++ (iii)
	  Metric:         Min         Max         Avg      StdDev     %Change
      bopsperJVM:   264298.00   271236.00   266818.20     2579.62      -0.94%

tip + Srikar's ++ (iv)
	  Metric:         Min         Max         Avg      StdDev     %Change
      bopsperJVM:   266774.00   272434.00   269839.60     2083.19      +0.17%


So the fix for numa_has_capacity and always setting the preferred node
based on fault stats seem to help autonuma and specjbb.


-- 
Thanks and Regards
Srikar Dronamraju

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


* Re: [PATCH] sched,numa: document and fix numa_preferred_nid setting
  2015-06-19 17:16         ` Srikar Dronamraju
@ 2015-06-19 17:52           ` Rik van Riel
  2015-06-22 16:04             ` Srikar Dronamraju
  2015-06-22 16:48           ` Srikar Dronamraju
  1 sibling, 1 reply; 14+ messages in thread
From: Rik van Riel @ 2015-06-19 17:52 UTC (permalink / raw)
  To: Srikar Dronamraju; +Cc: linux-kernel, peterz, mingo, mgorman

On 06/19/2015 01:16 PM, Srikar Dronamraju wrote:
>>
>> OK, so we are looking at two multi-threaded processes
>> on a 4 node system, and waiting for them to converge?
>>
>> It may make sense to add my patch in with your patch
>> 1/4 from last week, as well as the correct part of
>> your patch 4/4, and see how they all work together.
>>
> 
> Tested specjbb and autonumabenchmark on 4 kernels.
> 
> Plain 4.1.0-rc7-tip (i)
> tip + only Rik's patch (ii)
> tip + Rik's ++ (iii)
> tip + Srikar's ++ (iv)

> 5 interations of Specjbb on 4 node, 24 core powerpc machine.
> Ran 1 instance per system.

Would you happen to have 2 instance and 4 instance SPECjbb
numbers, too?  The single instance numbers seem to be within
the margin of error, but I would expect multi-instance numbers
to show more dramatic changes, due to changes in how workloads
converge...

Those behave very differently from single instance, especially
with the "always set the preferred_nid, even if we moved the
task to a node we do NOT prefer" patch...

It would be good to understand the behaviour of these patches
under more circumstances.

-- 
All rights reversed


* Re: [PATCH] sched,numa: document and fix numa_preferred_nid setting
  2015-06-19 17:52           ` Rik van Riel
@ 2015-06-22 16:04             ` Srikar Dronamraju
  0 siblings, 0 replies; 14+ messages in thread
From: Srikar Dronamraju @ 2015-06-22 16:04 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, peterz, mingo, mgorman


> Would you happen to have 2 instance and 4 instance SPECjbb
> numbers, too?  The single instance numbers seem to be within
> the margin of error, but I would expect multi-instance numbers
> to show more dramatic changes, due to changes in how workloads
> converge...
>
> Those behave very differently from single instance, especially
> with the "always set the preferred_nid, even if we moved the
> task to a node we do NOT prefer" patch...
>
> It would be good to understand the behaviour of these patches
> under more circumstances.

Here are specjbb2005 numbers with 1 JVM per System, 2 JVMs per System
and 4 JVMs per System.

Plain 4.1.0-rc7-tip (i)
tip + Rik's ++ (ii)
tip + Srikar's ++ (iii)
tip + Srikar's + Modified Rik's patch (iv)

(i) = Plain 4.1.0-rc7-tip = tip = 4.1.0-rc7 (b7ca96b)

(ii) =  tip + only Rik's suggested patches =  (i) + Rik's fix numa_preferred_nid setting
	+ Srikar's numa hotness + correct nid for evaluating task weight

(iii)  = tip + Srikar's ++ (iii) = (i) + Srikar's  numa hotness + correct nid for evaluating
	 task weight + numa_has_capacity fix +  always update preferred node

(iv) =  tip + Srikar's ++ (iv) = (i) + Srikar's  numa hotness + correct nid for evaluating
	 task weight + numa_has_capacity fix + Rik's modified patch.
	(Rik's modified patch == I removed node_isset check before setting
	nid as the preferred node)

jbb2005_1JVMperSYSTEM
Plain 4.1.0-rc7-tip (i)
		  Metric:         Min         Max         Avg      StdDev     %Change
	      bopsperJVM:   265519.00   272466.00   269377.80     2391.04

tip + Rik's ++ (ii)
	      bopsperJVM:   264298.00   271236.00   266818.20     2579.62      -0.94%

tip + Srikar's ++ (iii)
	      bopsperJVM:   266774.00   272434.00   269839.60     2083.19       0.17%

tip + Srikar's + Rik's (iv)
	      bopsperJVM:   265037.00   274419.00   269280.00     3146.74      -0.04%



jbb2005_2JVMperSYSTEM
Plain 4.1.0-rc7-tip (i)
		  Metric:         Min         Max         Avg      StdDev     %Change
	      bopsperJVM:   269575.00   288495.00   279910.80     6151.49

tip + Srikar's ++ (iii)
	      bopsperJVM:   278810.00   287706.00   282514.00     2946.37      0.90%

tip + Rik's ++ (ii)
	      bopsperJVM:   286785.00   289515.00   288311.80     1206.66      2.90%

tip + Srikar's + Rik's (iv)
	      bopsperJVM:   283295.00   293466.00   287848.80     3427.06      2.70%


jbb2005_4JVMperSYSTEM
Plain 4.1.0-rc7-tip (i)
		  Metric:         Min         Max         Avg      StdDev     %Change
	      bopsperJVM:   248392.00   263826.00   257263.20     5946.44

tip + Rik's ++ (ii)
	      bopsperJVM:   257057.00   260303.00   258819.00     1234.46      0.60%

tip + Srikar's ++ (iii)
	      bopsperJVM:   252968.00   262006.00   257321.80     3131.00      0.02%

tip + Srikar's + Rik's (iv)
	      bopsperJVM:   257063.00   266196.00   262547.80     3099.57      1.99%


Summary:
Rik's suggested patchset performs best in the 2 JVM case and in numa01.
The modified version of his patch provides good performance in the 2 JVM
and 4 JVM cases as well as in numa01. Both patchsets regress a little in
numa02 (probably less so with the modified patch).


-- 
Thanks and Regards
Srikar Dronamraju



* Re: [PATCH] sched,numa: document and fix numa_preferred_nid setting
  2015-06-16 19:54 [PATCH] sched,numa: document and fix numa_preferred_nid setting Rik van Riel
  2015-06-18 15:55 ` Srikar Dronamraju
@ 2015-06-22 16:13 ` Srikar Dronamraju
  2015-06-22 22:28   ` Rik van Riel
  1 sibling, 1 reply; 14+ messages in thread
From: Srikar Dronamraju @ 2015-06-22 16:13 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, peterz, mingo, mgorman

> +	 * migrating the task to where it really belongs.
> +	 * The exception is a task that belongs to a large numa_group, which
> +	 * spans multiple NUMA nodes. If that task migrates into one of the
> +	 * workload's active nodes, remember that node as the task's
> +	 * numa_preferred_nid, so the workload can settle down.
>  	 */
>  	if (p->numa_group) {
>  		if (env.best_cpu == -1)
> @@ -1513,7 +1520,7 @@ static int task_numa_migrate(struct task_struct *p)
>  			nid = env.dst_nid;
>  
>  		if (node_isset(nid, p->numa_group->active_nodes))
> -			sched_setnuma(p, env.dst_nid);
> +			sched_setnuma(p, nid);
>  	}
>  
>  	/* No better CPU than the current one was found. */
> 

When I refer to the modified Rik's patch, I mean removing the
node_isset() check before the sched_setnuma(). With that change, we
somewhat reduce the numa02 and 1 JVM per System regression while getting
numbers as good as Rik's patch with 2 JVM and 4 JVM per System.

The idea behind removing the node_isset() check is:
node_isset is mostly used to track memory movement to nodes where cpus
are running, and not vice versa, as per the comment in
update_numa_active_node_mask(). There could be a situation where a
task's memory is all on one node, and the node has capacity to
accommodate the task, but no tasks associated with the task's numa_group
have run enough on that node. In such a case, we shouldn't be ruling out
migrating the task to that node.

-- 
Thanks and Regards
Srikar Dronamraju



* Re: [PATCH] sched,numa: document and fix numa_preferred_nid setting
  2015-06-19 17:16         ` Srikar Dronamraju
  2015-06-19 17:52           ` Rik van Riel
@ 2015-06-22 16:48           ` Srikar Dronamraju
  1 sibling, 0 replies; 14+ messages in thread
From: Srikar Dronamraju @ 2015-06-22 16:48 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, peterz, mingo, mgorman

Updated autonumabenchmark numbers

Plain 4.1.0-rc7-tip (i)
		Testcase:         Min         Max         Avg      StdDev
	  elapsed_numa01:      858.85      949.18      915.64       33.06
	  elapsed_numa02:       23.09       29.89       26.43        2.18
		Testcase:         Min         Max         Avg      StdDev
	   system_numa01:     1516.72     1855.08     1686.24      113.95
	   system_numa02:       63.69       79.06       70.35        5.87
		Testcase:         Min         Max         Avg      StdDev
	     user_numa01:    73284.76    80818.21    78060.88     2773.60
	     user_numa02:     1690.18     2071.07     1821.64      140.25
		Testcase:         Min         Max         Avg      StdDev
	    total_numa01:    74801.50    82572.60    79747.12     2875.61
	    total_numa02:     1753.87     2142.77     1891.99      143.59

tip + Rik's ++ (ii)
		Testcase:         Min         Max         Avg      StdDev     %Change
	  elapsed_numa01:      630.60      860.06      760.07       74.33      18.09%
	  elapsed_numa02:       21.92       34.42       27.72        4.49      -3.75%
		Testcase:         Min         Max         Avg      StdDev     %Change
	   system_numa01:      474.31     1379.49      870.12      296.35      59.16%
	   system_numa02:       63.74      120.25       86.69       20.69     -13.59%
		Testcase:         Min         Max         Avg      StdDev     %Change
	     user_numa01:    53004.03    68125.84    61697.01     5011.38      24.02%
	     user_numa02:     1650.82     2278.71     1941.26      224.59      -5.25%
		Testcase:         Min         Max         Avg      StdDev     %Change
	    total_numa01:    53478.30    69505.30    62567.12     5288.18      24.72%
	    total_numa02:     1714.56     2398.96     2027.95      238.08      -5.67%

tip + Srikar's ++ (iii)
		Testcase:         Min         Max         Avg      StdDev     %Change
	  elapsed_numa01:      690.74      919.49      782.67       78.51      14.46%
	  elapsed_numa02:       21.78       29.57       26.02        2.65       1.39%
		Testcase:         Min         Max         Avg      StdDev     %Change
	   system_numa01:      659.12     1041.19      870.15      143.13      78.38%
	   system_numa02:       52.20       78.73       64.18       11.28       7.84%
		Testcase:         Min         Max         Avg      StdDev     %Change
	     user_numa01:    56410.39    71492.31    62514.78     5444.90      21.75%
	     user_numa02:     1594.27     1934.40     1754.37      126.41       3.48%
		Testcase:         Min         Max         Avg      StdDev     %Change
	    total_numa01:    57069.50    72509.90    63384.94     5567.71      22.57%
	    total_numa02:     1647.85     2010.87     1818.55      136.88       3.65%

tip + Srikar's + Modified Rik's patch (iv)
		Testcase:         Min         Max         Avg      StdDev     %Change
	  elapsed_numa01:      674.72      815.10      746.50       51.81      20.75%
	  elapsed_numa02:       21.02       34.57       27.58        4.63      -3.33%
		Testcase:         Min         Max         Avg      StdDev     %Change
	   system_numa01:      726.19     1099.16      879.36      141.02      73.41%
	   system_numa02:       33.99       72.99       58.89       14.26      15.70%
		Testcase:         Min         Max         Avg      StdDev     %Change
	     user_numa01:    56350.75    67318.49    61998.95     4696.34      23.86%
	     user_numa02:     1518.66     2301.80     1882.96      261.18      -2.66%
		Testcase:         Min         Max         Avg      StdDev     %Change
	    total_numa01:    57076.90    68417.70    62878.32     4826.93      24.66%
	    total_numa02:     1552.65     2374.79     1941.85      274.28      -2.10%


(i) = Plain 4.1.0-rc7-tip = tip = 4.1.0-rc7 (b7ca96b)

(ii) = tip + Rik's ++ = (i) + Rik's fix numa_preferred_nid setting
	+ Srikar's numa hotness + correct nid for evaluating task weight

(iii) = tip + Srikar's ++ = (i) + Srikar's numa hotness + correct nid for evaluating
	task weight + numa_has_capacity fix + always update preferred node

(iv) = tip + Srikar's + Modified Rik's patch = (i) + Srikar's numa hotness
	+ correct nid for evaluating task weight + numa_has_capacity fix
	+ Rik's modified patch (Rik's modified patch == I removed the
	node_isset check before setting nid as the preferred node)

-- 
Thanks and Regards
Srikar Dronamraju

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] sched,numa: document and fix numa_preferred_nid setting
  2015-06-22 16:13 ` Srikar Dronamraju
@ 2015-06-22 22:28   ` Rik van Riel
  0 siblings, 0 replies; 14+ messages in thread
From: Rik van Riel @ 2015-06-22 22:28 UTC (permalink / raw)
  To: Srikar Dronamraju; +Cc: linux-kernel, peterz, mingo, mgorman

On 06/22/2015 12:13 PM, Srikar Dronamraju wrote:
>> +	 * migrating the task to where it really belongs.
>> +	 * The exception is a task that belongs to a large numa_group, which
>> +	 * spans multiple NUMA nodes. If that task migrates into one of the
>> +	 * workload's active nodes, remember that node as the task's
>> +	 * numa_preferred_nid, so the workload can settle down.
>>  	 */
>>  	if (p->numa_group) {
>>  		if (env.best_cpu == -1)
>> @@ -1513,7 +1520,7 @@ static int task_numa_migrate(struct task_struct *p)
>>  			nid = env.dst_nid;
>>  
>>  		if (node_isset(nid, p->numa_group->active_nodes))
>> -			sched_setnuma(p, env.dst_nid);
>> +			sched_setnuma(p, nid);
>>  	}
>>  
>>  	/* No better CPU than the current one was found. */
>>
> 
> When I refer to the Modified Rik's patch, I mean removing the
> node_isset() check before the sched_setnuma(). With that change, we
> somewhat reduce the numa02 and 1-JVM-per-system regression while getting
> numbers as good as Rik's patch with 2 JVMs and 4 JVMs per system.
> 
> The idea behind removing the node_isset() check is:
> node_isset() is mostly used to track memory movement to nodes where cpus
> are running, and not vice versa.  This is as per the comment in
> update_numa_active_node_mask(). There could be a situation where the
> task's memory is all on a node, and the node has capacity to accommodate
> it, but no tasks associated with the task have run enough on that node.
> In such a case, we shouldn't be ruling out migrating the task to the node.

That is a good point.

However, if overriding the preferred_nid that task_numa_placement
identified is a good idea in task_numa_migrate, would it also be
a good idea for tasks that are NOT part of a numa group?

What are the consequences of never setting preferred_nid from
task_numa_migrate?   (we would try to migrate the task to a
better node more frequently)

What are the consequences of always setting preferred_nid from
task_numa_migrate?   (we would only try migrating the task once,
and it could get stuck in a sub-optimal location)

The patch seems to work, but I do not understand why, and would
like to know your ideas on why you think the patch works.

I am really not looking forward to the idea of maintaining code
that nobody understands...

-- 
All rights reversed


end of thread, other threads:[~2015-06-22 22:29 UTC | newest]

Thread overview: 14+ messages
-- links below jump to the message on this page --
2015-06-16 19:54 [PATCH] sched,numa: document and fix numa_preferred_nid setting Rik van Riel
2015-06-18 15:55 ` Srikar Dronamraju
2015-06-18 16:06   ` Rik van Riel
2015-06-18 16:41     ` Srikar Dronamraju
2015-06-18 17:00       ` Rik van Riel
2015-06-18 17:11         ` Srikar Dronamraju
2015-06-19 17:16         ` Srikar Dronamraju
2015-06-19 17:52           ` Rik van Riel
2015-06-22 16:04             ` Srikar Dronamraju
2015-06-22 16:48           ` Srikar Dronamraju
2015-06-18 16:12   ` Ingo Molnar
2015-06-18 18:16     ` Rik van Riel
2015-06-22 16:13 ` Srikar Dronamraju
2015-06-22 22:28   ` Rik van Riel
