* [PATCH 0/3] sched,numa: reduce page migrations with pseudo-interleaving
@ 2014-04-11 17:00 riel
  2014-04-11 17:00 ` [PATCH 1/3] sched,numa: count pages on active node as local riel
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: riel @ 2014-04-11 17:00 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, peterz, chegu_vinod, mgorman

The pseudo-interleaving code deals fairly well with the placement
of tasks that are part of workloads spanning multiple NUMA nodes,
but a number of corner cases remain that can result in higher
overhead than desired.

This patch series reduces the overhead slightly, mostly visible
through a lower number of page migrations, while leaving the
throughput of the workload essentially unchanged.

On smaller NUMA systems, these patches should have little or no
effect. On a 4 node test system, I did see a reduction in the
number of page migrations running SPECjbb2005; autonuma-benchmark
appears to be unaffected.

NUMA page migrations on an 8 node system running SPECjbb2005,
courtesy of Vinod:

			vanilla		with patch
8 - 1 socket wide:	9138324		8918971
4 - 2 socket wide:	8239914		7315148
2 - 4 socket wide:	5732744		3849624
1 - 8 socket wide:	3348475		2660347 



* [PATCH 1/3] sched,numa: count pages on active node as local
  2014-04-11 17:00 [PATCH 0/3] sched,numa: reduce page migrations with pseudo-interleaving riel
@ 2014-04-11 17:00 ` riel
  2014-04-11 17:34   ` Joe Perches
                     ` (2 more replies)
  2014-04-11 17:00 ` [PATCH 2/3] sched,numa: retry placement more frequently when misplaced riel
  2014-04-11 17:00 ` [PATCH 3/3] sched,numa: do not set preferred_node on migration to a second choice node riel
  2 siblings, 3 replies; 19+ messages in thread
From: riel @ 2014-04-11 17:00 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, peterz, chegu_vinod, mgorman

From: Rik van Riel <riel@redhat.com>

The NUMA code is smart enough to distribute the memory of workloads
that span multiple NUMA nodes across those NUMA nodes.

However, it still has a pretty high scan rate for such workloads,
because any memory that is left on a node other than the node of
the CPU that faulted on the memory is counted as non-local, which
causes the scan rate to go up.

Counting the memory on any node where the task's numa group is
actively running as local allows the scan rate to slow down
once the application is settled in.

This should reduce the overhead of the automatic NUMA placement
code, when a workload spans multiple NUMA nodes.

Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Vinod Chegu <chegu_vinod@hp.com>
---
 kernel/sched/fair.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7e9bd0b..12de50a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1737,6 +1737,7 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 	struct task_struct *p = current;
 	bool migrated = flags & TNF_MIGRATED;
 	int cpu_node = task_node(current);
+	int local = !!(flags & TNF_FAULT_LOCAL);
 	int priv;
 
 	if (!numabalancing_enabled)
@@ -1785,6 +1786,17 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 			task_numa_group(p, last_cpupid, flags, &priv);
 	}
 
+	/*
+	 * If a workload spans multiple NUMA nodes, a shared fault that
+	 * occurs wholly within the set of nodes that the workload is
+	 * actively using should be counted as local. This allows the
+	 * scan rate to slow down when a workload has settled down.
+	 */
+	if (!priv && !local && p->numa_group &&
+			node_isset(cpu_node, p->numa_group->active_nodes) &&
+			node_isset(mem_node, p->numa_group->active_nodes))
+		local = 1;
+
 	task_numa_placement(p);
 
 	/*
@@ -1799,7 +1811,7 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 
 	p->numa_faults_buffer_memory[task_faults_idx(mem_node, priv)] += pages;
 	p->numa_faults_buffer_cpu[task_faults_idx(cpu_node, priv)] += pages;
-	p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
+	p->numa_faults_locality[local] += pages;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
-- 
1.8.5.3
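
To see the new branch in isolation, here is a stand-alone user-space sketch
of the idea; the names below (toy_group, toy_numa_fault, NODE_BIT) are
simplified stand-ins invented for illustration, not kernel APIs, a plain
bitmask plays the role of numa_group->active_nodes, and the private/shared
distinction is left out:

/*
 * Sketch of "count pages on active nodes as local": a cross-node fault
 * whose CPU node and memory node both lie in the group's active node
 * set is folded into the local bucket.
 */
#include <stdio.h>

#define NODE_BIT(n)	(1UL << (n))

struct toy_group {
	unsigned long active_nodes;		/* stand-in for nodemask_t */
};

static unsigned long faults_locality[2];	/* [0] = remote, [1] = local */

static void toy_numa_fault(struct toy_group *ng, int cpu_node,
			   int mem_node, int pages)
{
	int local = (cpu_node == mem_node);

	/* The new rule: faults staying within the active nodes count as local. */
	if (!local && ng &&
	    (ng->active_nodes & NODE_BIT(cpu_node)) &&
	    (ng->active_nodes & NODE_BIT(mem_node)))
		local = 1;

	faults_locality[local] += pages;
}

int main(void)
{
	struct toy_group ng = { .active_nodes = NODE_BIT(0) | NODE_BIT(1) };

	toy_numa_fault(&ng, 0, 0, 32);	/* same node: local either way */
	toy_numa_fault(&ng, 0, 1, 32);	/* cross-node, both nodes active */
	toy_numa_fault(&ng, 0, 3, 32);	/* cross-node, node 3 not active */

	printf("local: %lu pages, remote: %lu pages\n",
	       faults_locality[1], faults_locality[0]);
	return 0;
}

With the patch, the second fault lands in the local bucket, so only the
third one (outside the active set) keeps the task looking non-local.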



* [PATCH 2/3] sched,numa: retry placement more frequently when misplaced
  2014-04-11 17:00 [PATCH 0/3] sched,numa: reduce page migrations with pseudo-interleaving riel
  2014-04-11 17:00 ` [PATCH 1/3] sched,numa: count pages on active node as local riel
@ 2014-04-11 17:00 ` riel
  2014-04-11 17:46   ` Joe Perches
                     ` (2 more replies)
  2014-04-11 17:00 ` [PATCH 3/3] sched,numa: do not set preferred_node on migration to a second choice node riel
  2 siblings, 3 replies; 19+ messages in thread
From: riel @ 2014-04-11 17:00 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, peterz, chegu_vinod, mgorman

From: Rik van Riel <riel@redhat.com>

When tasks have not converged on their preferred nodes yet, we want
to retry fairly often, to make sure we do not migrate a task's memory
to an undesirable location, only to have to move it again later.

This patch reduces the interval at which migration is retried,
when the task's numa_scan_period is small.

Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Vinod Chegu <chegu_vinod@hp.com>
---
 kernel/sched/fair.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 12de50a..babd316 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1326,12 +1326,15 @@ static int task_numa_migrate(struct task_struct *p)
 /* Attempt to migrate a task to a CPU on the preferred node. */
 static void numa_migrate_preferred(struct task_struct *p)
 {
+	unsigned long interval = HZ;
+
 	/* This task has no NUMA fault statistics yet */
 	if (unlikely(p->numa_preferred_nid == -1 || !p->numa_faults_memory))
 		return;
 
 	/* Periodically retry migrating the task to the preferred node */
-	p->numa_migrate_retry = jiffies + HZ;
+	interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16);
+	p->numa_migrate_retry = jiffies + interval;
 
 	/* Success if task is already running on preferred CPU */
 	if (task_node(p) == p->numa_preferred_nid)
-- 
1.8.5.3
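
A stand-alone sketch of the arithmetic in the hunk above, assuming HZ = 1000
so that one jiffy is one millisecond (a common but not universal
configuration); toy_msecs_to_jiffies and toy_retry_interval are local helpers
for illustration, not kernel functions:

/* Retry-interval arithmetic from numa_migrate_preferred(), HZ = 1000. */
#include <stdio.h>

#define TOY_HZ	1000UL

static unsigned long toy_msecs_to_jiffies(unsigned long msecs)
{
	return msecs;			/* identity mapping, HZ = 1000 only */
}

static unsigned long toy_retry_interval(unsigned long numa_scan_period)
{
	unsigned long interval = TOY_HZ;
	unsigned long scaled = toy_msecs_to_jiffies(numa_scan_period) / 16;

	return scaled < interval ? scaled : interval;
}

int main(void)
{
	unsigned long periods_ms[] = { 1000, 5000, 60000 };

	for (unsigned int i = 0; i < 3; i++)
		printf("scan period %5lu ms -> retry after %4lu jiffies\n",
		       periods_ms[i], toy_retry_interval(periods_ms[i]));
	return 0;
}

A 1000 ms scan period now retries after 62 jiffies instead of a full second,
while long scan periods (60000 ms here) still hit the old once-per-HZ cap.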



* [PATCH 3/3] sched,numa: do not set preferred_node on migration to a second choice node
  2014-04-11 17:00 [PATCH 0/3] sched,numa: reduce page migrations with pseudo-interleaving riel
  2014-04-11 17:00 ` [PATCH 1/3] sched,numa: count pages on active node as local riel
  2014-04-11 17:00 ` [PATCH 2/3] sched,numa: retry placement more frequently when misplaced riel
@ 2014-04-11 17:00 ` riel
  2014-04-14 12:56   ` Peter Zijlstra
                     ` (2 more replies)
  2 siblings, 3 replies; 19+ messages in thread
From: riel @ 2014-04-11 17:00 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, peterz, chegu_vinod, mgorman

From: Rik van Riel <riel@redhat.com>

Setting the numa_preferred_node for a task in task_numa_migrate
does nothing on a 2-node system. Either we migrate to the node
that already was our preferred node, or we stay where we were.

On a 4-node system, it can slightly decrease overhead, by not
calling the NUMA code as much. Since every node tends to be
directly connected to every other node, running on the wrong
node for a while does not do much damage.

However, on an 8 node system, there are far more bad nodes
than there are good ones, and pretending that a second choice
is actually the preferred node can greatly delay, or even
prevent, a workload from converging.

The only time we can safely pretend that a second choice
node is the preferred node is when the task is part of a
workload that spans multiple NUMA nodes.

Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Vinod Chegu <chegu_vinod@hp.com>
---
 kernel/sched/fair.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index babd316..302facf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1301,7 +1301,16 @@ static int task_numa_migrate(struct task_struct *p)
 	if (env.best_cpu == -1)
 		return -EAGAIN;
 
-	sched_setnuma(p, env.dst_nid);
+	/*
+	 * If the task is part of a workload that spans multiple NUMA nodes,
+	 * and is migrating into one of the workload's active nodes, remember
+	 * this node as the task's preferred numa node, so the workload can
+	 * settle down.
+	 * A task that migrated to a second choice node will be better off
+	 * trying for a better one later. Do not set the preferred node here.
+	 */
+	if (p->numa_group && node_isset(env.dst_nid, p->numa_group->active_nodes))
+		sched_setnuma(p, env.dst_nid);
 
 	/*
 	 * Reset the scan period if the task is being rescheduled on an
-- 
1.8.5.3



* Re: [PATCH 1/3] sched,numa: count pages on active node as local
  2014-04-11 17:00 ` [PATCH 1/3] sched,numa: count pages on active node as local riel
@ 2014-04-11 17:34   ` Joe Perches
  2014-04-11 17:41     ` Rik van Riel
  2014-04-25  9:04   ` Mel Gorman
  2014-05-08 10:42   ` [tip:sched/core] sched/numa: Count " tip-bot for Rik van Riel
  2 siblings, 1 reply; 19+ messages in thread
From: Joe Perches @ 2014-04-11 17:34 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, mingo, peterz, chegu_vinod, mgorman

On Fri, 2014-04-11 at 13:00 -0400, riel@redhat.com wrote:
> This should reduce the overhead of the automatic NUMA placement
> code, when a workload spans multiple NUMA nodes.

trivial style note:

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
[]
> @@ -1737,6 +1737,7 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
>  	struct task_struct *p = current;
>  	bool migrated = flags & TNF_MIGRATED;
>  	int cpu_node = task_node(current);
> +	int local = !!(flags & TNF_FAULT_LOCAL);

Perhaps local would look nicer as bool
and be better placed next to migrated.

	bool migrated = flags & TNF_MIGRATED;
	bool local = flags & TNF_FAULT_LOCAL;
	int cpu_node = task_node(current);

> @@ -1785,6 +1786,17 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
> +	if (!priv && !local && p->numa_group &&
> +			node_isset(cpu_node, p->numa_group->active_nodes) &&
> +			node_isset(mem_node, p->numa_group->active_nodes))
> +		local = 1;

		local = true;




* Re: [PATCH 1/3] sched,numa: count pages on active node as local
  2014-04-11 17:34   ` Joe Perches
@ 2014-04-11 17:41     ` Rik van Riel
  2014-04-11 18:01       ` Joe Perches
  0 siblings, 1 reply; 19+ messages in thread
From: Rik van Riel @ 2014-04-11 17:41 UTC (permalink / raw)
  To: Joe Perches; +Cc: linux-kernel, mingo, peterz, chegu_vinod, mgorman

On 04/11/2014 01:34 PM, Joe Perches wrote:
> On Fri, 2014-04-11 at 13:00 -0400, riel@redhat.com wrote:
>> This should reduce the overhead of the automatic NUMA placement
>> code, when a workload spans multiple NUMA nodes.
> 
> trivial style note:
> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> []
>> @@ -1737,6 +1737,7 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
>>  	struct task_struct *p = current;
>>  	bool migrated = flags & TNF_MIGRATED;
>>  	int cpu_node = task_node(current);
>> +	int local = !!(flags & TNF_FAULT_LOCAL);
> 
> Perhaps local would look nicer as bool
> and be better placed next to migrated.

The problem is, at the end of the function, local is used as an
array index...

        p->numa_faults_locality[local] += pages;
}

I'm not sure I really want to use a bool as an array index :)

-- 
All rights reversed
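
Either form collapses the flag to exactly 0 or 1, which is what makes it
usable as an index into the two-element array; a minimal example (the 0x08
flag value is made up for illustration, not the real TNF_FAULT_LOCAL
definition):

/* !!(flags & BIT) is always 0 or 1; a bool converts to the same index. */
#include <stdio.h>
#include <stdbool.h>

#define TOY_FAULT_LOCAL	0x08		/* illustrative value only */

int main(void)
{
	unsigned long counters[2] = { 0, 0 };	/* [0] remote, [1] local */
	int flags = TOY_FAULT_LOCAL | 0x01;

	int local = !!(flags & TOY_FAULT_LOCAL);	/* 1, never 0x08 */
	bool blocal = flags & TOY_FAULT_LOCAL;		/* also indexes as 1 */

	counters[local] += 16;
	counters[blocal] += 16;

	printf("remote: %lu, local: %lu\n", counters[0], counters[1]);
	return 0;
}

Both increments land on counters[1], so the choice between int with !! and
bool is purely a matter of style here.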


* Re: [PATCH 2/3] sched,numa: retry placement more frequently when misplaced
  2014-04-11 17:00 ` [PATCH 2/3] sched,numa: retry placement more frequently when misplaced riel
@ 2014-04-11 17:46   ` Joe Perches
  2014-04-11 18:03     ` Rik van Riel
  2014-04-25  9:05   ` Mel Gorman
  2014-05-08 10:42   ` [tip:sched/core] sched/numa: Retry " tip-bot for Rik van Riel
  2 siblings, 1 reply; 19+ messages in thread
From: Joe Perches @ 2014-04-11 17:46 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, mingo, peterz, chegu_vinod, mgorman

On Fri, 2014-04-11 at 13:00 -0400, riel@redhat.com wrote:
> This patch reduces the interval at which migration is retried,
> when the task's numa_scan_period is small.

More style trivia and a question.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
[]
> @@ -1326,12 +1326,15 @@ static int task_numa_migrate(struct task_struct *p)
>  /* Attempt to migrate a task to a CPU on the preferred node. */
>  static void numa_migrate_preferred(struct task_struct *p)
>  {
> +	unsigned long interval = HZ;

Perhaps it'd be better without the unnecessary initialization.

>  	/* This task has no NUMA fault statistics yet */
>  	if (unlikely(p->numa_preferred_nid == -1 || !p->numa_faults_memory))
>  		return;
>  
>  	/* Periodically retry migrating the task to the preferred node */
> -	p->numa_migrate_retry = jiffies + HZ;
> +	interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16);

and use

	interval = min_t(unsigned long, HZ,
			 msecs_to_jiffies(p->numa_scan_period) / 16);


btw; why 16?

Can msecs_to_jiffies(p->numa_scan_period) ever be < 16?




* Re: [PATCH 1/3] sched,numa: count pages on active node as local
  2014-04-11 17:41     ` Rik van Riel
@ 2014-04-11 18:01       ` Joe Perches
  0 siblings, 0 replies; 19+ messages in thread
From: Joe Perches @ 2014-04-11 18:01 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, mingo, peterz, chegu_vinod, mgorman

On Fri, 2014-04-11 at 13:41 -0400, Rik van Riel wrote:
> On 04/11/2014 01:34 PM, Joe Perches wrote:
> > On Fri, 2014-04-11 at 13:00 -0400, riel@redhat.com wrote:
> >> This should reduce the overhead of the automatic NUMA placement
> >> code, when a workload spans multiple NUMA nodes.
[]
> > Perhaps local would look nicer as bool
> > and be better placed next to migrated.
> 
> The problem is, at the end of the function, local is used as an
> array index...

Oh, thanks.

>         p->numa_faults_locality[local] += pages;
> 
> I'm not sure I really want to use a bool as an array index :)

Nor I particularly but using bool as an array index is
already done in at least kernel/workqueue.c and
kernel/trace/blktrace.c

Doesn't make it right of course.



* Re: [PATCH 2/3] sched,numa: retry placement more frequently when misplaced
  2014-04-11 17:46   ` Joe Perches
@ 2014-04-11 18:03     ` Rik van Riel
  2014-04-14  8:19       ` Ingo Molnar
  0 siblings, 1 reply; 19+ messages in thread
From: Rik van Riel @ 2014-04-11 18:03 UTC (permalink / raw)
  To: Joe Perches; +Cc: linux-kernel, mingo, peterz, chegu_vinod, mgorman

On 04/11/2014 01:46 PM, Joe Perches wrote:
> On Fri, 2014-04-11 at 13:00 -0400, riel@redhat.com wrote:
>> This patch reduces the interval at which migration is retried,
>> when the task's numa_scan_period is small.
> 
> More style trivia and a question.
> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> []
>> @@ -1326,12 +1326,15 @@ static int task_numa_migrate(struct task_struct *p)
>>  /* Attempt to migrate a task to a CPU on the preferred node. */
>>  static void numa_migrate_preferred(struct task_struct *p)
>>  {
>> +	unsigned long interval = HZ;
> 
> Perhaps it'd be better without the unnecessary initialization.
> 
>>  	/* This task has no NUMA fault statistics yet */
>>  	if (unlikely(p->numa_preferred_nid == -1 || !p->numa_faults_memory))
>>  		return;
>>  
>>  	/* Periodically retry migrating the task to the preferred node */
>> -	p->numa_migrate_retry = jiffies + HZ;
>> +	interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16);
> 
> and use
> 
> 	interval = min_t(unsigned long, HZ,
> 			 msecs_to_jiffies(p->numa_scan_period) / 16);

That's what I had before, but spilling things over across
multiple lines like that didn't exactly help readability.

> btw; why 16?
> 
> Can msecs_to_jiffies(p->numa_scan_period) ever be < 16?

I picked 16 because there is a cost tradeoff between unmapping
and faulting (and potentially migrating) a task's memory, which
is very expensive, and searching for a better NUMA node to run
on, which is potentially slightly expensive.

This way we may run on the wrong NUMA node for around 6% of the
time between unmapping all of the task's memory (and faulting
it back in with NUMA hinting faults), before retrying migration
of the task to a better node.

I suppose it is possible for a sysadmin to set the minimum
numa scan period to under 16 milliseconds, but if your system
is trying to unmap all of a task's memory every 16 milliseconds,
and fault it back in, task placement is likely to be the least
of your problems :)

-- 
All rights reversed
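
A back-of-the-envelope check of the 6% figure and of Joe's sub-16 ms
question, assuming HZ = 1000 so one jiffy is one millisecond; the HZ cap
from the patch never triggers at these values and is omitted, and nothing
below is kernel code:

/* Fraction of a scan period spent before the next placement retry. */
#include <stdio.h>

int main(void)
{
	unsigned long periods_ms[] = { 1000, 10000, 10 };

	for (unsigned int i = 0; i < 3; i++) {
		unsigned long period = periods_ms[i];
		unsigned long interval = period / 16;	/* integer division */

		printf("scan period %5lu ms -> retry interval %3lu ms (%.2f%%)\n",
		       period, interval, 100.0 * interval / period);
	}
	return 0;
}

The first two rows give the roughly 6% (1/16) window Rik describes; the
10 ms row truncates to a zero-jiffy interval, i.e. the retry throttle
effectively disappears, which matches the point that such a setting would
cause bigger problems than task placement anyway.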


* Re: [PATCH 2/3] sched,numa: retry placement more frequently when misplaced
  2014-04-11 18:03     ` Rik van Riel
@ 2014-04-14  8:19       ` Ingo Molnar
  0 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2014-04-14  8:19 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Joe Perches, linux-kernel, peterz, chegu_vinod, mgorman,
	the arch/x86 maintainers


* Rik van Riel <riel@redhat.com> wrote:

> On 04/11/2014 01:46 PM, Joe Perches wrote:
> > On Fri, 2014-04-11 at 13:00 -0400, riel@redhat.com wrote:
> >> This patch reduces the interval at which migration is retried,
> >> when the task's numa_scan_period is small.
> > 
> > More style trivia and a question.

[...]

> > 	interval = min_t(unsigned long, HZ,
> > 			 msecs_to_jiffies(p->numa_scan_period) / 16);
> 
> That's what I had before, but spilling things over across multiple 
> lines like that didn't exactly help readability.

Joe, the low quality 'trivial reviews' you are doing are becoming a 
burden on people, so could you please either exclude arch/x86/ from 
your "waste other people's time" email filter; or improve the quality 
(and relevance) of your replies, so that you consider not just 
trivialities but the bigger picture as well?

Most people who enter kernel development start with trivial patches 
and within a few months of learning the kernel ropes rise up to more 
serious, more useful topics.

This process is constructive: it gives the Linux kernel a constant 
influx of useful cleanup patches, while it also gives newbies a 
smooth, easy path of entry into kernel hacking. Thus we maintainers 
are friendly and inclusive to newbie-originated cleanups.

But the problem is that you remained stuck at the cleanup level for 
about a decade, and you don't seem to be willing to improve ...

The thing is, as of today there's hundreds of thousands of 'cleanups' 
flagged by checkpatch.pl on the latest kernel tree. That body of "easy 
changes" is more than enough to keep newbies busy, but had all those 
lines of code been commented on for trivialities by cleanup-only 
people like you - forcing another 1 million often bogus commits into 
the development process - it would have bogged Linux down to a 
standstill and would have prevented _real_ review from occurring.

Every time you comment on a patch considering only trivialities, you
risk crowding out a real reviewer who, seeing a reply in the email
thread, might mistakenly believe that the patch in question already
got a thorough review...

So what you are doing is starting to become net destructive:

 - it is crowding out real review

 - the 'solutions' you suggest often result in worse code

 - you burden contributors and maintainers with make-work 'cleanups'
   created by a professional cleanup bureaucrat.

Stop doing these low-substance reviews! Do some real reviews (which 
can very much include trivial comments as well), or try to improve 
your review abilities by writing real kernel code.

Thanks,

	Ingo


* Re: [PATCH 3/3] sched,numa: do not set preferred_node on migration to a second choice node
  2014-04-11 17:00 ` [PATCH 3/3] sched,numa: do not set preferred_node on migration to a second choice node riel
@ 2014-04-14 12:56   ` Peter Zijlstra
  2014-04-15 14:35     ` Rik van Riel
  2014-04-25  9:09   ` Mel Gorman
  2014-05-08 10:43   ` [tip:sched/core] sched/numa: Do " tip-bot for Rik van Riel
  2 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2014-04-14 12:56 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, mingo, chegu_vinod, mgorman

On Fri, Apr 11, 2014 at 01:00:29PM -0400, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
> 
> Setting the numa_preferred_node for a task in task_numa_migrate
> does nothing on a 2-node system. Either we migrate to the node
> that already was our preferred node, or we stay where we were.
> 
> On a 4-node system, it can slightly decrease overhead, by not
> calling the NUMA code as much. Since every node tends to be
> directly connected to every other node, running on the wrong
> node for a while does not do much damage.
> 
> However, on an 8 node system, there are far more bad nodes
> than there are good ones, and pretending that a second choice
> is actually the preferred node can greatly delay, or even
> prevent, a workload from converging.
> 
> The only time we can safely pretend that a second choice
> node is the preferred node is when the task is part of a
> workload that spans multiple NUMA nodes.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Tested-by: Vinod Chegu <chegu_vinod@hp.com>
> ---
>  kernel/sched/fair.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index babd316..302facf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1301,7 +1301,16 @@ static int task_numa_migrate(struct task_struct *p)
>  	if (env.best_cpu == -1)
>  		return -EAGAIN;
>  
> -	sched_setnuma(p, env.dst_nid);
> +	/*
> +	 * If the task is part of a workload that spans multiple NUMA nodes,
> +	 * and is migrating into one of the workload's active nodes, remember

I read 'into' as:
  !node_isset(env.src_nid, ...) && node_isset(env.dst_nid, ...)

The code doesn't seem to do this.

> +	 * this node as the task's preferred numa node, so the workload can
> +	 * settle down.
> +	 * A task that migrated to a second choice node will be better off
> +	 * trying for a better one later. Do not set the preferred node here.
> +	 */
> +	if (p->numa_group && node_isset(env.dst_nid, p->numa_group->active_nodes))
> +		sched_setnuma(p, env.dst_nid);

OK, so I was totally confused on this one.

What I missed was that we set the primary choice over in
task_numa_placement().

I'm not really happy with the changelog; but I'm also struggling to
identify what exactly is missing. Or rather, the thing makes me
confused, and doesn't feel like it actually explains it properly.

That said, I tend to more or less agree with the actual change, but...


* Re: [PATCH 3/3] sched,numa: do not set preferred_node on migration to a second choice node
  2014-04-14 12:56   ` Peter Zijlstra
@ 2014-04-15 14:35     ` Rik van Riel
  2014-04-15 16:51       ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Rik van Riel @ 2014-04-15 14:35 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, chegu_vinod, mgorman

On 04/14/2014 08:56 AM, Peter Zijlstra wrote:
> On Fri, Apr 11, 2014 at 01:00:29PM -0400, riel@redhat.com wrote:
>> From: Rik van Riel <riel@redhat.com>
>>
>> Setting the numa_preferred_node for a task in task_numa_migrate
>> does nothing on a 2-node system. Either we migrate to the node
>> that already was our preferred node, or we stay where we were.
>>
>> On a 4-node system, it can slightly decrease overhead, by not
>> calling the NUMA code as much. Since every node tends to be
>> directly connected to every other node, running on the wrong
>> node for a while does not do much damage.
>>
>> However, on an 8 node system, there are far more bad nodes
>> than there are good ones, and pretending that a second choice
>> is actually the preferred node can greatly delay, or even
>> prevent, a workload from converging.
>>
>> The only time we can safely pretend that a second choice
>> node is the preferred node is when the task is part of a
>> workload that spans multiple NUMA nodes.
>>
>> Signed-off-by: Rik van Riel <riel@redhat.com>
>> Tested-by: Vinod Chegu <chegu_vinod@hp.com>
>> ---
>>   kernel/sched/fair.c | 11 ++++++++++-
>>   1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index babd316..302facf 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1301,7 +1301,16 @@ static int task_numa_migrate(struct task_struct *p)
>>   	if (env.best_cpu == -1)
>>   		return -EAGAIN;
>>
>> -	sched_setnuma(p, env.dst_nid);
>> +	/*
>> +	 * If the task is part of a workload that spans multiple NUMA nodes,
>> +	 * and is migrating into one of the workload's active nodes, remember
>
> I read 'into' as:
>    !node_isset(env.src_nid, ...) && node_isset(env.dst_nid, ...)
>
> The code doesn't seem to do this.

s/into/to/ makes the comment and the code match
again :)

>> +	 * this node as the task's preferred numa node, so the workload can
>> +	 * settle down.
>> +	 * A task that migrated to a second choice node will be better off
>> +	 * trying for a better one later. Do not set the preferred node here.
>> +	 */
>> +	if (p->numa_group && node_isset(env.dst_nid, p->numa_group->active_nodes))
>> +		sched_setnuma(p, env.dst_nid);
>
> OK, so I was totally confused on this one.
>
> What I missed was that we set the primary choice over in
> task_numa_placement().
>
> I'm not really happy with the changelog; but I'm also struggling to
> identify what exactly is missing. Or rather, the thing makes me
> confused, and doesn't feel like it actually explains it properly.
>
> That said, I tend to more or less agree with the actual change, but...

I have looked at the comment and the changelog some
more, and it is not clear to me what you are missing,
or what I could be explaining better...




* Re: [PATCH 3/3] sched,numa: do not set preferred_node on migration to a second choice node
  2014-04-15 14:35     ` Rik van Riel
@ 2014-04-15 16:51       ` Peter Zijlstra
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2014-04-15 16:51 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, mingo, chegu_vinod, mgorman

On Tue, Apr 15, 2014 at 10:35:29AM -0400, Rik van Riel wrote:
> I have looked at the comment and the changelog some
> more, and it is not clear to me what you are missing,
> or what I could be explaining better...

Yeah, can't blame you. I couldn't explain myself either, other than
having this feeling there's something there.

In any case, I can't blame you for that, and I did queue the patches.


* Re: [PATCH 1/3] sched,numa: count pages on active node as local
  2014-04-11 17:00 ` [PATCH 1/3] sched,numa: count pages on active node as local riel
  2014-04-11 17:34   ` Joe Perches
@ 2014-04-25  9:04   ` Mel Gorman
  2014-05-08 10:42   ` [tip:sched/core] sched/numa: Count " tip-bot for Rik van Riel
  2 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2014-04-25  9:04 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, mingo, peterz, chegu_vinod

On Fri, Apr 11, 2014 at 01:00:27PM -0400, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
> 
> The NUMA code is smart enough to distribute the memory of workloads
> that span multiple NUMA nodes across those NUMA nodes.
> 
> However, it still has a pretty high scan rate for such workloads,
> because any memory that is left on a node other than the node of
> the CPU that faulted on the memory is counted as non-local, which
> causes the scan rate to go up.
> 
> Counting the memory on any node where the task's numa group is
> actively running as local allows the scan rate to slow down
> once the application is settled in.
> 
> This should reduce the overhead of the automatic NUMA placement
> code, when a workload spans multiple NUMA nodes.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Tested-by: Vinod Chegu <chegu_vinod@hp.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 2/3] sched,numa: retry placement more frequently when misplaced
  2014-04-11 17:00 ` [PATCH 2/3] sched,numa: retry placement more frequently when misplaced riel
  2014-04-11 17:46   ` Joe Perches
@ 2014-04-25  9:05   ` Mel Gorman
  2014-05-08 10:42   ` [tip:sched/core] sched/numa: Retry " tip-bot for Rik van Riel
  2 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2014-04-25  9:05 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, mingo, peterz, chegu_vinod

On Fri, Apr 11, 2014 at 01:00:28PM -0400, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
> 
> When tasks have not converged on their preferred nodes yet, we want
> to retry fairly often, to make sure we do not migrate a task's memory
> to an undesirable location, only to have to move it again later.
> 
> This patch reduces the interval at which migration is retried,
> when the task's numa_scan_period is small.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Tested-by: Vinod Chegu <chegu_vinod@hp.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 3/3] sched,numa: do not set preferred_node on migration to a second choice node
  2014-04-11 17:00 ` [PATCH 3/3] sched,numa: do not set preferred_node on migration to a second choice node riel
  2014-04-14 12:56   ` Peter Zijlstra
@ 2014-04-25  9:09   ` Mel Gorman
  2014-05-08 10:43   ` [tip:sched/core] sched/numa: Do " tip-bot for Rik van Riel
  2 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2014-04-25  9:09 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, mingo, peterz, chegu_vinod

On Fri, Apr 11, 2014 at 01:00:29PM -0400, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
> 
> Setting the numa_preferred_node for a task in task_numa_migrate
> does nothing on a 2-node system. Either we migrate to the node
> that already was our preferred node, or we stay where we were.
> 
> On a 4-node system, it can slightly decrease overhead, by not
> calling the NUMA code as much. Since every node tends to be
> directly connected to every other node, running on the wrong
> node for a while does not do much damage.
> 

Guess what size of machine I do the vast bulk of testing of automatic
NUMA balancing on!

> However, on an 8 node system, there are far more bad nodes
> than there are good ones, and pretending that a second choice
> is actually the preferred node can greatly delay, or even
> prevent, a workload from converging.
> 
> The only time we can safely pretend that a second choice
> node is the preferred node is when the task is part of a
> workload that spans multiple NUMA nodes.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Tested-by: Vinod Chegu <chegu_vinod@hp.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs


* [tip:sched/core] sched/numa: Count pages on active node as local
  2014-04-11 17:00 ` [PATCH 1/3] sched,numa: count pages on active node as local riel
  2014-04-11 17:34   ` Joe Perches
  2014-04-25  9:04   ` Mel Gorman
@ 2014-05-08 10:42   ` tip-bot for Rik van Riel
  2 siblings, 0 replies; 19+ messages in thread
From: tip-bot for Rik van Riel @ 2014-05-08 10:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, peterz, riel, chegu_vinod,
	mgorman, tglx

Commit-ID:  792568ec6a31ca560ca4d528782cbc6cd2cea8b0
Gitweb:     http://git.kernel.org/tip/792568ec6a31ca560ca4d528782cbc6cd2cea8b0
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Fri, 11 Apr 2014 13:00:27 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 7 May 2014 13:33:45 +0200

sched/numa: Count pages on active node as local

The NUMA code is smart enough to distribute the memory of workloads
that span multiple NUMA nodes across those NUMA nodes.

However, it still has a pretty high scan rate for such workloads,
because any memory that is left on a node other than the node of
the CPU that faulted on the memory is counted as non-local, which
causes the scan rate to go up.

Counting the memory on any node where the task's numa group is
actively running as local allows the scan rate to slow down
once the application is settled in.

This should reduce the overhead of the automatic NUMA placement
code, when a workload spans multiple NUMA nodes.

Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Vinod Chegu <chegu_vinod@hp.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1397235629-16328-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5d859ec..f6457b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1738,6 +1738,7 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 	struct task_struct *p = current;
 	bool migrated = flags & TNF_MIGRATED;
 	int cpu_node = task_node(current);
+	int local = !!(flags & TNF_FAULT_LOCAL);
 	int priv;
 
 	if (!numabalancing_enabled)
@@ -1786,6 +1787,17 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 			task_numa_group(p, last_cpupid, flags, &priv);
 	}
 
+	/*
+	 * If a workload spans multiple NUMA nodes, a shared fault that
+	 * occurs wholly within the set of nodes that the workload is
+	 * actively using should be counted as local. This allows the
+	 * scan rate to slow down when a workload has settled down.
+	 */
+	if (!priv && !local && p->numa_group &&
+			node_isset(cpu_node, p->numa_group->active_nodes) &&
+			node_isset(mem_node, p->numa_group->active_nodes))
+		local = 1;
+
 	task_numa_placement(p);
 
 	/*
@@ -1800,7 +1812,7 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 
 	p->numa_faults_buffer_memory[task_faults_idx(mem_node, priv)] += pages;
 	p->numa_faults_buffer_cpu[task_faults_idx(cpu_node, priv)] += pages;
-	p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
+	p->numa_faults_locality[local] += pages;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)


* [tip:sched/core] sched/numa: Retry placement more frequently when misplaced
  2014-04-11 17:00 ` [PATCH 2/3] sched,numa: retry placement more frequently when misplaced riel
  2014-04-11 17:46   ` Joe Perches
  2014-04-25  9:05   ` Mel Gorman
@ 2014-05-08 10:42   ` tip-bot for Rik van Riel
  2 siblings, 0 replies; 19+ messages in thread
From: tip-bot for Rik van Riel @ 2014-05-08 10:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, peterz, riel, chegu_vinod,
	mgorman, tglx

Commit-ID:  5085e2a328849bdee6650b32d52c87c3788ab01c
Gitweb:     http://git.kernel.org/tip/5085e2a328849bdee6650b32d52c87c3788ab01c
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Fri, 11 Apr 2014 13:00:28 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 7 May 2014 13:33:46 +0200

sched/numa: Retry placement more frequently when misplaced

When tasks have not converged on their preferred nodes yet, we want
to retry fairly often, to make sure we do not migrate a task's memory
to an undesirable location, only to have to move it again later.

This patch reduces the interval at which migration is retried,
when the task's numa_scan_period is small.

Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Vinod Chegu <chegu_vinod@hp.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1397235629-16328-3-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f6457b6..ecea8d9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1326,12 +1326,15 @@ static int task_numa_migrate(struct task_struct *p)
 /* Attempt to migrate a task to a CPU on the preferred node. */
 static void numa_migrate_preferred(struct task_struct *p)
 {
+	unsigned long interval = HZ;
+
 	/* This task has no NUMA fault statistics yet */
 	if (unlikely(p->numa_preferred_nid == -1 || !p->numa_faults_memory))
 		return;
 
 	/* Periodically retry migrating the task to the preferred node */
-	p->numa_migrate_retry = jiffies + HZ;
+	interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16);
+	p->numa_migrate_retry = jiffies + interval;
 
 	/* Success if task is already running on preferred CPU */
 	if (task_node(p) == p->numa_preferred_nid)


* [tip:sched/core] sched/numa: Do not set preferred_node on migration to a second choice node
  2014-04-11 17:00 ` [PATCH 3/3] sched,numa: do not set preferred_node on migration to a second choice node riel
  2014-04-14 12:56   ` Peter Zijlstra
  2014-04-25  9:09   ` Mel Gorman
@ 2014-05-08 10:43   ` tip-bot for Rik van Riel
  2 siblings, 0 replies; 19+ messages in thread
From: tip-bot for Rik van Riel @ 2014-05-08 10:43 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, peterz, riel, chegu_vinod,
	mgorman, tglx

Commit-ID:  68d1b02a58f5d9f584c1fb2923ed60ec68cbbd9b
Gitweb:     http://git.kernel.org/tip/68d1b02a58f5d9f584c1fb2923ed60ec68cbbd9b
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Fri, 11 Apr 2014 13:00:29 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 7 May 2014 13:33:47 +0200

sched/numa: Do not set preferred_node on migration to a second choice node

Setting the numa_preferred_node for a task in task_numa_migrate
does nothing on a 2-node system. Either we migrate to the node
that already was our preferred node, or we stay where we were.

On a 4-node system, it can slightly decrease overhead, by not
calling the NUMA code as much. Since every node tends to be
directly connected to every other node, running on the wrong
node for a while does not do much damage.

However, on an 8 node system, there are far more bad nodes
than there are good ones, and pretending that a second choice
is actually the preferred node can greatly delay, or even
prevent, a workload from converging.

The only time we can safely pretend that a second choice
node is the preferred node is when the task is part of a
workload that spans multiple NUMA nodes.

Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Vinod Chegu <chegu_vinod@hp.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1397235629-16328-4-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ecea8d9..051903f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1301,7 +1301,16 @@ static int task_numa_migrate(struct task_struct *p)
 	if (env.best_cpu == -1)
 		return -EAGAIN;
 
-	sched_setnuma(p, env.dst_nid);
+	/*
+	 * If the task is part of a workload that spans multiple NUMA nodes,
+	 * and is migrating into one of the workload's active nodes, remember
+	 * this node as the task's preferred numa node, so the workload can
+	 * settle down.
+	 * A task that migrated to a second choice node will be better off
+	 * trying for a better one later. Do not set the preferred node here.
+	 */
+	if (p->numa_group && node_isset(env.dst_nid, p->numa_group->active_nodes))
+		sched_setnuma(p, env.dst_nid);
 
 	/*
 	 * Reset the scan period if the task is being rescheduled on an

