linux-kernel.vger.kernel.org archive mirror
* [PATCH] sched,numa cap pte scanning overhead to 3% of run time
@ 2015-11-04 18:25 Rik van Riel
  2015-11-05 15:34 ` Peter Zijlstra
  0 siblings, 1 reply; 4+ messages in thread
From: Rik van Riel @ 2015-11-04 18:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, mgorman, jstancek

There is a fundamental mismatch between the runtime based NUMA scanning
at the task level, and the wall clock time NUMA scanning at the mm level.
On a severely overloaded system, with very large processes, this mismatch
can cause the system to spend all of its time in change_prot_numa().

This can happen if the task spends at least two ticks in change_prot_numa(),
and gets only those two ticks of CPU time during the wall clock interval
between two scans of the mm.

This patch ensures that if the system is so busy that the task got
rescheduled during change_prot_numa(), we never spend more than 3% of run
time scanning PTEs.

This patch does nothing if the CPU is not overloaded at all, and the
task is not rescheduled during change_prot_numa().

All of the above only works if we fix the math underflow issue in
task_numa_tick, so do that as well (Jan Stancek).
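
To make the 3% figure concrete, here is a rough user-space sketch of the
throttle (charge_scan_time() is a hypothetical stand-in, not the kernel
code; only the 32x factor is taken from the patch below):

#include <stdio.h>

typedef unsigned long long u64;

/* time spent scanning is charged 32x against the next scan deadline */
static void charge_scan_time(u64 *node_stamp, u64 before, u64 after)
{
        if (after != before)    /* runtime advanced: ticked or rescheduled */
                *node_stamp += 32 * (after - before);
}

int main(void)
{
        u64 node_stamp = 0, before = 0, after = 5000000;  /* a 5 ms scan */

        charge_scan_time(&node_stamp, before, after);
        /* the next scan must wait another 160 ms of runtime, so at most
         * 5 / (5 + 160) ~= 3% of run time is spent scanning */
        printf("deadline pushed out by %llu ns\n", node_stamp);
        return 0;
}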

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-and-tested-by: Jan Stancek <jstancek@redhat.com>
---
 kernel/sched/fair.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 824aa9f501a3..e9b9ac424a76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2155,6 +2155,7 @@ void task_numa_work(struct callback_head *work)
 	unsigned long migrate, next_scan, now = jiffies;
 	struct task_struct *p = current;
 	struct mm_struct *mm = p->mm;
+	u64 runtime = p->se.sum_exec_runtime;
 	struct vm_area_struct *vma;
 	unsigned long start, end;
 	unsigned long nr_pte_updates = 0;
@@ -2277,6 +2278,20 @@ void task_numa_work(struct callback_head *work)
 	else
 		reset_ptenuma_scan(p);
 	up_read(&mm->mmap_sem);
+
+	/*
+	 * There is a fundamental mismatch between the runtime based
+	 * NUMA scanning at the task level, and the wall clock time
+	 * NUMA scanning at the mm level. On a severely overloaded
+	 * system, with very large processes, this mismatch can cause
+	 * the system to spend all of its time in change_prot_numa().
+	 * Limit NUMA PTE scanning to 3% of the task's run time, if
+	 * we spent so much time scanning we got rescheduled.
+	 */
+	if (unlikely(p->se.sum_exec_runtime != runtime)) {
+		u64 diff = p->se.sum_exec_runtime - runtime;
+		p->node_stamp += 32 * diff;
+	}
 }
 
 /*
@@ -2302,7 +2317,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	now = curr->se.sum_exec_runtime;
 	period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
 
-	if (now - curr->node_stamp > period) {
+	if (now > curr->node_stamp + period) {
 		if (!curr->node_stamp)
 			curr->numa_scan_period = task_scan_min(curr);
 		curr->node_stamp += period;
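
For context on the last hunk: once node_stamp has been pushed ahead of
sum_exec_runtime by the 32x charge above, the old unsigned subtraction
wraps to a huge value and the check keeps firing, defeating the throttle.
A stand-alone sketch of the two checks (plain user-space C, made-up
values):

#include <stdio.h>

typedef unsigned long long u64;

int main(void)
{
        /* node_stamp ran ahead of now after the 32x advance */
        u64 now = 1000, node_stamp = 5000, period = 100;

        /* old check: now - node_stamp wraps to a huge value, always true */
        printf("old: %d\n", now - node_stamp > period);   /* prints old: 1 */

        /* fixed check: no subtraction, so no wrap-around */
        printf("new: %d\n", now > node_stamp + period);   /* prints new: 0 */
        return 0;
}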

* Re: [PATCH] sched,numa cap pte scanning overhead to 3% of run time
  2015-11-04 18:25 [PATCH] sched,numa cap pte scanning overhead to 3% of run time Rik van Riel
@ 2015-11-05 15:34 ` Peter Zijlstra
  2015-11-05 15:56   ` Rik van Riel
  0 siblings, 1 reply; 4+ messages in thread
From: Peter Zijlstra @ 2015-11-05 15:34 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, mingo, mgorman, jstancek

On Wed, Nov 04, 2015 at 01:25:15PM -0500, Rik van Riel wrote:
> +++ b/kernel/sched/fair.c
> @@ -2155,6 +2155,7 @@ void task_numa_work(struct callback_head *work)
>  	unsigned long migrate, next_scan, now = jiffies;
>  	struct task_struct *p = current;
>  	struct mm_struct *mm = p->mm;
> +	u64 runtime = p->se.sum_exec_runtime;
>  	struct vm_area_struct *vma;
>  	unsigned long start, end;
>  	unsigned long nr_pte_updates = 0;
> @@ -2277,6 +2278,20 @@ void task_numa_work(struct callback_head *work)
>  	else
>  		reset_ptenuma_scan(p);
>  	up_read(&mm->mmap_sem);
> +
> +	/*
> +	 * There is a fundamental mismatch between the runtime based
> +	 * NUMA scanning at the task level, and the wall clock time
> +	 * NUMA scanning at the mm level. On a severely overloaded
> +	 * system, with very large processes, this mismatch can cause
> +	 * the system to spend all of its time in change_prot_numa().
> +	 * Limit NUMA PTE scanning to 3% of the task's run time, if
> +	 * we spent so much time scanning we got rescheduled.
> +	 */
> +	if (unlikely(p->se.sum_exec_runtime != runtime)) {
> +		u64 diff = p->se.sum_exec_runtime - runtime;
> +		p->node_stamp += 32 * diff;
> +	}

I don't actually see how this does what it says it does.

>  }
>  
>  /*
> @@ -2302,7 +2317,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
>  	now = curr->se.sum_exec_runtime;
>  	period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
>  
> -	if (now - curr->node_stamp > period) {
> +	if (now > curr->node_stamp + period) {
>  		if (!curr->node_stamp)
>  			curr->numa_scan_period = task_scan_min(curr);
>  		curr->node_stamp += period;

And this really should be an independent patch. Although the fix I had
in mind looked like:

	if ((s64)(now - curr->node_stamp) > period)

But I suppose this works too.
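
A quick sketch of the suggested signed-difference idiom (plain user-space
C, made-up values; the comparison is kept fully signed here so the wrapped
difference really does compare as negative):

#include <stdio.h>

typedef unsigned long long u64;
typedef long long s64;

int main(void)
{
        u64 now = 1000, node_stamp = 5000;  /* node_stamp ran ahead of now */
        s64 period = 100;

        /* the u64 subtraction wraps, but viewed as s64 it is just -4000,
         * so the check stays false -- the same outcome as the
         * now > node_stamp + period form in the patch */
        printf("%d\n", (s64)(now - node_stamp) > period);  /* prints 0 */
        return 0;
}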

* Re: [PATCH] sched,numa cap pte scanning overhead to 3% of run time
  2015-11-05 15:34 ` Peter Zijlstra
@ 2015-11-05 15:56   ` Rik van Riel
  2015-11-05 16:37     ` Peter Zijlstra
  0 siblings, 1 reply; 4+ messages in thread
From: Rik van Riel @ 2015-11-05 15:56 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, mgorman, jstancek

On 11/05/2015 10:34 AM, Peter Zijlstra wrote:
> On Wed, Nov 04, 2015 at 01:25:15PM -0500, Rik van Riel wrote:
>> +++ b/kernel/sched/fair.c
>> @@ -2155,6 +2155,7 @@ void task_numa_work(struct callback_head *work)
>>  	unsigned long migrate, next_scan, now = jiffies;
>>  	struct task_struct *p = current;
>>  	struct mm_struct *mm = p->mm;
>> +	u64 runtime = p->se.sum_exec_runtime;
>>  	struct vm_area_struct *vma;
>>  	unsigned long start, end;
>>  	unsigned long nr_pte_updates = 0;
>> @@ -2277,6 +2278,20 @@ void task_numa_work(struct callback_head *work)
>>  	else
>>  		reset_ptenuma_scan(p);
>>  	up_read(&mm->mmap_sem);
>> +
>> +	/*
>> +	 * There is a fundamental mismatch between the runtime based
>> +	 * NUMA scanning at the task level, and the wall clock time
>> +	 * NUMA scanning at the mm level. On a severely overloaded
>> +	 * system, with very large processes, this mismatch can cause
>> +	 * the system to spend all of its time in change_prot_numa().
>> +	 * Limit NUMA PTE scanning to 3% of the task's run time, if
>> +	 * we spent so much time scanning we got rescheduled.
>> +	 */
>> +	if (unlikely(p->se.sum_exec_runtime != runtime)) {
>> +		u64 diff = p->se.sum_exec_runtime - runtime;
>> +		p->node_stamp += 32 * diff;
>> +	}
> 
> I don't actually see how this does what it says it does

If we got rescheduled between the assignment of runtime
above and this point, the scheduler should have
updated the p->se.sum_exec_runtime statistic, given
that update_curr is called from both dequeue_entity
and enqueue_entity in fair.c.

Advancing the node_stamp by 32x the amount of time
the task consumed between entering task_numa_work and
this point should ensure task_numa_work does not get
queued again until we have used 32x as much time doing
something else.

That should limit the CPU time used by task_numa_work.

What am I missing?

>> @@ -2302,7 +2317,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
>>  	now = curr->se.sum_exec_runtime;
>>  	period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
>>  
>> -	if (now - curr->node_stamp > period) {
>> +	if (now > curr->node_stamp + period) {
>>  		if (!curr->node_stamp)
>>  			curr->numa_scan_period = task_scan_min(curr);
>>  		curr->node_stamp += period;
> 
> And this really should be an independent patch. Although the fix I had
> in mind looked like:
> 
> 	if ((s64)(now - curr->node_stamp) > period)
> 
> But I suppose this works too.

I can resend this as a separate patch if you prefer.



* Re: [PATCH] sched,numa cap pte scanning overhead to 3% of run time
  2015-11-05 15:56   ` Rik van Riel
@ 2015-11-05 16:37     ` Peter Zijlstra
  0 siblings, 0 replies; 4+ messages in thread
From: Peter Zijlstra @ 2015-11-05 16:37 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, mingo, mgorman, jstancek

On Thu, Nov 05, 2015 at 10:56:29AM -0500, Rik van Riel wrote:
> On 11/05/2015 10:34 AM, Peter Zijlstra wrote:
> > On Wed, Nov 04, 2015 at 01:25:15PM -0500, Rik van Riel wrote:
> >> +++ b/kernel/sched/fair.c
> >> @@ -2155,6 +2155,7 @@ void task_numa_work(struct callback_head *work)
> >>  	unsigned long migrate, next_scan, now = jiffies;
> >>  	struct task_struct *p = current;
> >>  	struct mm_struct *mm = p->mm;
> >> +	u64 runtime = p->se.sum_exec_runtime;
> >>  	struct vm_area_struct *vma;
> >>  	unsigned long start, end;
> >>  	unsigned long nr_pte_updates = 0;
> >> @@ -2277,6 +2278,20 @@ void task_numa_work(struct callback_head *work)
> >>  	else
> >>  		reset_ptenuma_scan(p);
> >>  	up_read(&mm->mmap_sem);
> >> +
> >> +	/*
> >> +	 * There is a fundamental mismatch between the runtime based
> >> +	 * NUMA scanning at the task level, and the wall clock time
> >> +	 * NUMA scanning at the mm level. On a severely overloaded
> >> +	 * system, with very large processes, this mismatch can cause
> >> +	 * the system to spend all of its time in change_prot_numa().
> >> +	 * Limit NUMA PTE scanning to 3% of the task's run time, if
> >> +	 * we spent so much time scanning we got rescheduled.
> >> +	 */
> >> +	if (unlikely(p->se.sum_exec_runtime != runtime)) {
> >> +		u64 diff = p->se.sum_exec_runtime - runtime;
> >> +		p->node_stamp += 32 * diff;
> >> +	}
> > 
> > I don't actually see how this does what it says it does
> 
> If we got rescheduled during the assigning of runtime

Or just had a tick. Even if the whole thing took only a fraction of a ms,
if we got unlucky and got hit by a tick, the sum_exec_runtime would get
updated and no longer match here.

> Advancing the node_stamp by 32x the amount of time
> the task consumed between entering task_numa_work and
> this point should ensure task_numa_work does not get
> queued again until we have used 32x as much time doing
> something else.

> What am I missing?

The above issue, and the fact that I'm really tired and didn't work out
that 1:32 ~ 3%.

So the tick scenario can cause a 32*TICK_NSEC delay even though we spend
much less than TICK_NSEC time scanning, dropping the effective rate much
below the 3%.

Not sure it makes sense to do more accurate accounting, but I suppose we
should mention it somewhere.
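
To put a rough number on it (assuming HZ=1000, so TICK_NSEC is 1 ms): if a
scan pass takes only 50 us of CPU time but happens to straddle a tick,
sum_exec_runtime can move by about a full tick, node_stamp gets pushed out
by roughly 32 ms, and the effective scan rate drops to around
50 us / 32 ms, i.e. ~0.16%, well under the intended 3%.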

> >> @@ -2302,7 +2317,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
> >>  	now = curr->se.sum_exec_runtime;
> >>  	period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
> >>  
> >> -	if (now - curr->node_stamp > period) {
> >> +	if (now > curr->node_stamp + period) {
> >>  		if (!curr->node_stamp)
> >>  			curr->numa_scan_period = task_scan_min(curr);
> >>  		curr->node_stamp += period;

> I can resend this as a separate patch if you prefer.

Yes, it's an unrelated fix.
