* [PATCH 0/2] sched,numa: cap pte scanning overhead to 3% of run time
@ 2015-11-05 20:56 riel
  2015-11-05 20:56 ` [PATCH 1/2] sched,numa: fix math underflow in task_tick_numa riel
  2015-11-05 20:56 ` [PATCH 2/2] sched,numa: cap pte scanning overhead to 3% of run time riel
  0 siblings, 2 replies; 5+ messages in thread
From: riel @ 2015-11-05 20:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, jstancek, mgorman

Jan Stancek identified an LTP stress test causing trouble with the
NUMA balancing code. The test forks off enough 3GB sized tasks to
fill up 80% of system memory on a system with 12TB RAM. That results
in over 2000 tasks allocating and touching memory simultaneously.

The NUMA balancing code causes each task to scan a certain number of PTEs
every 10ms. Due to the large number of tasks on the system, and the large
amount of memory in each process, it may take 10ms for each task to finish
its PTE scan.

Meanwhile, the NUMA code only tries to ensure each task has used a few (2-3)
ms of CPU time between invocations of task_numa_work.

On a system that is overloaded to this degree, we end up spending essentially
all of our CPU time in task_numa_work, and the tasks make very little progress.

Allocating all the memory can take several hours.

With these patches, the CPU time spent in task_numa_work is limited to
around 3% of run time, and the test case completes in minutes.



* [PATCH 1/2] sched,numa: fix math underflow in task_tick_numa
  2015-11-05 20:56 [PATCH 0/2] sched,numa: cap pte scanning overhead to 3% of run time riel
@ 2015-11-05 20:56 ` riel
  2015-11-10  6:40   ` [tip:sched/urgent] sched/numa: Fix math underflow in task_tick_numa() tip-bot for Rik van Riel
  2015-11-05 20:56 ` [PATCH 2/2] sched,numa: cap pte scanning overhead to 3% of run time riel
  1 sibling, 1 reply; 5+ messages in thread
From: riel @ 2015-11-05 20:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, jstancek, mgorman

From: Rik van Riel <riel@redhat.com>

The NUMA balancing code implements delays in scanning by
advancing curr->node_stamp beyond curr->se.sum_exec_runtime.

With unsigned math, that creates an underflow, which results
in task_numa_work being queued all the time, even when we
don't want to.

Avoiding the math underflow makes it possible to reduce CPU
overhead in the NUMA balancing code.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-and-tested-by: Jan Stancek <jstancek@redhat.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 824aa9f501a3..f04fda8f669c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2302,7 +2302,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	now = curr->se.sum_exec_runtime;
 	period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
 
-	if (now - curr->node_stamp > period) {
+	if (now > curr->node_stamp + period) {
 		if (!curr->node_stamp)
 			curr->numa_scan_period = task_scan_min(curr);
 		curr->node_stamp += period;
-- 
2.1.0



* [PATCH 2/2] sched,numa: cap pte scanning overhead to 3% of run time
  2015-11-05 20:56 [PATCH 0/2] sched,numa: cap pte scanning overhead to 3% of run time riel
  2015-11-05 20:56 ` [PATCH 1/2] sched,numa: fix math underflow in task_tick_numa riel
@ 2015-11-05 20:56 ` riel
  2015-11-23 16:19   ` [tip:sched/core] sched/numa: Cap PTE " tip-bot for Rik van Riel
  1 sibling, 1 reply; 5+ messages in thread
From: riel @ 2015-11-05 20:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, jstancek, mgorman

From: Rik van Riel <riel@redhat.com>

There is a fundamental mismatch between the runtime based NUMA scanning
at the task level, and the wall clock time NUMA scanning at the mm level.
On a severely overloaded system, with very large processes, this mismatch
can cause the system to spend all of its time in change_prot_numa().

This can happen if the task spends at least two ticks in change_prot_numa(),
and only gets two ticks of CPU time in the real time between two scan
intervals of the mm.

This patch ensures that a task never spends more than 3% of run
time scanning PTEs. It does that by ensuring that in-between
task_numa_work runs, the task spends at least 32x as much time on
other things as it spent in task_numa_work.

This is done stochastically: if a timer tick happens, or the task
gets rescheduled during task_numa_work, we delay a future run of
task_numa_work until the task has spent at least 32x the amount of
CPU time doing something else, as it spent inside task_numa_work.
The longer task_numa_work takes, the more likely it is that this happens.

If task_numa_work takes very little time, chances are low that this
code will do anything, but then we do not care.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-and-tested-by: Jan Stancek <jstancek@redhat.com>
---
 kernel/sched/fair.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f04fda8f669c..b0924377ab0d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2155,6 +2155,7 @@ void task_numa_work(struct callback_head *work)
 	unsigned long migrate, next_scan, now = jiffies;
 	struct task_struct *p = current;
 	struct mm_struct *mm = p->mm;
+	u64 runtime = p->se.sum_exec_runtime;
 	struct vm_area_struct *vma;
 	unsigned long start, end;
 	unsigned long nr_pte_updates = 0;
@@ -2277,6 +2278,17 @@ void task_numa_work(struct callback_head *work)
 	else
 		reset_ptenuma_scan(p);
 	up_read(&mm->mmap_sem);
+
+	/*
+	 * Make sure tasks use at least 32x as much time to run other code
+	 * than they used here, to limit NUMA PTE scanning overhead to 3% max.
+	 * Usually update_task_scan_period slows down scanning enough; on an
+	 * overloaded system we need to limit overhead on a per task basis.
+	 */
+	if (unlikely(p->se.sum_exec_runtime != runtime)) {
+		u64 diff = p->se.sum_exec_runtime - runtime;
+		p->node_stamp += 32 * diff;
+	}
 }
 
 /*
-- 
2.1.0



* [tip:sched/urgent] sched/numa: Fix math underflow in task_tick_numa()
  2015-11-05 20:56 ` [PATCH 1/2] sched,numa: fix math underflow in task_tick_numa riel
@ 2015-11-10  6:40   ` tip-bot for Rik van Riel
  0 siblings, 0 replies; 5+ messages in thread
From: tip-bot for Rik van Riel @ 2015-11-10  6:40 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: riel, jstancek, mingo, linux-kernel, hpa, peterz, tglx, torvalds

Commit-ID:  25b3e5a3344e1f700c1efec5b6f0199f04707fb1
Gitweb:     http://git.kernel.org/tip/25b3e5a3344e1f700c1efec5b6f0199f04707fb1
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Thu, 5 Nov 2015 15:56:22 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 9 Nov 2015 16:13:27 +0100

sched/numa: Fix math underflow in task_tick_numa()

The NUMA balancing code implements delays in scanning by
advancing curr->node_stamp beyond curr->se.sum_exec_runtime.

With unsigned math, that creates an underflow, which results
in task_numa_work being queued all the time, even when we
don't want to.

Avoiding the math underflow makes it possible to reduce CPU
overhead in the NUMA balancing code.

Reported-and-tested-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: mgorman@suse.de
Link: http://lkml.kernel.org/r/1446756983-28173-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 824aa9f..f04fda8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2302,7 +2302,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	now = curr->se.sum_exec_runtime;
 	period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
 
-	if (now - curr->node_stamp > period) {
+	if (now > curr->node_stamp + period) {
 		if (!curr->node_stamp)
 			curr->numa_scan_period = task_scan_min(curr);
 		curr->node_stamp += period;


* [tip:sched/core] sched/numa: Cap PTE scanning overhead to 3% of run time
  2015-11-05 20:56 ` [PATCH 2/2] sched,numa: cap pte scanning overhead to 3% of run time riel
@ 2015-11-23 16:19   ` tip-bot for Rik van Riel
  0 siblings, 0 replies; 5+ messages in thread
From: tip-bot for Rik van Riel @ 2015-11-23 16:19 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, tglx, linux-kernel, hpa, jstancek, riel, mingo, torvalds, efault

Commit-ID:  51170840fe91dfca10fd533b303ea39b2524782a
Gitweb:     http://git.kernel.org/tip/51170840fe91dfca10fd533b303ea39b2524782a
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Thu, 5 Nov 2015 15:56:23 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 23 Nov 2015 09:37:54 +0100

sched/numa: Cap PTE scanning overhead to 3% of run time

There is a fundamental mismatch between the runtime based NUMA scanning
at the task level, and the wall clock time NUMA scanning at the mm level.
On a severely overloaded system, with very large processes, this mismatch
can cause the system to spend all of its time in change_prot_numa().

This can happen if the task spends at least two ticks in change_prot_numa(),
and only gets two ticks of CPU time in the real time between two scan
intervals of the mm.

This patch ensures that a task never spends more than 3% of run
time scanning PTEs. It does that by ensuring that in-between
task_numa_work() runs, the task spends at least 32x as much time on
other things as it spent in task_numa_work().

This is done stochastically: if a timer tick happens, or the task
gets rescheduled during task_numa_work(), we delay a future run of
task_numa_work() until the task has spent at least 32x the amount of
CPU time doing something else, as it spent inside task_numa_work().
The longer task_numa_work() takes, the more likely it is that this happens.

If task_numa_work() takes very little time, chances are low that this
code will do anything, but then we do not care.

Reported-and-tested-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: mgorman@suse.de
Link: http://lkml.kernel.org/r/1446756983-28173-3-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 309b1d5..95b944e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2155,6 +2155,7 @@ void task_numa_work(struct callback_head *work)
 	unsigned long migrate, next_scan, now = jiffies;
 	struct task_struct *p = current;
 	struct mm_struct *mm = p->mm;
+	u64 runtime = p->se.sum_exec_runtime;
 	struct vm_area_struct *vma;
 	unsigned long start, end;
 	unsigned long nr_pte_updates = 0;
@@ -2277,6 +2278,17 @@ out:
 	else
 		reset_ptenuma_scan(p);
 	up_read(&mm->mmap_sem);
+
+	/*
+	 * Make sure tasks use at least 32x as much time to run other code
+	 * than they used here, to limit NUMA PTE scanning overhead to 3% max.
+	 * Usually update_task_scan_period slows down scanning enough; on an
+	 * overloaded system we need to limit overhead on a per task basis.
+	 */
+	if (unlikely(p->se.sum_exec_runtime != runtime)) {
+		u64 diff = p->se.sum_exec_runtime - runtime;
+		p->node_stamp += 32 * diff;
+	}
 }
 
 /*

