From: Rik van Riel <riel@surriel.com>
To: linux-kernel@vger.kernel.org
Cc: peterz@infradead.org, mingo@kernel.org, mgorman@suse.de,
	jstancek@redhat.com
Subject: [PATCH] sched,numa cap pte scanning overhead to 3% of run time
Date: Wed, 4 Nov 2015 13:25:15 -0500
Message-ID: <20151104132515.07e41b75@annuminas.surriel.com>

There is a fundamental mismatch between the runtime-based NUMA scanning
at the task level and the wall-clock-based NUMA scanning at the mm level.
On a severely overloaded system with very large processes, this mismatch
can cause the system to spend all of its time in change_prot_numa().

This can happen if the task spends at least two ticks in change_prot_numa(),
and only gets two ticks of CPU time in the real time between two scan
intervals of the mm.

This patch ensures that if the system is so busy that the task got
rescheduled during change_prot_numa(), we never spend more than 3% of run
time scanning PTEs.

This patch does nothing if the CPU is not overloaded at all, and the
task is not rescheduled during change_prot_numa().
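
As a rough illustration of where the 3% figure comes from, the sketch
below models the new hunk in user space. struct fake_task and
throttle_scan() are hypothetical stand-ins, not kernel code; only the
32 * diff pushback is taken from the patch. If a scan was charged diff
nanoseconds of run time, the next scan is deferred by 32 * diff of run
time, so scanning should stay at or below about 1/32 (roughly 3%) of
the task's run time.

```c
#include <stdint.h>

/* Hypothetical user-space stand-in for the two fields the patch uses. */
struct fake_task {
	uint64_t sum_exec_runtime;	/* CPU time consumed so far, in ns */
	uint64_t node_stamp;		/* run time the next scan is measured from */
};

/*
 * Models the check added at the end of task_numa_work().
 * runtime_at_entry is the value of sum_exec_runtime sampled when the
 * scan started. If it has not changed, the task was never charged for
 * the scan and nothing happens.
 */
static inline void throttle_scan(struct fake_task *p, uint64_t runtime_at_entry)
{
	if (p->sum_exec_runtime != runtime_at_entry) {
		uint64_t diff = p->sum_exec_runtime - runtime_at_entry;

		/* Defer the next scan by 32x the run time this scan cost. */
		p->node_stamp += 32 * diff;
	}
}
```

For example, a scan charged 2 ms of run time would push node_stamp
forward by 64 ms, while a scan that was never charged any run time
leaves node_stamp untouched.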

All of the above only works if we fix the math underflow issue in
task_tick_numa(), so do that as well (Jan Stancek).
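
A minimal sketch of the underflow, assuming the values involved are
unsigned 64-bit as in the diff below. scan_due_old() and scan_due_new()
are hypothetical helper names that only mirror the two forms of the
comparison. Once node_stamp can be pushed ahead of the task's run time,
the subtraction in the old form wraps around to a huge value and the
scan looks overdue.

```c
#include <stdbool.h>
#include <stdint.h>

/* Old form: wraps around when node_stamp is ahead of now. */
static inline bool scan_due_old(uint64_t now, uint64_t node_stamp,
				uint64_t period)
{
	return now - node_stamp > period;
}

/* New form: no subtraction, so a future node_stamp simply defers the scan. */
static inline bool scan_due_new(uint64_t now, uint64_t node_stamp,
				uint64_t period)
{
	return now > node_stamp + period;
}
```

With now = 100, node_stamp = 500 and period = 50, the old form reports
a scan as due because 100 - 500 wraps, while the new form does not.
When node_stamp is behind now, both forms agree.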

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-and-tested-by: Jan Stancek <jstancek@redhat.com>
---
 kernel/sched/fair.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 824aa9f501a3..e9b9ac424a76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2155,6 +2155,7 @@ void task_numa_work(struct callback_head *work)
 	unsigned long migrate, next_scan, now = jiffies;
 	struct task_struct *p = current;
 	struct mm_struct *mm = p->mm;
+	u64 runtime = p->se.sum_exec_runtime;
 	struct vm_area_struct *vma;
 	unsigned long start, end;
 	unsigned long nr_pte_updates = 0;
@@ -2277,6 +2278,20 @@ void task_numa_work(struct callback_head *work)
 	else
 		reset_ptenuma_scan(p);
 	up_read(&mm->mmap_sem);
+
+	/*
+	 * There is a fundamental mismatch between the runtime based
+	 * NUMA scanning at the task level, and the wall clock time
+	 * NUMA scanning at the mm level. On a severely overloaded
+	 * system, with very large processes, this mismatch can cause
+	 * the system to spend all of its time in change_prot_numa().
+	 * Limit NUMA PTE scanning to 3% of the task's run time, if
+	 * we spent so much time scanning we got rescheduled.
+	 */
+	if (unlikely(p->se.sum_exec_runtime != runtime)) {
+		u64 diff = p->se.sum_exec_runtime - runtime;
+		p->node_stamp += 32 * diff;
+	}
 }
 
 /*
@@ -2302,7 +2317,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	now = curr->se.sum_exec_runtime;
 	period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
 
-	if (now - curr->node_stamp > period) {
+	if (now > curr->node_stamp + period) {
 		if (!curr->node_stamp)
 			curr->numa_scan_period = task_scan_min(curr);
 		curr->node_stamp += period;
