From: tip-bot for Rik van Riel <tipbot@zytor.com>
To: linux-tip-commits@vger.kernel.org
Cc: peterz@infradead.org, tglx@linutronix.de,
	linux-kernel@vger.kernel.org, hpa@zytor.com, jstancek@redhat.com,
	riel@redhat.com, mingo@kernel.org, torvalds@linux-foundation.org,
	efault@gmx.de
Subject: [tip:sched/core] sched/numa: Cap PTE scanning overhead to 3% of run time
Date: Mon, 23 Nov 2015 08:19:30 -0800
Message-ID: <tip-51170840fe91dfca10fd533b303ea39b2524782a@git.kernel.org>
In-Reply-To: <1446756983-28173-3-git-send-email-riel@redhat.com>

Commit-ID:  51170840fe91dfca10fd533b303ea39b2524782a
Gitweb:     http://git.kernel.org/tip/51170840fe91dfca10fd533b303ea39b2524782a
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Thu, 5 Nov 2015 15:56:23 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 23 Nov 2015 09:37:54 +0100

sched/numa: Cap PTE scanning overhead to 3% of run time

There is a fundamental mismatch between the runtime-based NUMA scanning
at the task level and the wall-clock-based NUMA scanning at the mm level.
On a severely overloaded system, with very large processes, this mismatch
can cause the system to spend all of its time in change_prot_numa().

This can happen if the task spends at least two ticks in change_prot_numa(),
and only gets two ticks of CPU time in the real time between two scan
intervals of the mm.

This patch ensures that a task never spends more than about 3% of its
run time scanning PTEs. It does that by ensuring that, in between
task_numa_work() runs, the task spends at least 32x as much time on
other things as it did on task_numa_work().
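
The arithmetic behind the 3% figure can be sketched as follows. This is
an illustration only, not kernel code: the helper names and the
basis-point unit are assumptions made for the example. If a scan costs
d and the task must then run at least 32 * d of other code, the scan
share is at most d / (d + 32 * d) = 1/33, or roughly 3%.

```c
#include <assert.h>
#include <stdint.h>

/* Ratio taken from the commit message; the macro name is hypothetical. */
#define SCAN_BACKOFF_FACTOR 32u

/* Minimum CPU time (ns) the task must spend elsewhere after a scan. */
static uint64_t min_other_runtime_ns(uint64_t scan_ns)
{
	return SCAN_BACKOFF_FACTOR * scan_ns;
}

/*
 * Worst-case share of run time spent scanning, in basis points
 * (1/100 of a percent), rounded down. Hypothetical helper.
 */
static unsigned int max_scan_share_bp(uint64_t scan_ns)
{
	uint64_t other_ns = min_other_runtime_ns(scan_ns);

	assert(scan_ns > 0);
	return (unsigned int)((scan_ns * 10000u) / (scan_ns + other_ns));
}
```

The bound does not depend on how long a scan takes, since d cancels out
of the ratio.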

This is done stochastically: if a timer tick happens, or the task
gets rescheduled during task_numa_work(), we delay a future run of
task_numa_work() until the task has spent at least 32x the amount of
CPU time doing something else, as it spent inside task_numa_work().
The longer task_numa_work() takes, the more likely it is that this happens.

If task_numa_work() takes very little time, chances are low that this
code will do anything, but in that case the overhead is negligible anyway.
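
A minimal userspace model of this back-off, under stated assumptions:
struct model_task, model_scan() and scan_due() are hypothetical names,
and the tick-side check (a scan becomes due once accumulated run time
exceeds node_stamp plus the scan period) is assumed rather than shown in
this patch. Only the snapshot-and-push-back logic mirrors the change
itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Fields stand in for p->se.sum_exec_runtime and p->node_stamp. */
struct model_task {
	uint64_t sum_exec_runtime; /* CPU time accounted so far, ns */
	uint64_t node_stamp;       /* run-time stamp gating the next scan, ns */
};

/* Assumed gate: a scan is due once run time passes stamp + period. */
static bool scan_due(const struct model_task *t, uint64_t period_ns)
{
	return t->sum_exec_runtime > t->node_stamp + period_ns;
}

/*
 * Model of one scan. charged_ns is the CPU time accounted to the task
 * while scanning; it stays 0 unless a tick or reschedule updated the
 * accounting during the scan, which is why the cap is stochastic.
 */
static void model_scan(struct model_task *t, uint64_t charged_ns)
{
	uint64_t runtime;

	assert(t);
	runtime = t->sum_exec_runtime;     /* snapshot on entry */
	t->sum_exec_runtime += charged_ns; /* accounting during the scan */

	if (t->sum_exec_runtime != runtime) {
		uint64_t diff = t->sum_exec_runtime - runtime;

		/* Delay the next scan by 32x the observed scan cost. */
		t->node_stamp += 32 * diff;
	}
}
```

In this model a scan that finishes without any accounting update leaves
node_stamp alone, while a scan charged 10 ns moves it forward by 320 ns,
so the next scan waits for that much additional run time on top of the
normal period.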

Reported-and-tested-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: mgorman@suse.de
Link: http://lkml.kernel.org/r/1446756983-28173-3-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 309b1d5..95b944e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2155,6 +2155,7 @@ void task_numa_work(struct callback_head *work)
 	unsigned long migrate, next_scan, now = jiffies;
 	struct task_struct *p = current;
 	struct mm_struct *mm = p->mm;
+	u64 runtime = p->se.sum_exec_runtime;
 	struct vm_area_struct *vma;
 	unsigned long start, end;
 	unsigned long nr_pte_updates = 0;
@@ -2277,6 +2278,17 @@ out:
 	else
 		reset_ptenuma_scan(p);
 	up_read(&mm->mmap_sem);
+
+	/*
+	 * Make sure tasks use at least 32x as much time to run other code
+	 * than they used here, to limit NUMA PTE scanning overhead to 3% max.
+	 * Usually update_task_scan_period slows down scanning enough; on an
+	 * overloaded system we need to limit overhead on a per task basis.
+	 */
+	if (unlikely(p->se.sum_exec_runtime != runtime)) {
+		u64 diff = p->se.sum_exec_runtime - runtime;
+		p->node_stamp += 32 * diff;
+	}
 }
 
 /*

Thread overview: 5+ messages
2015-11-05 20:56 [PATCH 0/2] sched,numa: cap pte scanning overhead to 3% of run time riel
2015-11-05 20:56 ` [PATCH 1/2] sched,numa: fix math underflow in task_tick_numa riel
2015-11-10  6:40   ` [tip:sched/urgent] sched/numa: Fix math underflow in task_tick_numa() tip-bot for Rik van Riel
2015-11-05 20:56 ` [PATCH 2/2] sched,numa: cap pte scanning overhead to 3% of run time riel
2015-11-23 16:19   ` tip-bot for Rik van Riel [this message]