From: Raghavendra K T <raghavendra.kt@amd.com>
To: <linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>
Cc: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	"Mel Gorman" <mgorman@suse.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	"David Hildenbrand" <david@redhat.com>, <rppt@kernel.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Bharata B Rao <bharata@amd.com>,
	Aithal Srikanth <sraithal@amd.com>,
	"kernel test robot" <oliver.sang@intel.com>,
	Raghavendra K T <raghavendra.kt@amd.com>,
	Sapkal Swapnil <Swapnil.Sapkal@amd.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>
Subject: [RFC PATCH V1 2/6] sched/numa: Add disjoint vma unconditional scan logic
Date: Tue, 29 Aug 2023 11:36:10 +0530
Message-ID: <87e3c08bd1770dd3e6eee099c01e595f14c76fc3.1693287931.git.raghavendra.kt@amd.com>
In-Reply-To: <cover.1693287931.git.raghavendra.kt@amd.com>

Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic"),
VMA scanning is allowed if:
1) The task has accessed the VMA.
 Rationale: Reduce overhead for tasks that have not touched the VMA,
 and filter out unnecessary scanning.

2) The VMA scan is in its early phase, where mm->numa_scan_seq is less than 2.
 Rationale: Learn the initial characteristics of VMAs and also
 prevent VMA scanning unfairness.

While that reduces scanning overhead most of the time,
there are some corner cases associated with it.
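
The two conditions above can be modelled in userspace C roughly as follows.
This is a minimal sketch, not the kernel code: struct vma_model, pid_bit(),
record_access() and scan_allowed() are hypothetical names, and the PID hash
is simplified to a modulo where the kernel uses hash_32().

```c
#include <stdbool.h>

#define BITS_PER_LONG_MODEL (8 * sizeof(unsigned long))

/* Hypothetical stand-in for the per-VMA NUMA balancing state. */
struct vma_model {
	unsigned long access_pids[2];	/* two windows of hashed accessor PIDs */
};

/* Simplified PID hash (assumption): the kernel uses hash_32(), not modulo. */
static unsigned long pid_bit(int pid)
{
	return 1UL << ((unsigned int)pid % BITS_PER_LONG_MODEL);
}

/* Record that the task with this PID touched the VMA. */
static void record_access(struct vma_model *vma, int pid)
{
	vma->access_pids[1] |= pid_bit(pid);
}

/* Scanning is allowed in the early phase, or if this task touched the VMA. */
static bool scan_allowed(const struct vma_model *vma, int pid, int numa_scan_seq)
{
	unsigned long pids = vma->access_pids[0] | vma->access_pids[1];

	if (numa_scan_seq < 2)
		return true;

	return (pids & pid_bit(pid)) != 0;
}
```

In this model, a VMA with no recorded accessors is scanned only while
numa_scan_seq is below 2, which is the gap the corner cases below explore.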

Problem statement (Disjoint VMA set):
======================================
Let's look at some of the corner cases using the example below of tasks and
their access patterns.

Consider N tasks (threads) of a process.
Set1 tasks access vma_x (a group of VMAs).
Set2 tasks access vma_y (a group of VMAs).

                Set1                          Set2
        --------------------         ----------------------
        | task_1..task_n/2 |         | task_n/2+1..task_n |
        --------------------         ----------------------
                 |                              |
                 V                              V
        --------------------         ----------------------
        |      vma_x       |         |       vma_y        |
        --------------------         ----------------------

Corner cases:
(a) Out of N tasks, not all of them get a fair opportunity to scan (PeterZ).
Suppose Set1 tasks get more opportunities to scan in the above example
(maybe because of the activity pattern of the tasks or for other reasons in
the current design); then vma_x gets scanned more often than vma_y.

An experiment that illustrates this unfairness is described here:
Link: https://lore.kernel.org/lkml/c730dee0-a711-8a8e-3eb1-1bfdd21e6add@amd.com/

(b) Sizes of VMAs can differ.
Suppose the size of vma_y is far greater than the size of vma_x; then a bigger
portion of vma_y can potentially be left unscanned, since scanning is bounded
by a scan_size of 256MB (default) for each iteration.

(c) Highly active threads trap a few VMAs frequently, and some VMAs that are
not accessed for a long time can potentially get starved of scanning
indefinitely (Mel). Such VMAs may then lack enough hints/details if they are
needed later for migration.

(d) Allocation of memory in some specific manner (Mel).
One example: suppose a main thread allocates memory and is then not
active. When other threads try to act upon that memory, they may not have
many hints about it if the corresponding VMA was not scanned.

(e) VMAs that are created after two full scans of the mm
(mm->numa_scan_seq >= 2) will never get scanned. (Observed rarely, but very
much possible depending on workload behaviour.)

On top of this, a combination of some of the cases above (e.g., (a) and (b))
can potentially amplify/worsen the side effects.
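
To put rough numbers on corner case (b), the helper below estimates how many
scan iterations are needed to cover one VMA. It is a simplified sketch under
the assumption that each iteration covers at most scan_size bytes of that VMA
and nothing is skipped; scans_to_cover() and SCAN_SIZE_BYTES are hypothetical
names, and only the 256MB default comes from the text above.

```c
#include <stdint.h>

#define SCAN_SIZE_BYTES (256ULL << 20)	/* default scan_size from the text: 256MB */

/* Iterations needed to cover vma_bytes when each covers at most scan_bytes. */
static uint64_t scans_to_cover(uint64_t vma_bytes, uint64_t scan_bytes)
{
	if (scan_bytes == 0)
		return 0;

	/* Round up so that a partial tail still counts as one iteration. */
	return (vma_bytes + scan_bytes - 1) / scan_bytes;
}
```

Under this model a 1GB VMA needs 4 iterations and a 64GB VMA needs 256, so a
large vma_y that gets fewer scan opportunities than vma_x falls behind quickly.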

This patch tries to address the above issues by enhancing the unconditional
VMA scanning logic.

High-level idea:
Depending on the VMA size, populate a per-VMA vma_scan_select value, decrement
it, and force a scan when it hits zero (Mel).
The vma_scan_select value is repopulated when it hits zero.
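
The countdown described above can be sketched in userspace C as follows. This
is only a model under stated assumptions: struct vma_sel_model,
disjoint_vma_select() and count_forced_scans() are hypothetical names, and the
reload value is passed in directly instead of being derived from the VMA size.

```c
#include <stdbool.h>

/* Hypothetical per-VMA state: only the countdown counter from the idea above. */
struct vma_sel_model {
	unsigned long vma_scan_select;
};

/*
 * Returns true when the VMA should be force-scanned. While the counter is
 * non-zero it is decremented and the scan is skipped; once it is zero the
 * scan is forced and the counter is reloaded with `reload`.
 */
static bool disjoint_vma_select(struct vma_sel_model *vma, unsigned long reload)
{
	if (vma->vma_scan_select > 0) {
		vma->vma_scan_select--;
		return false;
	}

	vma->vma_scan_select = reload;
	return true;
}

/* Count forced scans over `rounds` consecutive scan opportunities. */
static unsigned int count_forced_scans(unsigned long reload, unsigned int rounds)
{
	struct vma_sel_model vma = { .vma_scan_select = 0 };
	unsigned int i, forced = 0;

	for (i = 0; i < rounds; i++)
		if (disjoint_vma_select(&vma, reload))
			forced++;

	return forced;
}
```

With a reload value of r, this model forces one scan every r + 1
opportunities; for example, count_forced_scans(1, 10) yields 5.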

Results:
========
Base: 6.5.0-rc6+ (4853c74bd7ab)
SUT: Milan w/ 2 NUMA nodes, 256 CPUs

mmtest numa01_THREAD_ALLOC manual run:

                        base            patched
real                    1m22.758s       1m9.200s
user                    249m49.540s     229m30.039s
sys                     0m25.040s       3m10.451s

numa_pte_updates        6985            1573363
numa_hint_faults        2705            1022623
numa_hint_faults_local  2279            389633
numa_pages_migrated     426             632990
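
For reading the figures above, the two small helpers below derive relative
numbers from them. They are an illustrative sketch only: percent_reduction()
and local_fault_share() are hypothetical names, and the derived percentages
are not stated in the original message.

```c
/* Percentage reduction going from base to patched (positive means lower). */
static double percent_reduction(double base, double patched)
{
	if (base == 0.0)
		return 0.0;

	return 100.0 * (base - patched) / base;
}

/* Share of NUMA hinting faults that were node-local, as a percentage. */
static double local_fault_share(double local_faults, double total_faults)
{
	if (total_faults == 0.0)
		return 0.0;

	return 100.0 * local_faults / total_faults;
}
```

For example, percent_reduction(82.758, 69.200) gives roughly 16.4 for the
real time, while local_fault_share(2279, 2705) and
local_fault_share(389633, 1022623) give roughly 84 and 38 for the base and
patched runs respectively.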

Reported-by: Aithal Srikanth <sraithal@amd.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/mm_types.h |  1 +
 kernel/sched/fair.c      | 39 +++++++++++++++++++++++++++++++++++++--
 2 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5e74ce4a28cd..647d9fc5da8d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -479,6 +479,7 @@ struct vma_numab_state {
 	unsigned long next_scan;
 	unsigned long next_pid_reset;
 	unsigned long access_pids[2];
+	unsigned long vma_scan_select;
 };
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2f2e1568c1d4..23375c10f36e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2928,6 +2928,36 @@ static void reset_ptenuma_scan(struct task_struct *p)
 	p->mm->numa_scan_offset = 0;
 }
 
+#define VMA_4M	(1U << 22)
+#define VMA_RATELIMIT_SCALEDOWN_F	7
+
+static inline unsigned int vma_scan_ratelimit(struct vm_area_struct *vma)
+{
+	unsigned long vma_size, ratelimit = 0;
+
+	/*
+	 * Rate limit the scanning of VMA based on the size.
+	 * vma_size > 4M allow 1 in 2 times.
+	 * vma_size = 4k allow 1 in 9 times.
+	 * 4k < vma_size < 4M scale between 2 and 9
+	 */
+	vma_size = (vma->vm_end - vma->vm_start);
+	if (vma_size)
+		ratelimit  = (VMA_4M / vma_size) >> VMA_RATELIMIT_SCALEDOWN_F;
+	return 1 + ratelimit;
+}
+
+static bool task_disjoint_vma_select(struct vm_area_struct *vma)
+{
+	if (vma->numab_state->vma_scan_select > 0) {
+		vma->numab_state->vma_scan_select--;
+		return false;
+	} else
+		vma->numab_state->vma_scan_select = vma_scan_ratelimit(vma);
+
+	return true;
+}
+
 static bool vma_is_accessed(struct vm_area_struct *vma)
 {
 	unsigned long pids;
@@ -3058,6 +3088,8 @@ static void task_numa_work(struct callback_head *work)
 			/* Reset happens after 4 times scan delay of scan start */
 			vma->numab_state->next_pid_reset =  vma->numab_state->next_scan +
 				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
+
+			vma->numab_state->vma_scan_select = 0;
 		}
 
 		/*
@@ -3077,8 +3109,11 @@ static void task_numa_work(struct callback_head *work)
 			vma->numab_state->access_pids[1] = 0;
 		}
 
-		/* Do not scan the VMA if task has not accessed */
-		if (!vma_is_accessed(vma))
+		/*
+		 * Do not scan the VMA if the task has not accessed it OR it is
+		 * still an unlucky disjoint VMA.
+		 */
+		if (!(vma_is_accessed(vma) || task_disjoint_vma_select(vma)))
 			continue;
 
 		do {
-- 
2.34.1
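
For reference, the size-to-ratelimit formula from vma_scan_ratelimit() in the
patch above can be mirrored in userspace to see which values it produces. This
is a sketch under assumptions: model_scan_ratelimit(), MODEL_VMA_4M and
MODEL_SCALEDOWN_F are hypothetical names, and the size is taken as an unsigned
long byte count.

```c
#define MODEL_VMA_4M		(1UL << 22)
#define MODEL_SCALEDOWN_F	7

/* Userspace mirror of the patch's formula, taking the VMA size in bytes. */
static unsigned long model_scan_ratelimit(unsigned long vma_size)
{
	unsigned long ratelimit = 0;

	/* Smaller VMAs get a larger value; a zero size falls back to 1. */
	if (vma_size)
		ratelimit = (MODEL_VMA_4M / vma_size) >> MODEL_SCALEDOWN_F;

	return 1 + ratelimit;
}
```

As written, the formula returns 9 for a 4k VMA, 5 for 8k, 2 for 32k, and 1
for any VMA larger than 32k (including 4M and above), because
(4M / size) >> 7 is zero once the size exceeds 32k.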

