* [PATCH V3 0/4] sched/numa: Enhance vma scanning
@ 2023-02-28  4:50 Raghavendra K T
  2023-02-28  4:50 ` [PATCH V3 1/4] sched/numa: Apply the scan delay to every new vma Raghavendra K T
                   ` (4 more replies)
  0 siblings, 5 replies; 8+ messages in thread
From: Raghavendra K T @ 2023-02-28  4:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja,
	Raghavendra K T

The patchset proposes one of the enhancements to NUMA vma scanning
suggested by Mel. This is a continuation of [3].

In the existing mechanism, the scan period is derived from per-thread
stats. Process Adaptive autoNUMA [1] proposed gathering NUMA fault stats
at the per-process level to better capture application behaviour.

During the course of that discussion, Mel proposed several ideas to enhance
the current NUMA balancing. One of the suggestions was the following:

Track which threads access a VMA. The suggestion was to use an unsigned
long pid_mask and use the lower bits to tag approximately which
threads access a VMA. Skip VMAs that did not trap a fault. This would
be approximate because of PID collisions, but it would reduce scanning of
areas the thread is not interested in. The intent of the suggestion is not
to penalize threads that have no interest in the vma, and thus to reduce
the scanning overhead.

V3 changes are mostly based on PeterZ's comments (detailed below under
"Changes since V2").

Summary of patchset:
The current patchset implements the following (a simplified sketch of the
combined mechanism is shown after this list):

1. Delay vma scanning for newly created VMAs, so that the additional
scanning overhead is not incurred for short-lived tasks
(implementation by Mel).

2. Store the information about tasks accessing a VMA in 2 windows, which
are cleared regularly at a (4 * sysctl_numa_balancing_scan_delay) interval
(with the default 1000 ms scan delay, roughly a 4 second window). This
interval was derived from experiments (suggested by PeterZ) to balance
frequent clearing against relying on obsolete access data.

3. hash_32() is used to encode which task accessed the VMA in the
per-VMA access information.

4. The VMA's access information is then used to skip scanning by tasks
that have not accessed the VMA.
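
A simplified sketch of the combined mechanism (illustrative only: the helper
names below are hypothetical, and the real implementation, with its guards
and allocation paths, is in the patches that follow):

	/* Illustrative sketch only -- not the actual patch code. */

	/* fault path: remember that the current task touched this VMA */
	static inline void vma_mark_accessed(struct vm_area_struct *vma)
	{
		unsigned int pid_bit = hash_32(current->pid, ilog2(BITS_PER_LONG));

		__set_bit(pid_bit, &vma->numab_state->access_pids[1]);
	}

	/* scan path: skip a VMA the current task never faulted on */
	static inline bool vma_was_accessed(struct vm_area_struct *vma)
	{
		unsigned long pids = vma->numab_state->access_pids[0] |
				     vma->numab_state->access_pids[1];

		return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids);
	}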

Things to ponder over:
==========================================
- Improvement to the logic for clearing accessing PIDs (discussed in detail
  in patch 3 itself; addressed in this patchset by implementing a 2-window
  history).

- The scan period itself is not changed in this patchset, so we still see
  frequent attempts to scan. Relaxing the scan period dynamically could
  improve results further.

[1] sched/numa: Process Adaptive autoNUMA 
 Link: https://lore.kernel.org/lkml/20220128052851.17162-1-bharata@amd.com/T/

[2] RFC V1 Link: 
  https://lore.kernel.org/all/cover.1673610485.git.raghavendra.kt@amd.com/

[3] V2 Link:
  https://lore.kernel.org/lkml/cover.1675159422.git.raghavendra.kt@amd.com/

Changes since V2:
Patch1:
 - Renaming of structure, macro to function
 - Add explanation of the heuristics
 - Add more details from results (PeterZ)
Patch2:
 - Usage of test and set bit (PeterZ)
 - Move storing of access PID info to numa_migrate_prep()
 - Add a note on fairness among tasks allowed to scan (PeterZ)
Patch3:
 - Maintain two windows of access PID information
   (PeterZ supported the implementation and gave the idea to extend
    to N windows if needed)
Patch4:
 - Apply hash_32 function to track VMA-accessing PIDs (PeterZ)

Changes since RFC V1:
 - Include Mel's vma scan delay patch
 - Change the accessing PID store logic (Thanks Mel)
 - Fence structure / code with NUMA_BALANCING (David, Mel)
 - Add the clearing of access PID logic (Mel)
 - More descriptive change log (Mike Rapoport)

Results:
Summary: A huge AutoNUMA cost reduction is seen in mmtests. Kernbench and
dbench improve by around 5%, and system time improves by a large margin
(80%+) in the mmtests autonumabench run.

kernbench
=============
                                    6.1.0-base                 6.1.0-patched
Amean     user-256    22437.65 (   0.00%)    22622.16 *  -0.82%*
Amean     syst-256     9290.30 (   0.00%)     8763.85 *   5.67%*
Amean     elsp-256      159.36 (   0.00%)      157.44 *   1.20%*

Duration User       67322.16    67876.18
Duration System     27884.89    26306.28
Duration Elapsed      498.95      494.42

Ops NUMA alloc hit                1738904367.00  1738882062.00
Ops NUMA alloc local              1738904104.00  1738881490.00
Ops NUMA base-page range updates      440526.00      272095.00
Ops NUMA PTE updates                  440526.00      272095.00
Ops NUMA hint faults                  109109.00       55630.00
Ops NUMA hint local faults %            5474.00         196.00
Ops NUMA hint local percent                5.02           0.35
Ops NUMA pages migrated               103400.00       55434.00
Ops AutoNUMA cost                        550.59         281.11

autonumabench
===============
                                    6.1.0-base                 6.1.0-patched
Amean     syst-NUMA01                  252.55 (   0.00%)       27.71 *  89.03%*
Amean     syst-NUMA01_THREADLOCAL        0.20 (   0.00%)        0.23 * -12.77%*
Amean     syst-NUMA02                    0.91 (   0.00%)        0.76 *  16.22%*
Amean     syst-NUMA02_SMT                0.67 (   0.00%)        0.67 *  -1.07%*
Amean     elsp-NUMA01                  269.93 (   0.00%)      309.44 * -14.64%*
Amean     elsp-NUMA01_THREADLOCAL        1.05 (   0.00%)        1.07 *  -1.36%*
Amean     elsp-NUMA02                    3.26 (   0.00%)        3.29 *  -0.79%*
Amean     elsp-NUMA02_SMT                3.73 (   0.00%)        3.52 *   5.64%*

Duration User      318683.69   330084.06
Duration System      1780.77      206.14
Duration Elapsed     1954.30     2233.06


Ops NUMA alloc hit                  62237331.00    49179090.00
Ops NUMA alloc local                62235222.00    49177092.00
Ops NUMA base-page range updates    85303091.00       29242.00
Ops NUMA PTE updates                85303091.00       29242.00
Ops NUMA hint faults                87457481.00        8302.00
Ops NUMA hint local faults %        66665145.00        6064.00
Ops NUMA hint local percent               76.23          73.04
Ops NUMA pages migrated              9348511.00        2232.00
Ops AutoNUMA cost                     438062.15          41.76

dbench
========
dbench -t 90 <nproc>

Throughput
#clients            base                patched             %improvement
1                   842.655 MB/sec      922.305 MB/sec      9.45
16                  5062.82 MB/sec      5079.85 MB/sec      0.34
64                  9408.81 MB/sec      9980.89 MB/sec      6.08
256                 7076.59 MB/sec      7590.76 MB/sec      7.26

Mel Gorman (1):
  sched/numa: Apply the scan delay to every new vma

Raghavendra K T (3):
  sched/numa: Enhance vma scanning logic
  sched/numa: implement access PID reset logic
  sched/numa: Use hash_32 to mix up PIDs accessing VMA

 include/linux/mm.h       | 30 +++++++++++++++++++++
 include/linux/mm_types.h |  9 +++++++
 kernel/fork.c            |  2 ++
 kernel/sched/fair.c      | 57 ++++++++++++++++++++++++++++++++++++++++
 mm/memory.c              |  3 +++
 5 files changed, 101 insertions(+)

-- 
2.34.1



* [PATCH V3 1/4] sched/numa: Apply the scan delay to every new vma
  2023-02-28  4:50 [PATCH V3 0/4] sched/numa: Enhance vma scanning Raghavendra K T
@ 2023-02-28  4:50 ` Raghavendra K T
  2023-02-28  4:50 ` [PATCH V3 2/4] sched/numa: Enhance vma scanning logic Raghavendra K T
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Raghavendra K T @ 2023-02-28  4:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja,
	Mel Gorman, Raghavendra K T

From: Mel Gorman <mgorman@techsingularity.net>

Currently, whenever a new task is created, we wait for
sysctl_numa_balancing_scan_delay to avoid unnecessary scanning
overhead. Extend the same logic to new or very short-lived VMAs.

(Raghavendra: Add initialization in vm_area_dup())

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/mm.h       | 16 ++++++++++++++++
 include/linux/mm_types.h |  7 +++++++
 kernel/fork.c            |  2 ++
 kernel/sched/fair.c      | 19 +++++++++++++++++++
 4 files changed, 44 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 974ccca609d2..41cc8997d4e5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -29,6 +29,7 @@
 #include <linux/pgtable.h>
 #include <linux/kasan.h>
 #include <linux/memremap.h>
+#include <linux/slab.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -611,6 +612,20 @@ struct vm_operations_struct {
 					  unsigned long addr);
 };
 
+#ifdef CONFIG_NUMA_BALANCING
+static inline void vma_numab_state_init(struct vm_area_struct *vma)
+{
+	vma->numab_state = NULL;
+}
+static inline void vma_numab_state_free(struct vm_area_struct *vma)
+{
+	kfree(vma->numab_state);
+}
+#else
+static inline void vma_numab_state_init(struct vm_area_struct *vma) {}
+static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
+#endif /* CONFIG_NUMA_BALANCING */
+
 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 {
 	static const struct vm_operations_struct dummy_vm_ops = {};
@@ -619,6 +634,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_mm = mm;
 	vma->vm_ops = &dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
+	vma_numab_state_init(vma);
 }
 
 static inline void vma_set_anonymous(struct vm_area_struct *vma)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 500e536796ca..a4a1093870d3 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -435,6 +435,10 @@ struct anon_vma_name {
 	char name[];
 };
 
+struct vma_numab_state {
+	unsigned long next_scan;
+};
+
 /*
  * This struct describes a virtual memory area. There is one of these
  * per VM-area/task. A VM area is any part of the process virtual memory
@@ -504,6 +508,9 @@ struct vm_area_struct {
 #endif
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
+#endif
+#ifdef CONFIG_NUMA_BALANCING
+	struct vma_numab_state *numab_state;	/* NUMA Balancing state */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 } __randomize_layout;
diff --git a/kernel/fork.c b/kernel/fork.c
index 08969f5aa38d..6c19a3305990 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -474,6 +474,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 		 */
 		*new = data_race(*orig);
 		INIT_LIST_HEAD(&new->anon_vma_chain);
+		vma_numab_state_init(new);
 		dup_anon_vma_name(orig, new);
 	}
 	return new;
@@ -481,6 +482,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 
 void vm_area_free(struct vm_area_struct *vma)
 {
+	vma_numab_state_free(vma);
 	free_anon_vma_name(vma);
 	kmem_cache_free(vm_area_cachep, vma);
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e4a0b8bd941c..e39c36e71cec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3015,6 +3015,25 @@ static void task_numa_work(struct callback_head *work)
 		if (!vma_is_accessible(vma))
 			continue;
 
+		/* Initialise new per-VMA NUMAB state. */
+		if (!vma->numab_state) {
+			vma->numab_state = kzalloc(sizeof(struct vma_numab_state),
+				GFP_KERNEL);
+			if (!vma->numab_state)
+				continue;
+
+			vma->numab_state->next_scan = now +
+				msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
+		}
+
+		/*
+		 * Scanning the VMAs of short-lived tasks adds more overhead. So
+		 * delay the scan for new VMAs.
+		 */
+		if (mm->numa_scan_seq && time_before(jiffies,
+						vma->numab_state->next_scan))
+			continue;
+
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
-- 
2.34.1



* [PATCH V3 2/4] sched/numa: Enhance vma scanning logic
  2023-02-28  4:50 [PATCH V3 0/4] sched/numa: Enhance vma scanning Raghavendra K T
  2023-02-28  4:50 ` [PATCH V3 1/4] sched/numa: Apply the scan delay to every new vma Raghavendra K T
@ 2023-02-28  4:50 ` Raghavendra K T
  2023-02-28  4:50 ` [PATCH V3 3/4] sched/numa: implement access PID reset logic Raghavendra K T
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Raghavendra K T @ 2023-02-28  4:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja,
	Raghavendra K T

During NUMA scanning, make sure only the relevant VMAs of a
task are scanned.

Before:
 All the tasks of a process participate in scanning every vma,
even if they never access that vma in their lifespan.

Now:
 Apart from the first few unconditional scans, if a task does
not touch a vma (excluding false-positive cases of PID collisions),
it no longer scans that vma.

Logic used:
1) 6 bits of the PID are used to mark an active bit in the vma numab
 state during a fault, to remember the PIDs accessing the vma. (Thanks Mel)

2) Subsequently, in the scan path, scanning of a vma is skipped if the
current PID has not accessed it.

3) The first two scans are still allowed unconditionally, to preserve the
 earlier scanning behaviour.

Acknowledgements to Bharata B Rao <bharata@amd.com> for the initial patch
to store PID information, and to Peter Zijlstra <peterz@infradead.org>
for the usage of test and set bit.

Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/mm.h       | 14 ++++++++++++++
 include/linux/mm_types.h |  1 +
 kernel/sched/fair.c      | 19 +++++++++++++++++++
 mm/memory.c              |  3 +++
 4 files changed, 37 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 41cc8997d4e5..097680aaca1e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1388,6 +1388,16 @@ static inline int xchg_page_access_time(struct page *page, int time)
 	last_time = page_cpupid_xchg_last(page, time >> PAGE_ACCESS_TIME_BUCKETS);
 	return last_time << PAGE_ACCESS_TIME_BUCKETS;
 }
+
+static inline void vma_set_access_pid_bit(struct vm_area_struct *vma)
+{
+	unsigned int pid_bit;
+
+	pid_bit = current->pid % BITS_PER_LONG;
+	if (vma->numab_state && !test_bit(pid_bit, &vma->numab_state->access_pids)) {
+		__set_bit(pid_bit, &vma->numab_state->access_pids);
+	}
+}
 #else /* !CONFIG_NUMA_BALANCING */
 static inline int page_cpupid_xchg_last(struct page *page, int cpupid)
 {
@@ -1437,6 +1447,10 @@ static inline bool cpupid_match_pid(struct task_struct *task, int cpupid)
 {
 	return false;
 }
+
+static inline void vma_set_access_pid_bit(struct vm_area_struct *vma)
+{
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #if defined(CONFIG_KASAN_SW_TAGS) || defined(CONFIG_KASAN_HW_TAGS)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a4a1093870d3..582523e73546 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -437,6 +437,7 @@ struct anon_vma_name {
 
 struct vma_numab_state {
 	unsigned long next_scan;
+	unsigned long access_pids;
 };
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e39c36e71cec..05490cb2d5c6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2916,6 +2916,21 @@ static void reset_ptenuma_scan(struct task_struct *p)
 	p->mm->numa_scan_offset = 0;
 }
 
+static bool vma_is_accessed(struct vm_area_struct *vma)
+{
+	/*
+	 * Allow unconditional access first two times, so that all the (pages)
+	 * of VMAs get prot_none fault introduced irrespective of accesses.
+	 * This is also done to avoid any side effect of task scanning
+	 * amplifying the unfairness of disjoint set of VMAs' access.
+	 */
+	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
+		return true;
+
+	return test_bit(current->pid % BITS_PER_LONG,
+				&vma->numab_state->access_pids);
+}
+
 /*
  * The expensive part of numa migration is done from task_work context.
  * Triggered from task_tick_numa().
@@ -3034,6 +3049,10 @@ static void task_numa_work(struct callback_head *work)
 						vma->numab_state->next_scan))
 			continue;
 
+		/* Do not scan the VMA if task has not accessed */
+		if (!vma_is_accessed(vma))
+			continue;
+
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
diff --git a/mm/memory.c b/mm/memory.c
index 8c8420934d60..150c03a3419c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4698,6 +4698,9 @@ int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
 {
 	get_page(page);
 
+	/* Record the current PID accessing the VMA */
+	vma_set_access_pid_bit(vma);
+
 	count_vm_numa_event(NUMA_HINT_FAULTS);
 	if (page_nid == numa_node_id()) {
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
-- 
2.34.1



* [PATCH V3 3/4] sched/numa: implement access PID reset logic
  2023-02-28  4:50 [PATCH V3 0/4] sched/numa: Enhance vma scanning Raghavendra K T
  2023-02-28  4:50 ` [PATCH V3 1/4] sched/numa: Apply the scan delay to every new vma Raghavendra K T
  2023-02-28  4:50 ` [PATCH V3 2/4] sched/numa: Enhance vma scanning logic Raghavendra K T
@ 2023-02-28  4:50 ` Raghavendra K T
  2023-02-28  4:50 ` [PATCH V3 4/4] sched/numa: Use hash_32 to mix up PIDs accessing VMA Raghavendra K T
  2023-02-28 21:24 ` [PATCH V3 0/4] sched/numa: Enhance vma scanning Andrew Morton
  4 siblings, 0 replies; 8+ messages in thread
From: Raghavendra K T @ 2023-02-28  4:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja,
	Raghavendra K T

This helps ensure that only tasks whose PIDs recently accessed the
VMA scan it.
Current implementation: (idea supported by PeterZ)
 1. Access PID information is maintained in two windows, with
access_pids[1] being the newest.

 2. Old access PID info, i.e. access_pids[0], is reset every
(4 * sysctl_numa_balancing_scan_delay) interval after the initial
scan delay period expires.

The above interval was found experimentally to be a good balance, since
it avoids resetting the access info too frequently while still clearing
the old access info regularly.
The reset logic is implemented in the scan path.
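A sketch of the rotation done in the scan path (this mirrors the hunk in
the diff below):

	/* rotate the windows: the newest access info becomes the old window */
	vma->numab_state->access_pids[0] = READ_ONCE(vma->numab_state->access_pids[1]);
	vma->numab_state->access_pids[1] = 0;

	/* a task is considered to have accessed the VMA if it is in either window */
	pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];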

Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/mm.h       |  4 ++--
 include/linux/mm_types.h |  3 ++-
 kernel/sched/fair.c      | 23 +++++++++++++++++++++--
 3 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 097680aaca1e..bd07289fc68e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1394,8 +1394,8 @@ static inline void vma_set_access_pid_bit(struct vm_area_struct *vma)
 	unsigned int pid_bit;
 
 	pid_bit = current->pid % BITS_PER_LONG;
-	if (vma->numab_state && !test_bit(pid_bit, &vma->numab_state->access_pids)) {
-		__set_bit(pid_bit, &vma->numab_state->access_pids);
+	if (vma->numab_state && !test_bit(pid_bit, &vma->numab_state->access_pids[1])) {
+		__set_bit(pid_bit, &vma->numab_state->access_pids[1]);
 	}
 }
 #else /* !CONFIG_NUMA_BALANCING */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 582523e73546..1f1f8bfeae36 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -437,7 +437,8 @@ struct anon_vma_name {
 
 struct vma_numab_state {
 	unsigned long next_scan;
-	unsigned long access_pids;
+	unsigned long next_pid_reset;
+	unsigned long access_pids[2];
 };
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 05490cb2d5c6..f76d5ecaf345 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2918,6 +2918,7 @@ static void reset_ptenuma_scan(struct task_struct *p)
 
 static bool vma_is_accessed(struct vm_area_struct *vma)
 {
+	unsigned long pids;
 	/*
 	 * Allow unconditional access first two times, so that all the (pages)
 	 * of VMAs get prot_none fault introduced irrespective of accesses.
@@ -2927,10 +2928,12 @@ static bool vma_is_accessed(struct vm_area_struct *vma)
 	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
 		return true;
 
-	return test_bit(current->pid % BITS_PER_LONG,
-				&vma->numab_state->access_pids);
+	pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
+	return test_bit(current->pid % BITS_PER_LONG, &pids);
 }
 
+#define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
+
 /*
  * The expensive part of numa migration is done from task_work context.
  * Triggered from task_tick_numa().
@@ -3039,6 +3042,10 @@ static void task_numa_work(struct callback_head *work)
 
 			vma->numab_state->next_scan = now +
 				msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
+
+			/* Reset happens 4 times the scan delay after the scan start */
+			vma->numab_state->next_pid_reset =  vma->numab_state->next_scan +
+				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
 		}
 
 		/*
@@ -3053,6 +3060,18 @@ static void task_numa_work(struct callback_head *work)
 		if (!vma_is_accessed(vma))
 			continue;
 
+		/*
+		 * Reset access PIDs regularly for old VMAs. Resetting after checking
+		 * the vma for recent access avoids clearing the PID info before that check.
+		 */
+		if (mm->numa_scan_seq &&
+				time_after(jiffies, vma->numab_state->next_pid_reset)) {
+			vma->numab_state->next_pid_reset = vma->numab_state->next_pid_reset +
+				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
+			vma->numab_state->access_pids[0] = READ_ONCE(vma->numab_state->access_pids[1]);
+			vma->numab_state->access_pids[1] = 0;
+		}
+
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
-- 
2.34.1



* [PATCH V3 4/4] sched/numa: Use hash_32 to mix up PIDs accessing VMA
  2023-02-28  4:50 [PATCH V3 0/4] sched/numa: Enhance vma scanning Raghavendra K T
                   ` (2 preceding siblings ...)
  2023-02-28  4:50 ` [PATCH V3 3/4] sched/numa: implement access PID reset logic Raghavendra K T
@ 2023-02-28  4:50 ` Raghavendra K T
  2023-02-28 21:24 ` [PATCH V3 0/4] sched/numa: Enhance vma scanning Andrew Morton
  4 siblings, 0 replies; 8+ messages in thread
From: Raghavendra K T @ 2023-02-28  4:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja,
	Raghavendra K T

Before: the last 6 bits of the PID are used as an index to store
information about the tasks accessing a VMA.

After: hash_32() is used to take care of cases where tasks are
created over a period of time, and thus reduce the collision
probability.
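
As an illustration (not part of the patch): on a 64-bit kernel, two PIDs
that differ by a multiple of BITS_PER_LONG collide under the old modulo
scheme, whereas hash_32() mixes the whole PID value first and will usually
keep them apart:

	/* old scheme: PIDs 1100 and 1164 both land on bit 12 (1100 % 64 == 1164 % 64) */
	pid_bit = 1100 % BITS_PER_LONG;
	pid_bit = 1164 % BITS_PER_LONG;

	/* new scheme: hash the full PID, then reduce to ilog2(BITS_PER_LONG) == 6 bits */
	pid_bit = hash_32(1100, ilog2(BITS_PER_LONG));
	pid_bit = hash_32(1164, ilog2(BITS_PER_LONG));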

Result:
The patch series overall improves the AutoNUMA cost by a huge
margin.
Kernbench and dbench showed around a 5% improvement, and
system time in the mmtests autonuma run showed an 80% improvement.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/mm.h  | 2 +-
 kernel/sched/fair.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bd07289fc68e..8493697d1dce 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1393,7 +1393,7 @@ static inline void vma_set_access_pid_bit(struct vm_area_struct *vma)
 {
 	unsigned int pid_bit;
 
-	pid_bit = current->pid % BITS_PER_LONG;
+	pid_bit = hash_32(current->pid, ilog2(BITS_PER_LONG));
 	if (vma->numab_state && !test_bit(pid_bit, &vma->numab_state->access_pids[1])) {
 		__set_bit(pid_bit, &vma->numab_state->access_pids[1]);
 	}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f76d5ecaf345..46fd9b372e4c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2929,7 +2929,7 @@ static bool vma_is_accessed(struct vm_area_struct *vma)
 		return true;
 
 	pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
-	return test_bit(current->pid % BITS_PER_LONG, &pids);
+	return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids);
 }
 
 #define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
-- 
2.34.1



* Re: [PATCH V3 0/4] sched/numa: Enhance vma scanning
  2023-02-28  4:50 [PATCH V3 0/4] sched/numa: Enhance vma scanning Raghavendra K T
                   ` (3 preceding siblings ...)
  2023-02-28  4:50 ` [PATCH V3 4/4] sched/numa: Use hash_32 to mix up PIDs accessing VMA Raghavendra K T
@ 2023-02-28 21:24 ` Andrew Morton
  2023-03-01  4:16   ` Raghavendra K T
  4 siblings, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2023-02-28 21:24 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: linux-kernel, linux-mm, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja

On Tue, 28 Feb 2023 10:20:18 +0530 Raghavendra K T <raghavendra.kt@amd.com> wrote:

>  The patchset proposes one of the enhancements to numa vma scanning
> suggested by Mel. This is continuation of [3]. 
> 
> ...
> 
>  include/linux/mm.h       | 30 +++++++++++++++++++++
>  include/linux/mm_types.h |  9 +++++++
>  kernel/fork.c            |  2 ++
>  kernel/sched/fair.c      | 57 ++++++++++++++++++++++++++++++++++++++++
>  mm/memory.c              |  3 +++

It's unclear (to me) which tree would normally carry these.

But there are significant textual conflicts with the "Per-VMA locks"
patchset, and there might be functional issues as well.  So mm.git
would be the better choice.

Please can you redo and retest against tomorrow's mm-unstable branch
(git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm)?  Hopefully the
sched developers can take a look and provide feedback.

Thanks.


* Re: [PATCH V3 0/4] sched/numa: Enhance vma scanning
  2023-02-28 21:24 ` [PATCH V3 0/4] sched/numa: Enhance vma scanning Andrew Morton
@ 2023-03-01  4:16   ` Raghavendra K T
  2023-03-01 12:32     ` Raghavendra K T
  0 siblings, 1 reply; 8+ messages in thread
From: Raghavendra K T @ 2023-03-01  4:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja

On 3/1/2023 2:54 AM, Andrew Morton wrote:
> On Tue, 28 Feb 2023 10:20:18 +0530 Raghavendra K T <raghavendra.kt@amd.com> wrote:
> 
>>   The patchset proposes one of the enhancements to numa vma scanning
>> suggested by Mel. This is continuation of [3].
>>
>> ...
>>
>>   include/linux/mm.h       | 30 +++++++++++++++++++++
>>   include/linux/mm_types.h |  9 +++++++
>>   kernel/fork.c            |  2 ++
>>   kernel/sched/fair.c      | 57 ++++++++++++++++++++++++++++++++++++++++
>>   mm/memory.c              |  3 +++
> 
> It's unclear (to me) which tree would normally carry these.
> 
> But there are significant textual conflicts with the "Per-VMA locks"
> patchset, and there might be functional issues as well.  So mm.git
> would be the better choice.
> 
> Please can you redo and retest against tomorrow's mm-unstable branch
> (git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm)?  Hopefully the
> sched developers can take a look and provide feedback.
> 

Thank you, Andrew. Sure, will do that.



* Re: [PATCH V3 0/4] sched/numa: Enhance vma scanning
  2023-03-01  4:16   ` Raghavendra K T
@ 2023-03-01 12:32     ` Raghavendra K T
  0 siblings, 0 replies; 8+ messages in thread
From: Raghavendra K T @ 2023-03-01 12:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja

On 3/1/2023 9:46 AM, Raghavendra K T wrote:
> On 3/1/2023 2:54 AM, Andrew Morton wrote:
>> On Tue, 28 Feb 2023 10:20:18 +0530 Raghavendra K T 
>> <raghavendra.kt@amd.com> wrote:
>>
>>>   The patchset proposes one of the enhancements to numa vma scanning
>>> suggested by Mel. This is continuation of [3].
>>>
>>> ...
>>>
>>>   include/linux/mm.h       | 30 +++++++++++++++++++++
>>>   include/linux/mm_types.h |  9 +++++++
>>>   kernel/fork.c            |  2 ++
>>>   kernel/sched/fair.c      | 57 ++++++++++++++++++++++++++++++++++++++++
>>>   mm/memory.c              |  3 +++
>>
>> It's unclear (to me) which tree would normally carry these.
>>
>> But there are significant textual conflicts with the "Per-VMA locks"
>> patchset, and there might be functional issues as well.  So mm.git
>> would be the better choice.
>>
>> Please can you redo and retest against tomorrow's mm-unstable branch
>> (git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm)?  Hopefully the
>> sched developers can take a look and provide feedback.
>>
> 
> Thank you Andrew. Sure will do that.
> 

Thanks again. Sent the rebased patches.

Just to record, so that new discussion can happen in the new posting:

https://lore.kernel.org/lkml/cover.1677672277.git.raghavendra.kt@amd.com/T/#t




