* [PATCH V2 0/3] sched/numa: Enhance vma scanning
@ 2023-02-01  8:02 Raghavendra K T
  2023-02-01  8:02 ` [PATCH V2 1/3] sched/numa: Apply the scan delay to every vma instead of tasks Raghavendra K T
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Raghavendra K T @ 2023-02-01  8:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja,
	Raghavendra K T

 The patchset implements one of the enhancements to NUMA vma scanning
suggested by Mel. It is a continuation of [2]. Though I have dropped the
RFC tag, I do think some parts need more feedback and refinement.

In the existing mechanism, the scan period is derived from per-thread
stats. Process Adaptive autoNUMA [1] proposed gathering NUMA fault stats
at the per-process level to capture application behaviour better.

During that discussion, Mel proposed several ideas to enhance current
NUMA balancing. One of the suggestions was:

Track what threads access a VMA. The suggestion was to use an unsigned
long pid_mask and use the lower bits to tag approximately which
threads access a VMA. Skip VMAs that did not trap a fault. This would
be approximate because of PID collisions, but would reduce scanning of
areas the thread is not interested in. The suggestion intends not to
penalize threads that have no interest in the vma, thus reducing scanning
overhead.
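
Roughly, the idea looks like the sketch below (illustrative only; the
field and helper names here are placeholders, the real implementation is
in patch 2):

	/* fault path: tag the faulting thread in the VMA (approximate) */
	vma->pid_mask |= 1UL << (current->pid % BITS_PER_LONG);

	/* scan path: skip VMAs that this thread never faulted on */
	if (!(vma->pid_mask & (1UL << (current->pid % BITS_PER_LONG))))
		continue;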

About the patchset:
Patch 1:
1) VMA scan delay logic is added (Mel): during the initial phase of a
VMA's life, scanning is delayed by sysctl_numa_balancing_scan_delay.

2) A new status structure (vma_numab) is added so as to not grow
vm_area_struct in the !NUMA_BALANCING case.

Patch 2:
3) The lower 6 bits of the PID are used as an index to remember which
PIDs accessed the VMA in the fault path. This is then used to restrict
scanning of the VMA in the scan path.

Please note that the first two scans are allowed unconditionally
(using numa_scan_seq). This may need to change, since numa_scan_seq is
per mm.

Patch 3:
4) Introduce basic periodic clearing of the accessed PIDs. This is
currently done every 4 * sysctl_numa_balancing_scan_delay interval.

This logic may need more experimentation/refinement.

Things to ponder over (and future TODOs):
==========================================
- Improvements to the logic for clearing accessing PIDs (discussed in
  detail in patch 3 itself)

- The scan period is not changed in this patchset, so we do see frequent
  scan attempts. Relaxing the scan period dynamically could improve
  results further.

Result Summary:
================
The results were obtained by running mmtests with the following config:
config-numa-autonumabench
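
For reference, each kernel was driven through mmtests roughly as below
(the exact invocation is from memory and may differ slightly between
mmtests versions):

	# from an mmtests checkout, once per kernel under test
	./run-mmtests.sh --config configs/config-numa-autonumabench 6.1.0-base
	./run-mmtests.sh --config configs/config-numa-autonumabench 6.1.0-patched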

There is a significant reduction in AutoNUMA cost in the benchmark
runs, but some of the results need improvement. I hope that the
potential changes mentioned in patch 3, and tuning numa_scan_period
based on the current scanning efficiency, will help. I will be working
on that.

SUT:
2 socket AMD Milan System
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           2

256GB memory per socket amounting to 512GB in total
NPS1 NUMA configuration where each socket is a NUMA node

autonumabench
                                             6.1.0 (base)       6.1.0 (patched)
BAmean-99 syst-NUMA01                  195.84 (   0.00%)       17.79 (  90.91%)
BAmean-99 syst-NUMA01_THREADLOCAL        0.19 (   0.00%)        0.19 (   2.56%)
BAmean-99 syst-NUMA02                    0.85 (   0.00%)        0.85 (   0.00%)
BAmean-99 syst-NUMA02_SMT                0.62 (   0.00%)        0.65 (  -4.30%)
BAmean-99 elsp-NUMA01                  254.95 (   0.00%)      322.69 ( -26.57%)
BAmean-99 elsp-NUMA01_THREADLOCAL        1.04 (   0.00%)        1.05 (  -1.29%)
BAmean-99 elsp-NUMA02                    3.08 (   0.00%)        3.29 (  -6.94%)
BAmean-99 elsp-NUMA02_SMT                3.49 (   0.00%)        3.43 (   1.91%)

                                   6.1.0 (base)  6.1.0 (patched)
Ops NUMA alloc hit                  59210941.00    50772531.00
Ops NUMA alloc miss                        0.00           0.00
Ops NUMA interleave hit                    0.00           0.00
Ops NUMA alloc local                59200395.00    50771359.00
Ops NUMA base-page range updates    90670863.00       10952.00
Ops NUMA PTE updates                90670863.00       10952.00
Ops NUMA PMD updates                       0.00           0.00
Ops NUMA hint faults                92069634.00        9501.00
Ops NUMA hint local faults %        69966984.00        9213.00
Ops NUMA hint local percent               75.99          96.97
Ops NUMA pages migrated              8424631.00         287.00
Ops AutoNUMA cost                     461142.93          47.59

[1] sched/numa: Process Adaptive autoNUMA 
 Link: https://lore.kernel.org/lkml/20220128052851.17162-1-bharata@amd.com/T/
[2] RFC V1:
 Link: https://lore.kernel.org/all/cover.1673610485.git.raghavendra.kt@amd.com/

Changes since RFC V1:
 - Include Mel's vma scan delay patch
 - Change the accessing-PID store logic (Thanks Mel)
 - Fence the new structure / code with CONFIG_NUMA_BALANCING (David, Mel)
 - Add logic to clear the accessed PID info (Mel)
 - More descriptive change log (Mike Rapoport)

Mel Gorman (1):
  sched/numa: Apply the scan delay to every vma instead of tasks

Raghavendra K T (2):
  sched/numa: Enhance vma scanning logic
  sched/numa: Reset the accessing PID information periodically

 include/linux/mm.h       | 23 +++++++++++++++++++
 include/linux/mm_types.h |  9 ++++++++
 kernel/fork.c            |  2 ++
 kernel/sched/fair.c      | 49 ++++++++++++++++++++++++++++++++++++++++
 mm/huge_memory.c         |  1 +
 mm/memory.c              |  1 +
 6 files changed, 85 insertions(+)

-- 
2.34.1



* [PATCH V2 1/3] sched/numa: Apply the scan delay to every vma instead of tasks
  2023-02-01  8:02 [PATCH V2 0/3] sched/numa: Enhance vma scanning Raghavendra K T
@ 2023-02-01  8:02 ` Raghavendra K T
  2023-02-03 10:24   ` Peter Zijlstra
  2023-02-01  8:02 ` [PATCH V2 2/3] sched/numa: Enhance vma scanning logic Raghavendra K T
  2023-02-01  8:02 ` [PATCH V2 3/3] sched/numa: Reset the accessing PID information periodically Raghavendra K T
  2 siblings, 1 reply; 17+ messages in thread
From: Raghavendra K T @ 2023-02-01  8:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja,
	Mel Gorman, Raghavendra K T

From: Mel Gorman <mgorman@techsingularity.net>

 Avoid scanning new or very short-lived VMAs.

(Raghavendra: Add initialization in vm_area_dup())

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/mm.h       |  9 +++++++++
 include/linux/mm_types.h |  7 +++++++
 kernel/fork.c            |  2 ++
 kernel/sched/fair.c      | 17 +++++++++++++++++
 4 files changed, 35 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 974ccca609d2..74d9df1d8982 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -611,6 +611,14 @@ struct vm_operations_struct {
 					  unsigned long addr);
 };
 
+#ifdef CONFIG_NUMA_BALANCING
+#define vma_numab_init(vma) do { (vma)->numab = NULL; } while (0)
+#define vma_numab_free(vma) do { kfree((vma)->numab); } while (0)
+#else
+static inline void vma_numab_init(struct vm_area_struct *vma) {}
+static inline void vma_numab_free(struct vm_area_struct *vma) {}
+#endif /* CONFIG_NUMA_BALANCING */
+
 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 {
 	static const struct vm_operations_struct dummy_vm_ops = {};
@@ -619,6 +627,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_mm = mm;
 	vma->vm_ops = &dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
+	vma_numab_init(vma);
 }
 
 static inline void vma_set_anonymous(struct vm_area_struct *vma)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 500e536796ca..e84f95a77321 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -435,6 +435,10 @@ struct anon_vma_name {
 	char name[];
 };
 
+struct vma_numab {
+	unsigned long next_scan;
+};
+
 /*
  * This struct describes a virtual memory area. There is one of these
  * per VM-area/task. A VM area is any part of the process virtual memory
@@ -504,6 +508,9 @@ struct vm_area_struct {
 #endif
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
+#endif
+#ifdef CONFIG_NUMA_BALANCING
+	struct vma_numab *numab;	/* NUMA Balancing state */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 } __randomize_layout;
diff --git a/kernel/fork.c b/kernel/fork.c
index 08969f5aa38d..ac6f0477cf6e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -474,6 +474,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 		 */
 		*new = data_race(*orig);
 		INIT_LIST_HEAD(&new->anon_vma_chain);
+		vma_numab_init(new);
 		dup_anon_vma_name(orig, new);
 	}
 	return new;
@@ -481,6 +482,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 
 void vm_area_free(struct vm_area_struct *vma)
 {
+	vma_numab_free(vma);
 	free_anon_vma_name(vma);
 	kmem_cache_free(vm_area_cachep, vma);
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e4a0b8bd941c..060b241ce3c5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3015,6 +3015,23 @@ static void task_numa_work(struct callback_head *work)
 		if (!vma_is_accessible(vma))
 			continue;
 
+		/* Initialise new per-VMA NUMAB state. */
+		if (!vma->numab) {
+			vma->numab = kzalloc(sizeof(struct vma_numab), GFP_KERNEL);
+			if (!vma->numab)
+				continue;
+
+			vma->numab->next_scan = now +
+				msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
+		}
+
+		/*
+		 * After the first scan is complete, delay the balancing scan
+		 * for new VMAs.
+		 */
+		if (mm->numa_scan_seq && time_before(jiffies, vma->numab->next_scan))
+			continue;
+
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
-- 
2.34.1



* [PATCH V2 2/3] sched/numa: Enhance vma scanning logic
  2023-02-01  8:02 [PATCH V2 0/3] sched/numa: Enhance vma scanning Raghavendra K T
  2023-02-01  8:02 ` [PATCH V2 1/3] sched/numa: Apply the scan delay to every vma instead of tasks Raghavendra K T
@ 2023-02-01  8:02 ` Raghavendra K T
  2023-02-03 11:15   ` Peter Zijlstra
  2023-02-01  8:02 ` [PATCH V2 3/3] sched/numa: Reset the accessing PID information periodically Raghavendra K T
  2 siblings, 1 reply; 17+ messages in thread
From: Raghavendra K T @ 2023-02-01  8:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja,
	Raghavendra K T

 During NUMA scanning, make sure only relevant vmas of the
tasks are scanned.

Before:
 All the tasks of a process participate in scanning a vma
even if they never access the vma in its lifespan.

Now:
 Except for the first few unconditional scans, if a task does
not touch a vma (excluding false positives from PID collisions),
tasks no longer scan all vmas.

Logic used:
1) 6 bits of the PID are used to mark an active bit in the vma numab
 state during a fault, to remember the PIDs accessing the vma. (Thanks Mel)

2) Subsequently, in the scan path, vma scanning is skipped if the
current PID has not accessed the vma.

3) The first two scans are still allowed unconditionally to preserve
 the earlier scanning behaviour.

Acknowledgement to Bharata B Rao <bharata@amd.com> for the initial patch
to store PID information.

Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/mm.h       | 14 ++++++++++++++
 include/linux/mm_types.h |  1 +
 kernel/sched/fair.c      | 15 +++++++++++++++
 mm/huge_memory.c         |  1 +
 mm/memory.c              |  1 +
 5 files changed, 32 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 74d9df1d8982..489422942482 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1381,6 +1381,16 @@ static inline int xchg_page_access_time(struct page *page, int time)
 	last_time = page_cpupid_xchg_last(page, time >> PAGE_ACCESS_TIME_BUCKETS);
 	return last_time << PAGE_ACCESS_TIME_BUCKETS;
 }
+
+static inline void vma_set_active_pid_bit(struct vm_area_struct *vma)
+{
+	unsigned int active_pid_bit;
+
+	if (vma->numab) {
+		active_pid_bit = current->pid % BITS_PER_LONG;
+		vma->numab->accessing_pids |= 1UL << active_pid_bit;
+	}
+}
 #else /* !CONFIG_NUMA_BALANCING */
 static inline int page_cpupid_xchg_last(struct page *page, int cpupid)
 {
@@ -1430,6 +1440,10 @@ static inline bool cpupid_match_pid(struct task_struct *task, int cpupid)
 {
 	return false;
 }
+
+static inline void vma_set_active_pid_bit(struct vm_area_struct *vma)
+{
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #if defined(CONFIG_KASAN_SW_TAGS) || defined(CONFIG_KASAN_HW_TAGS)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e84f95a77321..980a6a4308b6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -437,6 +437,7 @@ struct anon_vma_name {
 
 struct vma_numab {
 	unsigned long next_scan;
+	unsigned long accessing_pids;
 };
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 060b241ce3c5..3505ae57c07c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2916,6 +2916,18 @@ static void reset_ptenuma_scan(struct task_struct *p)
 	p->mm->numa_scan_offset = 0;
 }
 
+static bool vma_is_accessed(struct vm_area_struct *vma)
+{
+	unsigned int active_pid_bit;
+
+	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
+		return true;
+
+	active_pid_bit = current->pid % BITS_PER_LONG;
+
+	return vma->numab->accessing_pids & (1UL << active_pid_bit);
+}
+
 /*
  * The expensive part of numa migration is done from task_work context.
  * Triggered from task_tick_numa().
@@ -3032,6 +3044,9 @@ static void task_numa_work(struct callback_head *work)
 		if (mm->numa_scan_seq && time_before(jiffies, vma->numab->next_scan))
 			continue;
 
+		if (!vma_is_accessed(vma))
+			continue;
+
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 811d19b5c4f6..d908aa95f3c3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1485,6 +1485,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	bool was_writable = pmd_savedwrite(oldpmd);
 	int flags = 0;
 
+	vma_set_active_pid_bit(vma);
 	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
 	if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
 		spin_unlock(vmf->ptl);
diff --git a/mm/memory.c b/mm/memory.c
index 8c8420934d60..2ec3045cb8b3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4718,6 +4718,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	bool was_writable = pte_savedwrite(vmf->orig_pte);
 	int flags = 0;
 
+	vma_set_active_pid_bit(vma);
 	/*
 	 * The "pte" at this point cannot be used safely without
 	 * validation through pte_unmap_same(). It's of NUMA type but
-- 
2.34.1



* [PATCH V2 3/3] sched/numa: Reset the accessing PID information periodically
  2023-02-01  8:02 [PATCH V2 0/3] sched/numa: Enhance vma scanning Raghavendra K T
  2023-02-01  8:02 ` [PATCH V2 1/3] sched/numa: Apply the scan delay to every vma instead of tasks Raghavendra K T
  2023-02-01  8:02 ` [PATCH V2 2/3] sched/numa: Enhance vma scanning logic Raghavendra K T
@ 2023-02-01  8:02 ` Raghavendra K T
  2023-02-03 11:35   ` Peter Zijlstra
  2 siblings, 1 reply; 17+ messages in thread
From: Raghavendra K T @ 2023-02-01  8:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja,
	Raghavendra K T

 This helps to ensure that only recently accessed PIDs scan the
VMAs.

Current implementation:
 Reset the accessing PIDs every (4 * sysctl_numa_balancing_scan_delay)
interval, after the initial scan delay period expires. The reset logic
is implemented in the scan path.

Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
Some of the potential ideas for clearing the accessing PIDs

1) A flag to indicate the phase in the life cycle of the vma, tied to a
timestamp (reuse next_scan or so)

VMA life cycle

t1         t2         t3                    t4         t5                   t6
|<-  DS  ->|<-  US  ->|<-        CS       ->|<-  US  ->|<-        CS       ->|
Flags are used to indicate whether we are in the DS/US/CS phase.

DS (delayed scan): initial phase where scanning is avoided for a new VMA
US (unconditional scan): brief period where scanning is allowed irrespective of the task faulting on the VMA
CS (conditional scan): longer conditional scanning phase where a task is allowed to scan only VMAs of interest


2) Maintain a duplicate list of accessing PIDs to keep track of the
history of access, and switch/reset periodically; use an OR operation
during iteration.

 Two lists of PIDs are maintained. At a regular interval the old list is
reset and the current list becomes the old list. At any point in time the
set of PIDs accessing the VMA is determined by ORing list1 and list2.

accessing_pids_list1 <-  current list
accessing_pids_list2 <-  old list

3) Also maintain a per-vma numa_seq
Currently numa_seq (how many times we have scanned the entire set of VMAs)
is maintained at the mm level. Having a per-VMA numa_seq (roughly, how many
times the current VMA has been considered for scanning) may be helpful in
some contexts (e.g. whether to allow scanning a newly created VMA
unconditionally).

4) Reset accessing PIDs at regular intervals (current implementation)

t1       t2         t3         t4         t5         t6
|<- DS ->|<-  CS  ->|<-  CS  ->|<-  CS  ->|<-  CS  ->|

The current implementation resets the accessing PIDs every 4*scan_delay
interval after the initial scan delay expires. The reset logic is
implemented in the scan path.

 include/linux/mm_types.h |  1 +
 kernel/sched/fair.c      | 17 +++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 980a6a4308b6..08a007744ea1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -437,6 +437,7 @@ struct anon_vma_name {
 
 struct vma_numab {
 	unsigned long next_scan;
+	unsigned long next_pid_reset;
 	unsigned long accessing_pids;
 };
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3505ae57c07c..14db6d8a5090 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2928,6 +2928,8 @@ static bool vma_is_accessed(struct vm_area_struct *vma)
 	return vma->numab->accessing_pids & (1UL << active_pid_bit);
 }
 
+#define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
+
 /*
  * The expensive part of numa migration is done from task_work context.
  * Triggered from task_tick_numa().
@@ -3035,6 +3037,10 @@ static void task_numa_work(struct callback_head *work)
 
 			vma->numab->next_scan = now +
 				msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
+
+			/* Reset happens after 4 times scan delay of scan start */
+			vma->numab->next_pid_reset =  vma->numab->next_scan +
+				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
 		}
 
 		/*
@@ -3047,6 +3053,17 @@ static void task_numa_work(struct callback_head *work)
 		if (!vma_is_accessed(vma))
 			continue;
 
+		/*
+		 * RESET accessing PIDs regularly for old VMAs. Resetting after checking
+		 * vma for recent access to avoid clearing PID info before access..
+		 */
+		if (mm->numa_scan_seq &&
+				time_after(jiffies, vma->numab->next_pid_reset)) {
+			vma->numab->next_pid_reset =  vma->numab->next_pid_reset +
+				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
+			vma->numab->accessing_pids = 0;
+		}
+
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
-- 
2.34.1



* Re: [PATCH V2 1/3] sched/numa: Apply the scan delay to every vma instead of tasks
  2023-02-01  8:02 ` [PATCH V2 1/3] sched/numa: Apply the scan delay to every vma instead of tasks Raghavendra K T
@ 2023-02-03 10:24   ` Peter Zijlstra
  2023-02-04 17:19     ` Raghavendra K T
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2023-02-03 10:24 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: linux-kernel, linux-mm, Ingo Molnar, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja,
	Mel Gorman

On Wed, Feb 01, 2023 at 01:32:20PM +0530, Raghavendra K T wrote:
> From: Mel Gorman <mgorman@techsingularity.net>
> 
>  Avoid scanning new or very short-lived VMAs.
> 
> (Raghavendra: Add initialization in vm_area_dup())

Given this is a performance centric patch -- some sort of qualification
/ justification would be much appreciated.

Also, perhaps explain the rationale for the actual heuristics chosen.

> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
> ---
>  include/linux/mm.h       |  9 +++++++++
>  include/linux/mm_types.h |  7 +++++++
>  kernel/fork.c            |  2 ++
>  kernel/sched/fair.c      | 17 +++++++++++++++++
>  4 files changed, 35 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 974ccca609d2..74d9df1d8982 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -611,6 +611,14 @@ struct vm_operations_struct {
>  					  unsigned long addr);
>  };
>  
> +#ifdef CONFIG_NUMA_BALANCING
> +#define vma_numab_init(vma) do { (vma)->numab = NULL; } while (0)
> +#define vma_numab_free(vma) do { kfree((vma)->numab); } while (0)
> +#else
> +static inline void vma_numab_init(struct vm_area_struct *vma) {}
> +static inline void vma_numab_free(struct vm_area_struct *vma) {}
> +#endif /* CONFIG_NUMA_BALANCING */

I'm tripping over the inconsistency of macros and functions here. I'd
suggest making both cases functions.
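
E.g. something like (completely untested):

	#ifdef CONFIG_NUMA_BALANCING
	static inline void vma_numab_init(struct vm_area_struct *vma)
	{
		vma->numab = NULL;
	}

	static inline void vma_numab_free(struct vm_area_struct *vma)
	{
		kfree(vma->numab);
	}
	#else
	static inline void vma_numab_init(struct vm_area_struct *vma) {}
	static inline void vma_numab_free(struct vm_area_struct *vma) {}
	#endif /* CONFIG_NUMA_BALANCING */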


> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 500e536796ca..e84f95a77321 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -435,6 +435,10 @@ struct anon_vma_name {
>  	char name[];
>  };
>  
> +struct vma_numab {
> +	unsigned long next_scan;
> +};

I'm not sure what a numab is; contraction of new-kebab, something else?

While I appreciate short names, they'd ideally also make sense. If we
cannot come up with a better one, perhaps elucidate the reader with a
comment.

> +
>  /*
>   * This struct describes a virtual memory area. There is one of these
>   * per VM-area/task. A VM area is any part of the process virtual memory
> @@ -504,6 +508,9 @@ struct vm_area_struct {

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e4a0b8bd941c..060b241ce3c5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3015,6 +3015,23 @@ static void task_numa_work(struct callback_head *work)
>  		if (!vma_is_accessible(vma))
>  			continue;
>  
> +		/* Initialise new per-VMA NUMAB state. */
> +		if (!vma->numab) {
> +			vma->numab = kzalloc(sizeof(struct vma_numab), GFP_KERNEL);
> +			if (!vma->numab)
> +				continue;
> +
> +			vma->numab->next_scan = now +
> +				msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
> +		}
> +
> +		/*
> +		 * After the first scan is complete, delay the balancing scan
> +		 * for new VMAs.
> +		 */
> +		if (mm->numa_scan_seq && time_before(jiffies, vma->numab->next_scan))
> +			continue;

I think I sorta see why, but I'm thinking it would be good to include
more of the why in that comment.


* Re: [PATCH V2 2/3] sched/numa: Enhance vma scanning logic
  2023-02-01  8:02 ` [PATCH V2 2/3] sched/numa: Enhance vma scanning logic Raghavendra K T
@ 2023-02-03 11:15   ` Peter Zijlstra
  2023-02-03 11:27     ` Peter Zijlstra
  2023-02-04 18:14     ` Raghavendra K T
  0 siblings, 2 replies; 17+ messages in thread
From: Peter Zijlstra @ 2023-02-03 11:15 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: linux-kernel, linux-mm, Ingo Molnar, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja

On Wed, Feb 01, 2023 at 01:32:21PM +0530, Raghavendra K T wrote:
>  During the Numa scanning make sure only relevant vmas of the
> tasks are scanned.
> 
> Before:
>  All the tasks of a process participate in scanning the vma
> even if they do not access vma in it's lifespan.
> 
> Now:
>  Except cases of first few unconditional scans, if a process do
> not touch vma (exluding false positive cases of PID collisions)
> tasks no longer scan all vma.
> 
> Logic used:
> 1) 6 bits of PID used to mark active bit in vma numab status during
>  fault to remember PIDs accessing vma. (Thanks Mel)
> 
> 2) Subsequently in scan path, vma scanning is skipped if current PID
> had not accessed vma.
> 
> 3) First two times we do allow unconditional scan to preserve earlier
>  behaviour of scanning.
> 
> Acknowledgement to Bharata B Rao <bharata@amd.com> for initial patch
> to store pid information.
> 
> Suggested-by: Mel Gorman <mgorman@techsingularity.net>
> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
> ---
>  include/linux/mm.h       | 14 ++++++++++++++
>  include/linux/mm_types.h |  1 +
>  kernel/sched/fair.c      | 15 +++++++++++++++
>  mm/huge_memory.c         |  1 +
>  mm/memory.c              |  1 +
>  5 files changed, 32 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 74d9df1d8982..489422942482 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1381,6 +1381,16 @@ static inline int xchg_page_access_time(struct page *page, int time)
>  	last_time = page_cpupid_xchg_last(page, time >> PAGE_ACCESS_TIME_BUCKETS);
>  	return last_time << PAGE_ACCESS_TIME_BUCKETS;
>  }
> +
> +static inline void vma_set_active_pid_bit(struct vm_area_struct *vma)
> +{
> +	unsigned int active_pid_bit;
> +
> +	if (vma->numab) {
> +		active_pid_bit = current->pid % BITS_PER_LONG;
> +		vma->numab->accessing_pids |= 1UL << active_pid_bit;
> +	}
> +}

Perhaps:

	if (vma->numab)
		__set_bit(current->pid % BITS_PER_LONG, &vma->numab->pids);

?

Or maybe even:

	bit = current->pid % BITS_PER_LONG;
	if (vma->numab && !__test_bit(bit, &vma->numab->pids))
		__set_bit(bit, &vma->numab->pids);


> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 060b241ce3c5..3505ae57c07c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2916,6 +2916,18 @@ static void reset_ptenuma_scan(struct task_struct *p)
>  	p->mm->numa_scan_offset = 0;
>  }
>  
> +static bool vma_is_accessed(struct vm_area_struct *vma)
> +{
> +	unsigned int active_pid_bit;
> +
	/*
	 * Tell us why 2....
	 */
> +	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
> +		return true;
> +
> +	active_pid_bit = current->pid % BITS_PER_LONG;
> +
> +	return vma->numab->accessing_pids & (1UL << active_pid_bit);
	return __test_bit(current->pid % BITS_PER_LONG, &vma->numab->pids)
> +}
> +
>  /*
>   * The expensive part of numa migration is done from task_work context.
>   * Triggered from task_tick_numa().
> @@ -3032,6 +3044,9 @@ static void task_numa_work(struct callback_head *work)
>  		if (mm->numa_scan_seq && time_before(jiffies, vma->numab->next_scan))
>  			continue;
>  
		/*
		 * tell us more...
		 */
> +		if (!vma_is_accessed(vma))
> +			continue;
> +
>  		do {
>  			start = max(start, vma->vm_start);
>  			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);


This feels wrong, specifically we track numa_scan_offset per mm, now, if
we divide the threads into two dis-joint groups each only using their
own set of vmas (in fact quite common for workloads with proper data
partitioning) it is possible to consistently sample one set of threads
and thus not scan the other set of vmas.

It seems somewhat unlikely, but not impossible to create significant
unfairness.

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 811d19b5c4f6..d908aa95f3c3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1485,6 +1485,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>  	bool was_writable = pmd_savedwrite(oldpmd);
>  	int flags = 0;
>  
> +	vma_set_active_pid_bit(vma);
>  	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>  	if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
>  		spin_unlock(vmf->ptl);
> diff --git a/mm/memory.c b/mm/memory.c
> index 8c8420934d60..2ec3045cb8b3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4718,6 +4718,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  	bool was_writable = pte_savedwrite(vmf->orig_pte);
>  	int flags = 0;
>  
> +	vma_set_active_pid_bit(vma);
>  	/*
>  	 * The "pte" at this point cannot be used safely without
>  	 * validation through pte_unmap_same(). It's of NUMA type but

Urghh... do_*numa_page() is two near identical functions.. is there
really no sane way to de-duplicate at least some of that?

Also, is this placement right, you're marking the thread even before we
know there's even a page there. I would expect this somewhere around
where we track lastpid.

Maybe numa_migrate_prep() ?
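
I.e. something like this (untested sketch; numa_migrate_prep() is shared
by do_numa_page() and do_huge_pmd_numa_page(), so both fault paths would
be covered):

	int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
			      unsigned long addr, int page_nid, int *flags)
	{
		/* Record the accessing task once the hinting fault is confirmed. */
		vma_set_active_pid_bit(vma);

		get_page(page);
		count_vm_numa_event(NUMA_HINT_FAULTS);
		if (page_nid == numa_node_id()) {
			count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
			*flags |= TNF_FAULT_LOCAL;
		}

		return mpol_misplaced(page, vma, addr);
	}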


* Re: [PATCH V2 2/3] sched/numa: Enhance vma scanning logic
  2023-02-03 11:15   ` Peter Zijlstra
@ 2023-02-03 11:27     ` Peter Zijlstra
  2023-02-04 18:18       ` Raghavendra K T
  2023-02-04 18:14     ` Raghavendra K T
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2023-02-03 11:27 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: linux-kernel, linux-mm, Ingo Molnar, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja

On Fri, Feb 03, 2023 at 12:15:48PM +0100, Peter Zijlstra wrote:

> > +static inline void vma_set_active_pid_bit(struct vm_area_struct *vma)
> > +{
> > +	unsigned int active_pid_bit;
> > +
> > +	if (vma->numab) {
> > +		active_pid_bit = current->pid % BITS_PER_LONG;
> > +		vma->numab->accessing_pids |= 1UL << active_pid_bit;
> > +	}
> > +}
> 
> Perhaps:
> 
> 	if (vma->numab)
> 		__set_bit(current->pid % BITS_PER_LONG, &vma->numab->pids);
> 
> ?
> 
> Or maybe even:
> 
> 	bit = current->pid % BITS_PER_LONG;
> 	if (vma->numab && !__test_bit(bit, &vma->numab->pids))
> 		__set_bit(bit, &vma->numab->pids);
> 

The alternative to just taking the low n bits is to use:

  hash_32(current->pid, BITS_PER_LONG)

That mixes things up a bit.
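
E.g. (untested; hash_32() takes the number of result bits, so the index
would end up something like):

	unsigned int bit = hash_32(current->pid, ilog2(BITS_PER_LONG));

	if (vma->numab && !__test_bit(bit, &vma->numab->pids))
		__set_bit(bit, &vma->numab->pids);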


* Re: [PATCH V2 3/3] sched/numa: Reset the accessing PID information periodically
  2023-02-01  8:02 ` [PATCH V2 3/3] sched/numa: Reset the accessing PID information periodically Raghavendra K T
@ 2023-02-03 11:35   ` Peter Zijlstra
  2023-02-04 18:32     ` Raghavendra K T
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2023-02-03 11:35 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: linux-kernel, linux-mm, Ingo Molnar, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja

On Wed, Feb 01, 2023 at 01:32:22PM +0530, Raghavendra K T wrote:

> 2) Maintain duplicate list of accessing PIDs to keep track of history of access. and switch/reset. use OR operation during iteration
> 
>  Two lists of PIDs maintained. At regular interval old list is reset and we make current list as old list
> At any point of time tracking of PIDs accessing VMA is determined by ORing list1 and list2  
> 
> accessing_pids_list1 <-  current list
> accessing_pids_list2 <-  old list

( I'm not sure why you think this part of the email doesn't need to be
  nicely wrapped at 76 chars.. )

This seems simple enough to me and can be trivially extended to N if
needed.

The typical implementation would look something like:

	unsigned long pids[N];
	unsigned int pid_idx;

set:
	unsigned long *pids = numab->pids + pid_idx;
	if (!__test_bit(bit, pids))
		__set_bit(bit, pids);

test:
	unsigned long pids = 0;
	for (int i = 0; i < N; i++)
		pids |= numab->pids[i];
	return __test_bit(bit, &pids);

rotate:
	idx = READ_ONCE(numab->pid_idx);
	WRITE_ONCE(numab->pid_idx, (idx + 1) % N);
	numab->pids[idx] = 0;

Note the actual rotate can be simplified to ^1 for N:=2.
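
I.e. for N == 2 the rotate becomes:

rotate:
	idx = READ_ONCE(numab->pid_idx);
	WRITE_ONCE(numab->pid_idx, idx ^ 1);
	numab->pids[idx] = 0;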


* Re: [PATCH V2 1/3] sched/numa: Apply the scan delay to every vma instead of tasks
  2023-02-03 10:24   ` Peter Zijlstra
@ 2023-02-04 17:19     ` Raghavendra K T
  0 siblings, 0 replies; 17+ messages in thread
From: Raghavendra K T @ 2023-02-04 17:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Ingo Molnar, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja,
	Mel Gorman

On 2/3/2023 3:54 PM, Peter Zijlstra wrote:
> On Wed, Feb 01, 2023 at 01:32:20PM +0530, Raghavendra K T wrote:
>> From: Mel Gorman <mgorman@techsingularity.net>
>>
>>   Avoid scanning new or very short-lived VMAs.
>>
>> (Raghavendra: Add initialization in vm_area_dup())
> 
> Given this is a performance centric patch -- some sort of qualification
> / justification would be much appreciated.
> 

Thank you, Peter, for the review.
Sure, I will add more detailed results in the cover letter and a summary
in the patch commit message.

> Also, perhaps explain the rationale for the actual heuristics chosen.
> 

Sure, I will add more detail in V3.

>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
>> ---
>>   include/linux/mm.h       |  9 +++++++++
>>   include/linux/mm_types.h |  7 +++++++
>>   kernel/fork.c            |  2 ++
>>   kernel/sched/fair.c      | 17 +++++++++++++++++
>>   4 files changed, 35 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 974ccca609d2..74d9df1d8982 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -611,6 +611,14 @@ struct vm_operations_struct {
>>   					  unsigned long addr);
>>   };
>>   
>> +#ifdef CONFIG_NUMA_BALANCING
>> +#define vma_numab_init(vma) do { (vma)->numab = NULL; } while (0)
>> +#define vma_numab_free(vma) do { kfree((vma)->numab); } while (0)
>> +#else
>> +static inline void vma_numab_init(struct vm_area_struct *vma) {}
>> +static inline void vma_numab_free(struct vm_area_struct *vma) {}
>> +#endif /* CONFIG_NUMA_BALANCING */
> 
> I'm tripping over the inconsistency of macros and functions here. I'd
> suggest making both cases functions.
> 
> 

Sure will do that

>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 500e536796ca..e84f95a77321 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -435,6 +435,10 @@ struct anon_vma_name {
>>   	char name[];
>>   };
>>   
>> +struct vma_numab {
>> +	unsigned long next_scan;
>> +};
> 
> I'm not sure what a numab is; contraction of new-kebab, something else?
> 
> While I appreciate short names, they'd ideally also make sense. If we
> cannot come up with a better one, perhaps elucidate the reader with a
> comment.

Agree.. How about vma_numab, vma_numab_state or vma_numab_info as
abbreviations for vma_numa_balancing_info/state?

> 
>> +
>>   /*
>>    * This struct describes a virtual memory area. There is one of these
>>    * per VM-area/task. A VM area is any part of the process virtual memory
>> @@ -504,6 +508,9 @@ struct vm_area_struct {
> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e4a0b8bd941c..060b241ce3c5 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3015,6 +3015,23 @@ static void task_numa_work(struct callback_head *work)
>>   		if (!vma_is_accessible(vma))
>>   			continue;
>>   
>> +		/* Initialise new per-VMA NUMAB state. */
>> +		if (!vma->numab) {
>> +			vma->numab = kzalloc(sizeof(struct vma_numab), GFP_KERNEL);
>> +			if (!vma->numab)
>> +				continue;
>> +
>> +			vma->numab->next_scan = now +
>> +				msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
>> +		}
>> +
>> +		/*
>> +		 * After the first scan is complete, delay the balancing scan
>> +		 * for new VMAs.
>> +		 */
>> +		if (mm->numa_scan_seq && time_before(jiffies, vma->numab->next_scan))
>> +			continue;
> 
> I think I sorta see why, but I'm thinking it would be good to include
> more of the why in that comment.

Sure. Will add something along the lines of: "scanning the VMAs of
short-lived tasks adds more overhead than benefit ...".


* Re: [PATCH V2 2/3] sched/numa: Enhance vma scanning logic
  2023-02-03 11:15   ` Peter Zijlstra
  2023-02-03 11:27     ` Peter Zijlstra
@ 2023-02-04 18:14     ` Raghavendra K T
  2023-02-07  6:41       ` Raghavendra K T
  2023-02-28  4:59       ` Raghavendra K T
  1 sibling, 2 replies; 17+ messages in thread
From: Raghavendra K T @ 2023-02-04 18:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Ingo Molnar, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja

On 2/3/2023 4:45 PM, Peter Zijlstra wrote:
> On Wed, Feb 01, 2023 at 01:32:21PM +0530, Raghavendra K T wrote:
>>   During the Numa scanning make sure only relevant vmas of the
>> tasks are scanned.
>>
>> Before:
>>   All the tasks of a process participate in scanning the vma
>> even if they do not access vma in it's lifespan.
>>
>> Now:
>>   Except cases of first few unconditional scans, if a process do
>> not touch vma (exluding false positive cases of PID collisions)
>> tasks no longer scan all vma.
>>
>> Logic used:
>> 1) 6 bits of PID used to mark active bit in vma numab status during
>>   fault to remember PIDs accessing vma. (Thanks Mel)
>>
>> 2) Subsequently in scan path, vma scanning is skipped if current PID
>> had not accessed vma.
>>
>> 3) First two times we do allow unconditional scan to preserve earlier
>>   behaviour of scanning.
>>
>> Acknowledgement to Bharata B Rao <bharata@amd.com> for initial patch
>> to store pid information.
>>
>> Suggested-by: Mel Gorman <mgorman@techsingularity.net>
>> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
>> ---
>>   include/linux/mm.h       | 14 ++++++++++++++
>>   include/linux/mm_types.h |  1 +
>>   kernel/sched/fair.c      | 15 +++++++++++++++
>>   mm/huge_memory.c         |  1 +
>>   mm/memory.c              |  1 +
>>   5 files changed, 32 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 74d9df1d8982..489422942482 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1381,6 +1381,16 @@ static inline int xchg_page_access_time(struct page *page, int time)
>>   	last_time = page_cpupid_xchg_last(page, time >> PAGE_ACCESS_TIME_BUCKETS);
>>   	return last_time << PAGE_ACCESS_TIME_BUCKETS;
>>   }
>> +
>> +static inline void vma_set_active_pid_bit(struct vm_area_struct *vma)
>> +{
>> +	unsigned int active_pid_bit;
>> +
>> +	if (vma->numab) {
>> +		active_pid_bit = current->pid % BITS_PER_LONG;
>> +		vma->numab->accessing_pids |= 1UL << active_pid_bit;
>> +	}
>> +}
> 
> Perhaps:
> 
> 	if (vma->numab)
> 		__set_bit(current->pid % BITS_PER_LONG, &vma->numab->pids);
> 
> ?
> 
> Or maybe even:
> 
> 	bit = current->pid % BITS_PER_LONG;
> 	if (vma->numab && !__test_bit(bit, &vma->numab->pids))
> 		__set_bit(bit, &vma->numab->pids);
> 
> 

Sure ..will use one of the above.

>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 060b241ce3c5..3505ae57c07c 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2916,6 +2916,18 @@ static void reset_ptenuma_scan(struct task_struct *p)
>>   	p->mm->numa_scan_offset = 0;
>>   }
>>   
>> +static bool vma_is_accessed(struct vm_area_struct *vma)
>> +{
>> +	unsigned int active_pid_bit;
>> +
> 	/*
> 	 * Tell us why 2....
> 	 */

Agree. The logic is mainly about allowing an unconditional scan the first
two times, to build the task/page relation. I will experiment with whether
we further need to allow two full passes, as with "multi-stage node
selection" (= 4), to take care of early migration.

The only doubt I have is that numa_scan_seq is per mm; will that create
corner cases, or do we need a separate per-vma count for when a new VMA
is created?

>> +	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
>> +		return true;
>> +
>> +	active_pid_bit = current->pid % BITS_PER_LONG;
>> +
>> +	return vma->numab->accessing_pids & (1UL << active_pid_bit);
> 	return __test_bit(current->pid % BITS_PER_LONG, &vma->numab->pids)
>> +}
>> +
>>   /*
>>    * The expensive part of numa migration is done from task_work context.
>>    * Triggered from task_tick_numa().
>> @@ -3032,6 +3044,9 @@ static void task_numa_work(struct callback_head *work)
>>   		if (mm->numa_scan_seq && time_before(jiffies, vma->numab->next_scan))
>>   			continue;
>>   
> 		/*
> 		 * tell us more...
> 		 */

Sure. This is the core of the whole logic, where we mostly want to
confine VMA scanning to the PIDs of interest.

>> +		if (!vma_is_accessed(vma))
>> +			continue;
>> +
>>   		do {
>>   			start = max(start, vma->vm_start);
>>   			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
> 
> 
> This feels wrong, specifically we track numa_scan_offset per mm, now, if
> we divide the threads into two dis-joint groups each only using their
> own set of vmas (in fact quite common for workloads with proper data
> partitioning) it is possible to consistently sample one set of threads
> and thus not scan the other set of vmas.
> 
> It seems somewhat unlikely, but not impossible to create significant
> unfairness.
> 

Agree, but that is the reason why we allow the first few
unconditional scans. Or am I missing something?

>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 811d19b5c4f6..d908aa95f3c3 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1485,6 +1485,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>   	bool was_writable = pmd_savedwrite(oldpmd);
>>   	int flags = 0;
>>   
>> +	vma_set_active_pid_bit(vma);
>>   	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>>   	if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
>>   		spin_unlock(vmf->ptl);
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 8c8420934d60..2ec3045cb8b3 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4718,6 +4718,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>>   	bool was_writable = pte_savedwrite(vmf->orig_pte);
>>   	int flags = 0;
>>   
>> +	vma_set_active_pid_bit(vma);
>>   	/*
>>   	 * The "pte" at this point cannot be used safely without
>>   	 * validation through pte_unmap_same(). It's of NUMA type but
> 
> Urghh... do_*numa_page() is two near identical functions.. is there
> really no sane way to de-duplicate at least some of that?
> 

Agree. I will explore and will take that as a separate TODO.

> Also, is this placement right, you're marking the thread even before we
> know there's even a page there. I would expect this somewhere around
> where we track lastpid.
> 

Good point. I will check this again

> Maybe numa_migrate_prep() ?

Yes.. there was no hurry to record the accessing PID that early above...



* Re: [PATCH V2 2/3] sched/numa: Enhance vma scanning logic
  2023-02-03 11:27     ` Peter Zijlstra
@ 2023-02-04 18:18       ` Raghavendra K T
  0 siblings, 0 replies; 17+ messages in thread
From: Raghavendra K T @ 2023-02-04 18:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Ingo Molnar, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja

On 2/3/2023 4:57 PM, Peter Zijlstra wrote:
> On Fri, Feb 03, 2023 at 12:15:48PM +0100, Peter Zijlstra wrote:
> 
>>> +static inline void vma_set_active_pid_bit(struct vm_area_struct *vma)
>>> +{
>>> +	unsigned int active_pid_bit;
>>> +
>>> +	if (vma->numab) {
>>> +		active_pid_bit = current->pid % BITS_PER_LONG;
>>> +		vma->numab->accessing_pids |= 1UL << active_pid_bit;
>>> +	}
>>> +}
>>
>> Perhaps:
>>
>> 	if (vma->numab)
>> 		__set_bit(current->pid % BITS_PER_LONG, &vma->numab->pids);
>>
>> ?
>>
>> Or maybe even:
>>
>> 	bit = current->pid % BITS_PER_LONG;
>> 	if (vma->numab && !__test_bit(bit, &vma->numab->pids))
>> 		__set_bit(bit, &vma->numab->pids);
>>
> 
> The alternative to just taking the low n bits is to use:
> 
>    hash_32(current->pid, BITS_PER_LONG)
> 
> That mixes things up a bit.

Good idea. For workloads that create a smaller number of threads quickly,
the current solution might have been simpler, but when thread creation
happens over a period of time the hash function mixes things up and avoids
collisions. Will experiment with this option.


* Re: [PATCH V2 3/3] sched/numa: Reset the accessing PID information periodically
  2023-02-03 11:35   ` Peter Zijlstra
@ 2023-02-04 18:32     ` Raghavendra K T
  0 siblings, 0 replies; 17+ messages in thread
From: Raghavendra K T @ 2023-02-04 18:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Ingo Molnar, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja

On 2/3/2023 5:05 PM, Peter Zijlstra wrote:
> On Wed, Feb 01, 2023 at 01:32:22PM +0530, Raghavendra K T wrote:
> 
>> 2) Maintain duplicate list of accessing PIDs to keep track of history of access. and switch/reset. use OR operation during iteration
>>
>>   Two lists of PIDs maintained. At regular interval old list is reset and we make current list as old list
>> At any point of time tracking of PIDs accessing VMA is determined by ORing list1 and list2
>>
>> accessing_pids_list1 <-  current list
>> accessing_pids_list2 <-  old list
> 
> ( I'm not sure why you think this part of the email doesn't need to be
>    nicely wrapped at 76 chars.. )
> 

Sorry.. copy pasted from my "idea" notes then word wrap fooled me..

> This seems simple enough to me and can be trivially extended to N if
> needed.
> 
> The typical implementation would look something like:
> 
> 	unsigned long pids[N];
> 	unsigned int pid_idx;
> 
> set:
> 	unsigned long *pids = numab->pids + pid_idx;
> 	if (!__test_bit(bit, pids))
> 		__set_bit(bit, pids);
> 
> test:
> 	unsigned long pids = 0;
> 	for (int i = 0; i < N; i++)
> 		pids |= numab->pids[i];
> 	return __test_bit(bit, &pids);
> 
> rotate:
> 	idx = READ_ONCE(numab->pid_idx);
> 	WRITE_ONCE(numab->pid_idx, (idx + 1) % N);
> 	numab->pids[idx] = 0;
> 
> Note the actual rotate can be simplified to ^1 for N:=2.

Thanks, good idea. This will be very helpful when we want to
differentiate accessing PIDs in a more granular way. Perhaps we can go
with N=2 and stick to the below simplification of your code above?

Something like:

unsigned long pids[2];

// Assume pids[1] always has the latest detail
set:
	if (!__test_bit(bit, &pids[1]))
		__set_bit(bit, &pids[1]);

test:
	unsigned long access_pids = pids[0] | pids[1];
	return __test_bit(bit, &access_pids);

rotate:
	pids[0] = pids[1];
	pids[1] = 0;


* Re: [PATCH V2 2/3] sched/numa: Enhance vma scanning logic
  2023-02-04 18:14     ` Raghavendra K T
@ 2023-02-07  6:41       ` Raghavendra K T
  2023-02-27  6:40         ` Raghavendra K T
  2023-02-28  4:59       ` Raghavendra K T
  1 sibling, 1 reply; 17+ messages in thread
From: Raghavendra K T @ 2023-02-07  6:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Ingo Molnar, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja

On 2/4/2023 11:44 PM, Raghavendra K T wrote:
> On 2/3/2023 4:45 PM, Peter Zijlstra wrote:
>> On Wed, Feb 01, 2023 at 01:32:21PM +0530, Raghavendra K T wrote:
[...]
> 
>>> +        if (!vma_is_accessed(vma))
>>> +            continue;
>>> +
>>>           do {
>>>               start = max(start, vma->vm_start);
>>>               end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
>>
>>
>> This feels wrong, specifically we track numa_scan_offset per mm, now, if
>> we divide the threads into two dis-joint groups each only using their
>> own set of vmas (in fact quite common for workloads with proper data
>> partitioning) it is possible to consistently sample one set of threads
>> and thus not scan the other set of vmas.
>>
>> It seems somewhat unlikely, but not impossible to create significant
>> unfairness.
>>
> 
> Agree, But that is the reason why we want to allow first few
> unconditional scans Or am I missing something?
> 

Thinking further, maybe we can summarize the different aspects of the
thread / two-disjoint-set case into:

1) Unfairness because of the way in which threads get the opportunity
to scan.

2) The disjoint sets of vmas in the partition could be of different sizes.

3) The disjoint sets of vmas could be associated with different numbers of
threads.

Each of the above can potentially help, or make some thread do the heavy
lifting, but (2) and (3) are what I think we are trying to be okay with, by
making sure tasks mostly do not scan others' vmas.

(1) could be a real issue (though I know there are many places where we
have corrected the issue by introducing an offset in p->numa_next_scan),
but how the distribution looks in practice I take as a TODO and will post.


* Re: [PATCH V2 2/3] sched/numa: Enhance vma scanning logic
  2023-02-07  6:41       ` Raghavendra K T
@ 2023-02-27  6:40         ` Raghavendra K T
  2023-02-27 10:06           ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Raghavendra K T @ 2023-02-27  6:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Ingo Molnar, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja

On 2/7/2023 12:11 PM, Raghavendra K T wrote:
> On 2/4/2023 11:44 PM, Raghavendra K T wrote:
>> On 2/3/2023 4:45 PM, Peter Zijlstra wrote:
>>> On Wed, Feb 01, 2023 at 01:32:21PM +0530, Raghavendra K T wrote:
> [...]
>>
>>>> +        if (!vma_is_accessed(vma))
>>>> +            continue;
>>>> +
>>>>           do {
>>>>               start = max(start, vma->vm_start);
>>>>               end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
>>>
>>>
>>> This feels wrong, specifically we track numa_scan_offset per mm, now, if
>>> we divide the threads into two dis-joint groups each only using their
>>> own set of vmas (in fact quite common for workloads with proper data
>>> partitioning) it is possible to consistently sample one set of threads
>>> and thus not scan the other set of vmas.
>>>
>>> It seems somewhat unlikely, but not impossible to create significant
>>> unfairness.
>>>
>>
>> Agree, But that is the reason why we want to allow first few
>> unconditional scans Or am I missing something?
>>
> 
> Thinking further, may be we can summarize the different aspects of 
> thread/ two disjoint set case itself into:
> 
> 1) Unfairness because of way in which threads gets opportunity
> to scan.
> 
> 2) Disjoint set of vmas in the partition set could be of different sizes
> 
> 3) Disjoint set of vmas could be associated with different number of
> threads
> 
> Each of above can potentially help or make some thread do heavy lifting
> 
> but (2), and (3). is what I think we are trying to be Okay with by
> making sure tasks mostly do not scan others' vmas
> 
> (1) could be a real issue (though I know that there are many places we
>   have corrected the issue by introducing offset in p->numa_next_scan)
> but how the distribution looks now practically, I take it as a TODO and
> post.


Hello PeterZ,
Sorry to come back a little late with data on how unfair we are when we
have the two disjoint sets of groups mentioned above:

Program I tested with (a rough sketch of it is below):
1. A 128/256 thread program divided into two sets, each accessing a
different set of memory (set0, set1) from node1 and node2 to induce
migration (8GB divided into 4GB each).
Total iterations per thread was around 2048, so that the
128 thread program ran around 8 min, and the
256 thread program ran around 17/18 min.
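
The sketch below is illustrative only; node numbers, sizes and the
per-thread slicing are approximations of what was actually run (built
with something like gcc -O2 -pthread prog.c -lnuma):

	#include <numa.h>
	#include <pthread.h>
	#include <string.h>

	#define NTHREADS	128
	#define SET_SIZE	(4UL << 30)	/* 4GB per set */
	#define ITERS		2048

	static char *set_mem[2];

	static void *worker(void *arg)
	{
		long id = (long)arg;
		size_t slice_sz = SET_SIZE / (NTHREADS / 2);
		/* threads split into two disjoint groups: set0 and set1 */
		char *slice = set_mem[id & 1] + (id / 2) * slice_sz;

		for (int i = 0; i < ITERS; i++)
			memset(slice, i, slice_sz);	/* touch only "our" set */
		return NULL;
	}

	int main(void)
	{
		pthread_t tid[NTHREADS];

		/* back each set with memory from a different NUMA node */
		set_mem[0] = numa_alloc_onnode(SET_SIZE, 0);
		set_mem[1] = numa_alloc_onnode(SET_SIZE, 1);
		if (!set_mem[0] || !set_mem[1])
			return 1;

		for (long i = 0; i < NTHREADS; i++)
			pthread_create(&tid[i], NULL, worker, (void *)i);
		for (long i = 0; i < NTHREADS; i++)
			pthread_join(tid[i], NULL);
		return 0;
	}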

SUT: 128 core 256 CPU Milan with 2 nodes

Some of the observations:
1) The total number of threads which got a chance to scan over the entire
program span was around 50-62%.

E.g. in the 128 thread case the total number of tasks that got to scan was
around 74-84; for 256 threads the range was 123-146.

2) There was always a little bias towards set 1 (the threads where the
remote faults occurred).

In summary: I do see that access to VMAs from disjoint sets is not fully
fair, but on the other hand it is not very bad either. There is definitely
some scope to explore/improve fairness in this area further.

Posting the result for one of the 128-thread runs (base 6.1.0 vanilla):

column1: frequency
column2: PID

set 0 threads
       1 5906
       1 5908
       2 5912
       1 5914
       1 5916
       1 5918
       1 5926
       2 5928
       3 5932
       3 5938
       1 5940
       1 5944
       2 5948
       2 5950
       1 5956
       1 5962
       1 5974
       1 5978
       1 5992
       1 6004
       1 6006
       1 6008
       1 6012
       3 6014
       1 6016
       2 6026
       2 6028
       1 6030

set 1 threads
       3 5907
       5 5909
       2 5911
       4 5913
       7 5915
       2 5917
       3 5919
       5 5921
       3 5923
       2 5925
       4 5927
       4 5929
       3 5931
       3 5933
       2 5935
       7 5937
       4 5939
       3 5941
       4 5943
       1 5945
       2 5947
       1 5949
       1 5951
       2 5953
       1 5955
       1 5957
       1 5959
       1 5961
       1 5963
       1 5965
       1 5967
       1 5969
       1 5971
       2 5975
       2 5979
       2 5981
       2 5983
       5 5993
       5 5995
      11 5997
       1 5999
       6 6003
       4 6005
       3 6007
       1 6013
       1 6015
       3 6017
       1 6019
       1 6021
       1 6023
       6 6027
       4 6029
       4 6031
       1 6033

PS: I have also tested the above with the V3 patch applied (which
incorporates your suggestions) and have not seen much deviation in the
observations.

Thanks
- Raghu


* Re: [PATCH V2 2/3] sched/numa: Enhance vma scanning logic
  2023-02-27  6:40         ` Raghavendra K T
@ 2023-02-27 10:06           ` Peter Zijlstra
  2023-02-27 10:12             ` Raghavendra K T
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2023-02-27 10:06 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: linux-kernel, linux-mm, Ingo Molnar, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja

On Mon, Feb 27, 2023 at 12:10:41PM +0530, Raghavendra K T wrote:
> In summary: I do see that access to VMAs from disjoint sets is not fully
>  fair, But on the other hand it is not very bad too. There is definitely
> some scope or possibility to explore/improve fairness in this area
> further.

Ok, might be good to summarize some of this in a comment near here, so
that readers are aware of the caveat of this code.

> PS: I have also tested above applying V3 patch (which incorporates your
> suggestions), have not seen much deviation in observation with patch.

I'll see if I can find it in this dumpster fire I call inbox :-)


* Re: [PATCH V2 2/3] sched/numa: Enhance vma scanning logic
  2023-02-27 10:06           ` Peter Zijlstra
@ 2023-02-27 10:12             ` Raghavendra K T
  0 siblings, 0 replies; 17+ messages in thread
From: Raghavendra K T @ 2023-02-27 10:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Ingo Molnar, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja

On 2/27/2023 3:36 PM, Peter Zijlstra wrote:
> On Mon, Feb 27, 2023 at 12:10:41PM +0530, Raghavendra K T wrote:
>> In summary: I do see that access to VMAs from disjoint sets is not fully
>>   fair, But on the other hand it is not very bad too. There is definitely
>> some scope or possibility to explore/improve fairness in this area
>> further.
> 
> Ok, might be good to summarize some of this in a comment near here, so
> that readers are aware of the caveat of this code.
> 

Sure will do.

>> PS: I have also tested above applying V3 patch (which incorporates your
>> suggestions), have not seen much deviation in observation with patch.
> 
> I'll see if I can find it in this dumpster fire I call inbox :-)

Sorry, I wasn't clear there.. the V3 is not posted yet and I am about to
post it. It was a heads up.



* Re: [PATCH V2 2/3] sched/numa: Enhance vma scanning logic
  2023-02-04 18:14     ` Raghavendra K T
  2023-02-07  6:41       ` Raghavendra K T
@ 2023-02-28  4:59       ` Raghavendra K T
  1 sibling, 0 replies; 17+ messages in thread
From: Raghavendra K T @ 2023-02-28  4:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Ingo Molnar, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Bharata B Rao, Disha Talreja

On 2/4/2023 11:44 PM, Raghavendra K T wrote:
> On 2/3/2023 4:45 PM, Peter Zijlstra wrote:
>> On Wed, Feb 01, 2023 at 01:32:21PM +0530, Raghavendra K T wrote:
[...]
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 8c8420934d60..2ec3045cb8b3 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4718,6 +4718,7 @@ static vm_fault_t do_numa_page(struct vm_fault 
>>> *vmf)
>>>       bool was_writable = pte_savedwrite(vmf->orig_pte);
>>>       int flags = 0;
>>> +    vma_set_active_pid_bit(vma);
>>>       /*
>>>        * The "pte" at this point cannot be used safely without
>>>        * validation through pte_unmap_same(). It's of NUMA type but
>>
>> Urghh... do_*numa_page() is two near identical functions.. is there
>> really no sane way to de-duplicate at least some of that?
>>
> 
> Agree. I will explore and will take that as a separate TODO.
> 

I did spend some time looking at whether there is a better way of merging
these two.
The code looks similar, as you noted, with very subtle differences (pte vs
pmd and the difference in unlocking).
Only some parts of the code can be changed easily; changing the whole
thing did not look worthwhile as of now (unless we come up with a better
idea).

This is the only comment that is perhaps not addressed properly, I feel :(

Thanks

