linux-mm.kvack.org archive mirror
* [PATCH 0/2] Refactor do_fault_around()
@ 2023-03-17 21:58 Lorenzo Stoakes
  2023-03-17 21:58 ` [PATCH 1/2] mm: refactor do_fault_around() Lorenzo Stoakes
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Lorenzo Stoakes @ 2023-03-17 21:58 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton
  Cc: David Hildenbrand, Matthew Wilcox, Vlastimil Babka, Lorenzo Stoakes

Refactor do_fault_around() to avoid bitwise tricks and rather difficult to
follow logic.  Additionally, prefer fault_around_pages to
fault_around_bytes, as the operations are performed at base page
granularity.

I have run this code against a small program I wrote which generates
significant input data and compares output with the original function to
ensure that it behaves the same as the old code across varying vmf, vma and
fault_around_pages inputs.

Lorenzo Stoakes (2):
  mm: refactor do_fault_around()
  mm: prefer fault_around_pages to fault_around_bytes

 mm/memory.c | 62 ++++++++++++++++++++++++++---------------------------
 1 file changed, 30 insertions(+), 32 deletions(-)

--
2.39.2


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/2] mm: refactor do_fault_around()
  2023-03-17 21:58 [PATCH 0/2] Refactor do_fault_around() Lorenzo Stoakes
@ 2023-03-17 21:58 ` Lorenzo Stoakes
  2023-03-17 21:58 ` [PATCH 2/2] mm: prefer fault_around_pages to fault_around_bytes Lorenzo Stoakes
  2023-03-17 23:39 ` [PATCH 0/2] Refactor do_fault_around() Andrew Morton
  2 siblings, 0 replies; 6+ messages in thread
From: Lorenzo Stoakes @ 2023-03-17 21:58 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton
  Cc: David Hildenbrand, Matthew Wilcox, Vlastimil Babka, Lorenzo Stoakes

The existing logic is confusing and fails to abstract a number of bitwise
tricks.

Use ALIGN_DOWN() to perform alignment, pte_index() to obtain a PTE index
and represent the address range using PTE offsets, which naturally makes it
clear that the operation is intended to occur within a single page table
and cannot span more than one.

We rely on the fact that fault_around_bytes will always be page-aligned, at
least one page in size, a power of two, and that it will not exceed
PAGE_SIZE * PTRS_PER_PTE in size (i.e. the address space mapped by a single
page table).  These are all guaranteed by fault_around_bytes_set().

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
---
 mm/memory.c | 38 ++++++++++++++++----------------------
 1 file changed, 16 insertions(+), 22 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index c5f1bf906d0c..3d85aa7106b0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4437,8 +4437,8 @@ late_initcall(fault_around_debugfs);
  * It uses vm_ops->map_pages() to map the pages, which skips the page if it's
  * not ready to be mapped: not up-to-date, locked, etc.
  *
- * This function doesn't cross the VMA boundaries, in order to call map_pages()
- * only once.
+ * This function doesn't cross VMA or page table boundaries, in order to call
+ * map_pages() and acquire a PTE lock only once.
  *
  * fault_around_bytes defines how many bytes we'll try to map.
  * do_fault_around() expects it to be set to a power of two less than or equal
@@ -4451,27 +4451,19 @@ late_initcall(fault_around_debugfs);
  */
 static vm_fault_t do_fault_around(struct vm_fault *vmf)
 {
-	unsigned long address = vmf->address, nr_pages, mask;
-	pgoff_t start_pgoff = vmf->pgoff;
-	pgoff_t end_pgoff;
-	int off;
+	pgoff_t nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
+	pgoff_t pte_off = pte_index(vmf->address);
+	/* The page offset of vmf->address within the VMA. */
+	pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff;
+	pgoff_t from_pte, to_pte;
 
-	nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
-	mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;
+	/* The PTE offset of the start address, clamped to the VMA. */
+	from_pte = max(ALIGN_DOWN(pte_off, nr_pages),
+		       pte_off - min(pte_off, vma_off));
 
-	address = max(address & mask, vmf->vma->vm_start);
-	off = ((vmf->address - address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
-	start_pgoff -= off;
-
-	/*
-	 *  end_pgoff is either the end of the page table, the end of
-	 *  the vma or nr_pages from start_pgoff, depending what is nearest.
-	 */
-	end_pgoff = start_pgoff -
-		((address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
-		PTRS_PER_PTE - 1;
-	end_pgoff = min3(end_pgoff, vma_pages(vmf->vma) + vmf->vma->vm_pgoff - 1,
-			start_pgoff + nr_pages - 1);
+	/* The PTE offset of the end address, clamped to the VMA and PTE. */
+	to_pte = min3(from_pte + nr_pages, (pgoff_t)PTRS_PER_PTE,
+		      pte_off + vma_pages(vmf->vma) - vma_off) - 1;
 
 	if (pmd_none(*vmf->pmd)) {
 		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm);
@@ -4479,7 +4471,9 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
 			return VM_FAULT_OOM;
 	}
 
-	return vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);
+	return vmf->vma->vm_ops->map_pages(vmf,
+		vmf->pgoff + from_pte - pte_off,
+		vmf->pgoff + to_pte - pte_off);
 }
 
 /* Return true if we should do read fault-around, false otherwise */
-- 
2.39.2




* [PATCH 2/2] mm: prefer fault_around_pages to fault_around_bytes
  2023-03-17 21:58 [PATCH 0/2] Refactor do_fault_around() Lorenzo Stoakes
  2023-03-17 21:58 ` [PATCH 1/2] mm: refactor do_fault_around() Lorenzo Stoakes
@ 2023-03-17 21:58 ` Lorenzo Stoakes
  2023-03-17 23:39 ` [PATCH 0/2] Refactor do_fault_around() Andrew Morton
  2 siblings, 0 replies; 6+ messages in thread
From: Lorenzo Stoakes @ 2023-03-17 21:58 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton
  Cc: David Hildenbrand, Matthew Wilcox, Vlastimil Babka, Lorenzo Stoakes

All uses of this value are now at page granularity, so specify the variable
in those terms too.  This simplifies the logic.

We maintain the debugfs entry to ensure that there are no user-visible
changes.

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
---
 mm/memory.c | 26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 3d85aa7106b0..ae01f541ad30 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4393,13 +4393,13 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 	return ret;
 }
 
-static unsigned long fault_around_bytes __read_mostly =
-	rounddown_pow_of_two(65536);
+static unsigned long fault_around_pages __read_mostly =
+	65536 >> PAGE_SHIFT;
 
 #ifdef CONFIG_DEBUG_FS
 static int fault_around_bytes_get(void *data, u64 *val)
 {
-	*val = fault_around_bytes;
+	*val = fault_around_pages << PAGE_SHIFT;
 	return 0;
 }
 
@@ -4411,10 +4411,13 @@ static int fault_around_bytes_set(void *data, u64 val)
 {
 	if (val / PAGE_SIZE > PTRS_PER_PTE)
 		return -EINVAL;
-	if (val > PAGE_SIZE)
-		fault_around_bytes = rounddown_pow_of_two(val);
-	else
-		fault_around_bytes = PAGE_SIZE; /* rounddown_pow_of_two(0) is undefined */
+
+	/*
+	 * The minimum value is 1 page, however this results in no fault-around
+	 * at all. See should_fault_around().
+	 */
+	fault_around_pages = max(rounddown_pow_of_two(val) >> PAGE_SHIFT, 1UL);
+
 	return 0;
 }
 DEFINE_DEBUGFS_ATTRIBUTE(fault_around_bytes_fops,
@@ -4440,18 +4443,18 @@ late_initcall(fault_around_debugfs);
  * This function doesn't cross VMA or page table boundaries, in order to call
  * map_pages() and acquire a PTE lock only once.
  *
- * fault_around_bytes defines how many bytes we'll try to map.
+ * fault_around_pages defines how many pages we'll try to map.
  * do_fault_around() expects it to be set to a power of two less than or equal
  * to PTRS_PER_PTE.
  *
  * The virtual address of the area that we map is naturally aligned to
- * fault_around_bytes rounded down to the machine page size
+ * fault_around_pages * PAGE_SIZE rounded down to the machine page size
  * (and therefore to page order).  This way it's easier to guarantee
  * that we don't cross page table boundaries.
  */
 static vm_fault_t do_fault_around(struct vm_fault *vmf)
 {
-	pgoff_t nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
+	pgoff_t nr_pages = READ_ONCE(fault_around_pages);
 	pgoff_t pte_off = pte_index(vmf->address);
 	/* The page offset of vmf->address within the VMA. */
 	pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff;
@@ -4486,7 +4489,8 @@ static inline bool should_fault_around(struct vm_fault *vmf)
 	if (uffd_disable_fault_around(vmf->vma))
 		return false;
 
-	return fault_around_bytes >> PAGE_SHIFT > 1;
+	/* A single page implies no faulting 'around' at all. */
+	return fault_around_pages > 1;
 }
 
 static vm_fault_t do_read_fault(struct vm_fault *vmf)
-- 
2.39.2




* Re: [PATCH 0/2] Refactor do_fault_around()
  2023-03-17 21:58 [PATCH 0/2] Refactor do_fault_around() Lorenzo Stoakes
  2023-03-17 21:58 ` [PATCH 1/2] mm: refactor do_fault_around() Lorenzo Stoakes
  2023-03-17 21:58 ` [PATCH 2/2] mm: prefer fault_around_pages to fault_around_bytes Lorenzo Stoakes
@ 2023-03-17 23:39 ` Andrew Morton
  2023-03-17 23:48   ` Lorenzo Stoakes
  2 siblings, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2023-03-17 23:39 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, David Hildenbrand, Matthew Wilcox,
	Vlastimil Babka

On Fri, 17 Mar 2023 21:58:24 +0000 Lorenzo Stoakes <lstoakes@gmail.com> wrote:

> Refactor do_fault_around() to avoid bitwise tricks and rather difficult to
> follow logic.  Additionally, prefer fault_around_pages to
> fault_around_bytes as the operations are performed at a base page
> granularity.
> 
> I have run this code against a small program I wrote which generates
> significant input data and compares output with the original function to
> ensure that it behaves the same as the old code across varying vmf, vma and
> fault_around_pages inputs.

Well, what changes were you looking for in that testing? 
do_fault_around() could become a no-op and most tests wouldn't notice. 
What we'd be looking for to test these changes is performance
differences.

Perhaps one could add a tracepoint to do_fault_around()'s call to
->map_pages, check that the before-and-after traces are identical.


Or, if you're old school and lazy,

	if (!strcmp(current->comm, "name-of-my-test-program"))
		printk("name-of-my-test-program: %lu %lu\n",
			start_pgoff, end_pgoff);

then grep-n-diff the dmesg output.



* Re: [PATCH 0/2] Refactor do_fault_around()
  2023-03-17 23:39 ` [PATCH 0/2] Refactor do_fault_around() Andrew Morton
@ 2023-03-17 23:48   ` Lorenzo Stoakes
  2023-03-17 23:59     ` Andrew Morton
  0 siblings, 1 reply; 6+ messages in thread
From: Lorenzo Stoakes @ 2023-03-17 23:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, David Hildenbrand, Matthew Wilcox,
	Vlastimil Babka

On Fri, Mar 17, 2023 at 04:39:36PM -0700, Andrew Morton wrote:
> On Fri, 17 Mar 2023 21:58:24 +0000 Lorenzo Stoakes <lstoakes@gmail.com> wrote:
>
> > Refactor do_fault_around() to avoid bitwise tricks and rather difficult to
> > follow logic.  Additionally, prefer fault_around_pages to
> > fault_around_bytes as the operations are performed at a base page
> > granularity.
> >
> > I have run this code against a small program I wrote which generates
> > significant input data and compares output with the original function to
> > ensure that it behaves the same as the old code across varying vmf, vma and
> > fault_around_pages inputs.
>
> Well, what changes were you looking for in that testing?

That there was no functional change between this refactoring and the existing
logic.

> do_fault_around() could become a no-op and most tests wouldn't notice.
> What we'd be looking for to test these changes is performance
> differences.
>
> Perhaps one could add a tracepoint to do_fault_around()'s call to
> ->map_pages, check that the before-and-after traces are identical.
>
>
> Or, if you're old school and lazy,
>
> 	if (!strcmp(current->comm, "name-of-my-test-program"))
> 		printk("name-of-my-test-program: %lu %lu\n",
> 			start_pgoff, end_pgoff)
>
> then grep-n-diff the dmesg output.

I am both old school and lazy, but I went so far as to literally copy/paste
the existing code and my speculative change into a userland program, generate
a whole host of random but sensible input data, and compare the output of the
new logic against that of the original en masse.

This function is actually quite nice for testability, as its inputs and
outputs are limited in scope: the inputs are vmf->address, vmf->pgoff,
vmf->vma->vm_start/vm_end and vmf->vma->vm_pgoff, and the outputs are
start_pgoff and end_pgoff.

I could attach said program but it's some quite iffy C++ code that might horrify
small children and adults alike...

I am more than happy to do performance testing to see if there is any
impact, should you require it.



* Re: [PATCH 0/2] Refactor do_fault_around()
  2023-03-17 23:48   ` Lorenzo Stoakes
@ 2023-03-17 23:59     ` Andrew Morton
  0 siblings, 0 replies; 6+ messages in thread
From: Andrew Morton @ 2023-03-17 23:59 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, David Hildenbrand, Matthew Wilcox,
	Vlastimil Babka

On Fri, 17 Mar 2023 23:48:17 +0000 Lorenzo Stoakes <lstoakes@gmail.com> wrote:

> I went so far as to literally copy/paste
> the existing code and my speculative change to a userland program, generate a
> whole host of random sensible input data and compare output data with this and
> the original logic en masse.

Ah, great, all good, thanks.


