* [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory
@ 2023-04-14 13:02 Ryan Roberts
  2023-04-14 13:02 ` [RFC v2 PATCH 01/17] mm: Expose clear_huge_page() unconditionally Ryan Roberts
                   ` (19 more replies)
  0 siblings, 20 replies; 44+ messages in thread
From: Ryan Roberts @ 2023-04-14 13:02 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao, Yin, Fengwei
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

Hi All,

This is a second RFC and my first proper attempt at implementing variable-order,
large folios for anonymous memory. The first RFC [1] was a partial
implementation and a plea for help in debugging an issue I was hitting; thanks
to Yin Fengwei and Matthew Wilcox for their advice in solving that!

The objective of variable order anonymous folios is to improve performance by
allocating larger chunks of memory during anonymous page faults:

 - Since SW (the kernel) is dealing with larger chunks of memory than base
   pages, there are efficiency savings to be had: fewer page faults, batched PTE
   and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
   overhead. This should benefit all architectures.
 - Since we are now mapping physically contiguous chunks of memory, we can take
   advantage of HW TLB compression techniques. A reduction in TLB pressure
   speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
   TLB entries: "the contiguous bit" (architectural) and HPA (uarch) - see [2].

This patch set deals with the SW side of things only but sets us up nicely for
taking advantage of the HW improvements in the near future.

I'm not yet benchmarking a wide variety of use cases, but those that I have
looked at are positive; I see kernel compilation time improved by up to 10%,
which I expect to improve further once I add in the arm64 "contiguous bit".
Memory consumption is somewhere between 1% less and 2% more, depending on how
it's measured. More on perf and memory below.

The patches are based on v6.3-rc6 + patches 1-31 of [3] (which needed one minor
conflict resolution). I have a tree at [4].

[1] https://lore.kernel.org/linux-mm/20230317105802.2634004-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/d347c5b0-0c0f-ae50-9613-2cf962d8676e@arm.com/
[3] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
[4] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc2

Approach
========

There are 4 fault paths that have been modified:
 - write fault on unallocated address: do_anonymous_page()
 - write fault on zero page: wp_page_copy()
 - write fault on non-exclusive CoW page: wp_page_copy()
 - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()

In the first 2 cases, we determine the preferred folio order to allocate,
limited by a max order (currently order-4; see below), the VMA and PMD bounds,
and the state of the neighboring PTEs. In the 3rd case, we aim to allocate a
folio of the same order as the source, subject to constraints that may arise if
the source has been mremapped or partially munmapped. And in the 4th case, we
reuse as much of the folio as we can, subject to the same mremap/munmap
constraints.

If allocation of our preferred folio order fails, we gracefully fall back to
lower orders all the way to 0.
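As a rough sketch of the fallback behaviour described above (the series
implements this as try_vma_alloc_movable_folio() in patch 3; the gfp details
are simplified here and the helper name alloc_anon_folio() is illustrative
only):

static struct folio *alloc_anon_folio(struct vm_area_struct *vma,
				      unsigned long vaddr, int order)
{
	struct folio *folio;

	/* Skip order-1; the THP machinery needs at least order-2. */
	for (; order > 1; order--) {
		/* Don't reclaim hard for high orders; just drop down. */
		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_NORETRY |
					__GFP_NOWARN, order, vma, vaddr, false);
		if (folio)
			return folio;
	}

	/* Final order-0 attempt with the normal gfp flags. */
	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vaddr, false);
}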

Note that none of this affects the behavior of traditional PMD-sized THP; a
fault in an MADV_HUGEPAGE region still gets a PMD-sized mapping.

Open Questions
==============

How to Move Forwards
--------------------

While the series is a small-ish code change, it represents a big shift in the
way things are done. So I'd appreciate any help in scaling up performance
testing, review and general advice on how best to guide a change like this into
the kernel.

Folio Allocation Order Policy
-----------------------------

The current code is hardcoded to use a maximum order of 4. This was chosen for a
few reasons:
 - From the SW performance perspective, I see a knee around here where
   increasing it doesn't lead to much more performance gain.
 - Intuitively I assume that higher orders become increasingly difficult to
   allocate.
 - From the HW performance perspective, arm64's HPA works on order-2 blocks and
   "the contiguous bit" works on order-4 for 4KB base pages (although it's
   order-7 for 16KB and order-5 for 64KB; i.e. 64KB, 2MB and 2MB contiguous
   blocks respectively), so there is no HW benefit to going any higher.

I suggest that ultimately setting the max order should be left to the
architecture. arm64 would take advantage of this and set it to the order
required for the contiguous bit for the configured base page size.
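As a sketch of what that could look like (assuming the ANON_FOLIO_ORDER_MAX
definition from patch 5 were made arch-overridable; CONT_PTE_SHIFT is arm64's
existing definition of the contiguous-bit span):

/* arch/arm64: select the contiguous-bit order for the base page size;
 * order-4 for 4KB pages, order-7 for 16KB, order-5 for 64KB. */
#define ANON_FOLIO_ORDER_MAX	(CONT_PTE_SHIFT - PAGE_SHIFT)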

However, I also have a (mild) concern about increased memory consumption. If an
app has a pathological fault pattern (e.g. sparsely touches memory every 64KB),
we would end up allocating 16x as much memory as before. One potential approach
is to track fault addresses per-VMA: increase a per-VMA max allocation order for
consecutive faults that extend a contiguous range, and decrease it when faults
are discontiguous. Alternatively/additionally, we could use the VMA size as an
indicator. I'd be interested in your thoughts/opinions.
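To make that concrete, the sort of thing I have in mind is sketched below; the
struct, field and helper names are hypothetical and nothing like this exists in
the series yet:

/* Hypothetical per-VMA state; would hang off struct vm_area_struct. */
struct anon_folio_policy {
	unsigned long next_addr;	/* predicted next fault address */
	int order;			/* current max order for this vma */
};

/*
 * Called on each anonymous fault: grow the allowed order while faults
 * extend a contiguous range, shrink it when they are discontiguous.
 */
static int anon_folio_order_hint(struct anon_folio_policy *p,
				 unsigned long addr)
{
	if (addr == p->next_addr && p->order < ANON_FOLIO_ORDER_MAX)
		p->order++;
	else if (addr != p->next_addr && p->order > 0)
		p->order--;

	p->next_addr = addr + (PAGE_SIZE << p->order);
	return p->order;
}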

Deferred Split Queue Lock Contention
------------------------------------

The results below show that we are spending a much greater proportion of time in
the kernel when doing a kernel compile using 160 CPUs vs 8 CPUs.

I think this is (at least partially) related to contention on the deferred
split queue lock. This is a per-memcg spinlock, which means a single spinlock
shared among all 160 CPUs. I've solved part of the problem with the last patch
in the series (which cuts down the need to take the lock), but at folio free
time (free_transhuge_page()), the lock is still taken and I think this could be
a problem. Now that most anonymous pages are large folios, this lock is taken a
lot more.

I think we could probably avoid taking the lock unless !list_empty(), but I
haven't convinced myself it's definitely safe, so haven't applied it yet.
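For reference, the change I have in mind is roughly the following (a sketch
against approximately the v6.3 shape of free_transhuge_page(); the unlocked
list_empty() check is exactly the part I haven't convinced myself is safe):

void free_transhuge_page(struct page *page)
{
	struct folio *folio = (struct folio *)page;
	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
	unsigned long flags;

	/*
	 * Only take the per-memcg lock if the folio appears to be on the
	 * deferred split queue; most anon folios never are.
	 */
	if (!list_empty(&folio->_deferred_list)) {
		spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
		if (!list_empty(&folio->_deferred_list)) {
			ds_queue->split_queue_len--;
			list_del(&folio->_deferred_list);
		}
		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
	}

	free_compound_page(page);
}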

Roadmap
=======

Beyond scaling up perf testing, I'm planning to enable use of the "contiguous
bit" on arm64 to validate predictions about HW speedups.

I also think there are some opportunities with madvise to split folios to non-0
orders, which might improve performance in some cases. madvise also currently
mistakes exclusive large folios for non-exclusive ones (due to the "small pages"
mapcount scheme), so that needs fixing in order for MADV_FREE to correctly free
the folio.

Results
=======

Performance
-----------

Test: Kernel Compilation, on Ampere Altra (160 CPU machine), with 8 jobs and
with 160 jobs. First run discarded, next 3 runs averaged. Git repo cleaned
before each run.

make defconfig && time make -jN Image

First with -j8:

|           | baseline time  | anonfolio time | percent change |
|           | to compile (s) | to compile (s) | SMALLER=better |
|-----------|---------------:|---------------:|---------------:|
| real-time |          373.0 |          342.8 |          -8.1% |
| user-time |         2333.9 |         2275.3 |          -2.5% |
| sys-time  |          510.7 |          340.9 |         -33.3% |

The above shows an 8.1% improvement in real time and a 33.3% saving in kernel
execution time. The next 2 tables show a breakdown of the cycles spent in the
kernel for the 8-job config:

|                      | baseline | anonfolio | percent change |
|                      | (cycles) | (cycles)  | SMALLER=better |
|----------------------|---------:|----------:|---------------:|
| data abort           |     683B |      316B |         -53.8% |
| instruction abort    |      93B |       76B |         -18.4% |
| syscall              |     887B |      767B |         -13.6% |

|                      | baseline | anonfolio | percent change |
|                      | (cycles) | (cycles)  | SMALLER=better |
|----------------------|---------:|----------:|---------------:|
| arm64_sys_openat     |     194B |      188B |          -3.3% |
| arm64_sys_exit_group |     192B |      124B |         -35.7% |
| arm64_sys_read       |     124B |      108B |         -12.7% |
| arm64_sys_execve     |      75B |       67B |         -11.0% |
| arm64_sys_mmap       |      51B |       50B |          -3.0% |
| arm64_sys_mprotect   |      15B |       13B |         -12.0% |
| arm64_sys_write      |      43B |       42B |          -2.9% |
| arm64_sys_munmap     |      15B |       12B |         -17.0% |
| arm64_sys_newfstatat |      46B |       41B |          -9.7% |
| arm64_sys_clone      |      26B |       24B |         -10.0% |

And now with -j160:

|           | baseline time  | anonfolio time | percent change |
|           | to compile (s) | to compile (s) | SMALLER=better |
|-----------|---------------:|---------------:|---------------:|
| real-time |           53.7 |           48.2 |         -10.2% |
| user-time |         2705.8 |         2842.1 |           5.0% |
| sys-time  |         1370.4 |         1064.3 |         -22.3% |

The above shows a 10.2% improvement in real time, but ~3x more time is spent in
the kernel than for the -j8 config. I think this is related to the lock
contention issue I highlighted above, but I haven't bottomed it out yet. It's
also not yet clear to me why user-time increases by 5%.

I've also run all the will-it-scale microbenchmarks for a single task, using the
process mode. Results for multiple runs on the same kernel are noisy - I see ~5%
fluctuation. So I'm only calling out tests whose results show more than a 5%
improvement or more than a 5% regression. Results are the average of 3 runs.
Only 2 tests regressed:

| benchmark            | baseline | anonfolio | percent change |
|                      | ops/s    | ops/s     | BIGGER=better  |
| ---------------------|---------:|----------:|---------------:|
| context_switch1.csv  |   328744 |    351150 |          6.8%  |
| malloc1.csv          |    96214 |     50890 |        -47.1%  |
| mmap1.csv            |   410253 |    375746 |         -8.4%  |
| page_fault1.csv      |   624061 |   3185678 |        410.5%  |
| page_fault2.csv      |   416483 |    557448 |         33.8%  |
| page_fault3.csv      |   724566 |   1152726 |         59.1%  |
| read1.csv            |  1806908 |   1905752 |          5.5%  |
| read2.csv            |   587722 |   1942062 |        230.4%  |
| tlb_flush1.csv       |   143910 |    152097 |          5.7%  |
| tlb_flush2.csv       |   266763 |    322320 |         20.8%  |

I believe malloc1 is an unrealistic test, since it does malloc/free for a 128M
object in a loop and never touches the allocated memory. I think the malloc
implementation maintains a header just before the allocated object, which
causes a single page fault. Previously that page fault allocated 1 page; now it
allocates 16 pages. This cost would be repaid if the test code wrote to the
allocated object. Alternatively, the folio allocation order policy described
above would also solve this.

It is not clear to me why mmap1 has slowed down. This remains a todo.

Memory
------

I measured memory consumption while doing a kernel compile with 8 jobs on a
system limited to 4GB of RAM. I polled /proc/meminfo every 0.5 seconds during
the workload, then calculated "memory used" high and low watermarks using both
MemFree and MemAvailable. If there is a better way of measuring system memory
consumption, please let me know!
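For anyone wanting to reproduce this, the measurement is equivalent to something
like the following userspace sketch (illustrative only, not the exact tool I
used; it tracks the MemFree variant, the MemAvailable one is analogous):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const long total_kb = 4L * 1024 * 1024;		/* 4GB system */
	long lo_kb = total_kb, hi_kb = 0;

	for (;;) {
		FILE *f = fopen("/proc/meminfo", "r");
		char line[128];
		long free_kb = -1;

		while (f && fgets(line, sizeof(line), f)) {
			if (sscanf(line, "MemFree: %ld kB", &free_kb) == 1)
				break;
		}
		if (f)
			fclose(f);

		if (free_kb >= 0) {
			long used_kb = total_kb - free_kb;

			if (used_kb < lo_kb)
				lo_kb = used_kb;
			if (used_kb > hi_kb)
				hi_kb = used_kb;
			printf("mem-used low=%ldMB high=%ldMB\n",
			       lo_kb / 1024, hi_kb / 1024);
		}

		usleep(500000);				/* poll every 0.5s */
	}

	return 0;
}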

mem-used = 4GB - /proc/meminfo:MemFree

|                      | baseline | anonfolio | percent change |
|                      | (MB)     | (MB)      | SMALLER=better |
|----------------------|---------:|----------:|---------------:|
| mem-used-low         |      825 |       842 |           2.1% |
| mem-used-high        |     2697 |      2672 |          -0.9% |

mem-used = 4GB - /proc/meminfo:MemAvailable

|                      | baseline | anonfolio | percent change |
|                      | (MB)     | (MB)      | SMALLER=better |
|----------------------|---------:|----------:|---------------:|
| mem-used-low         |      518 |       530 |           2.3% |
| mem-used-high        |     1522 |      1537 |           1.0% |

For the high watermark, the methods disagree; we are either saving 1% or using
1% more. For the low watermark, both methods agree that we are using about 2%
more. I plan to investigate whether the proposed folio allocation order policy
can reduce this to zero.

Thanks for making it this far!
Ryan


Ryan Roberts (17):
  mm: Expose clear_huge_page() unconditionally
  mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
  mm: Introduce try_vma_alloc_movable_folio()
  mm: Implement folio_add_new_anon_rmap_range()
  mm: Routines to determine max anon folio allocation order
  mm: Allocate large folios for anonymous memory
  mm: Allow deferred splitting of arbitrary large anon folios
  mm: Implement folio_move_anon_rmap_range()
  mm: Update wp_page_reuse() to operate on range of pages
  mm: Reuse large folios for anonymous memory
  mm: Split __wp_page_copy_user() into 2 variants
  mm: ptep_clear_flush_range_notify() macro for batch operation
  mm: Implement folio_remove_rmap_range()
  mm: Copy large folios for anonymous memory
  mm: Convert zero page to large folios on write
  mm: mmap: Align unhinted maps to highest anon folio order
  mm: Batch-zap large anonymous folio PTE mappings

 arch/alpha/include/asm/page.h   |   5 +-
 arch/arm64/include/asm/page.h   |   3 +-
 arch/arm64/mm/fault.c           |   7 +-
 arch/ia64/include/asm/page.h    |   5 +-
 arch/m68k/include/asm/page_no.h |   7 +-
 arch/s390/include/asm/page.h    |   5 +-
 arch/x86/include/asm/page.h     |   5 +-
 include/linux/highmem.h         |  23 +-
 include/linux/mm.h              |   8 +-
 include/linux/mmu_notifier.h    |  31 ++
 include/linux/rmap.h            |   6 +
 mm/memory.c                     | 877 ++++++++++++++++++++++++++++----
 mm/mmap.c                       |   4 +-
 mm/rmap.c                       | 147 +++++-
 14 files changed, 1000 insertions(+), 133 deletions(-)

--
2.25.1




* [RFC v2 PATCH 01/17] mm: Expose clear_huge_page() unconditionally
  2023-04-14 13:02 [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Ryan Roberts
@ 2023-04-14 13:02 ` Ryan Roberts
  2023-04-14 13:02 ` [RFC v2 PATCH 02/17] mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio() Ryan Roberts
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ryan Roberts @ 2023-04-14 13:02 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao, Yin, Fengwei
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

In preparation for extending vma_alloc_zeroed_movable_folio() to
allocate an arbitrary order folio, expose clear_huge_page()
unconditionally, so that it can be used to zero the allocated folio.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/mm.h | 3 ++-
 mm/memory.c        | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1f79667824eb..cdb8c6031d0f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3538,10 +3538,11 @@ enum mf_action_page_type {
  */
 extern const struct attribute_group memory_failure_attr_group;

-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 extern void clear_huge_page(struct page *page,
 			    unsigned long addr_hint,
 			    unsigned int pages_per_huge_page);
+
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 extern void copy_user_huge_page(struct page *dst, struct page *src,
 				unsigned long addr_hint,
 				struct vm_area_struct *vma,
diff --git a/mm/memory.c b/mm/memory.c
index 01a23ad48a04..3e2eee8c66a7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5642,7 +5642,6 @@ void __might_fault(const char *file, int line)
 EXPORT_SYMBOL(__might_fault);
 #endif

-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 /*
  * Process all subpages of the specified huge page with the specified
  * operation.  The target subpage will be processed last to keep its
@@ -5730,6 +5729,8 @@ void clear_huge_page(struct page *page,
 	process_huge_page(addr_hint, pages_per_huge_page, clear_subpage, page);
 }

+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+
 static void copy_user_gigantic_page(struct page *dst, struct page *src,
 				    unsigned long addr,
 				    struct vm_area_struct *vma,
--
2.25.1




* [RFC v2 PATCH 02/17] mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
  2023-04-14 13:02 [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Ryan Roberts
  2023-04-14 13:02 ` [RFC v2 PATCH 01/17] mm: Expose clear_huge_page() unconditionally Ryan Roberts
@ 2023-04-14 13:02 ` Ryan Roberts
  2023-04-14 13:02 ` [RFC v2 PATCH 03/17] mm: Introduce try_vma_alloc_movable_folio() Ryan Roberts
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ryan Roberts @ 2023-04-14 13:02 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao, Yin, Fengwei
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

Allow allocation of large folios with vma_alloc_zeroed_movable_folio().
This prepares the ground for large anonymous folios. The generic
implementation of vma_alloc_zeroed_movable_folio() now uses
clear_huge_page() to zero the allocated folio since it may now be a
non-0 order.

Currently the function is always called with order 0 and no extra gfp
flags, so no functional change intended.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/alpha/include/asm/page.h   |  5 +++--
 arch/arm64/include/asm/page.h   |  3 ++-
 arch/arm64/mm/fault.c           |  7 ++++---
 arch/ia64/include/asm/page.h    |  5 +++--
 arch/m68k/include/asm/page_no.h |  7 ++++---
 arch/s390/include/asm/page.h    |  5 +++--
 arch/x86/include/asm/page.h     |  5 +++--
 include/linux/highmem.h         | 23 +++++++++++++----------
 mm/memory.c                     |  5 +++--
 9 files changed, 38 insertions(+), 27 deletions(-)

diff --git a/arch/alpha/include/asm/page.h b/arch/alpha/include/asm/page.h
index 4db1ebc0ed99..6fc7fe91b6cb 100644
--- a/arch/alpha/include/asm/page.h
+++ b/arch/alpha/include/asm/page.h
@@ -17,8 +17,9 @@
 extern void clear_page(void *page);
 #define clear_user_page(page, vaddr, pg)	clear_page(page)

-#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
-	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
+#define vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order) \
+	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO | (gfp), \
+			order, vma, vaddr, false)

 extern void copy_page(void * _to, void * _from);
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 2312e6ee595f..47710852f872 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -30,7 +30,8 @@ void copy_highpage(struct page *to, struct page *from);
 #define __HAVE_ARCH_COPY_HIGHPAGE

 struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
-						unsigned long vaddr);
+						unsigned long vaddr,
+						gfp_t gfp, int order);
 #define vma_alloc_zeroed_movable_folio vma_alloc_zeroed_movable_folio

 void tag_clear_highpage(struct page *to);
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index f4cb0f85ccf4..3b4cc04f7a23 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -926,9 +926,10 @@ NOKPROBE_SYMBOL(do_debug_exception);
  * Used during anonymous page fault handling.
  */
 struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
-						unsigned long vaddr)
+						unsigned long vaddr,
+						gfp_t gfp, int order)
 {
-	gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO;
+	gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO | gfp;

 	/*
 	 * If the page is mapped with PROT_MTE, initialise the tags at the
@@ -938,7 +939,7 @@ struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
 	if (vma->vm_flags & VM_MTE)
 		flags |= __GFP_ZEROTAGS;

-	return vma_alloc_folio(flags, 0, vma, vaddr, false);
+	return vma_alloc_folio(flags, order, vma, vaddr, false);
 }

 void tag_clear_highpage(struct page *page)
diff --git a/arch/ia64/include/asm/page.h b/arch/ia64/include/asm/page.h
index 310b09c3342d..ebdf04274023 100644
--- a/arch/ia64/include/asm/page.h
+++ b/arch/ia64/include/asm/page.h
@@ -82,10 +82,11 @@ do {						\
 } while (0)


-#define vma_alloc_zeroed_movable_folio(vma, vaddr)			\
+#define vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order)		\
 ({									\
 	struct folio *folio = vma_alloc_folio(				\
-		GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false); \
+		GFP_HIGHUSER_MOVABLE | __GFP_ZERO | (gfp),		\
+		order, vma, vaddr, false);				\
 	if (folio)							\
 		flush_dcache_folio(folio);				\
 	folio;								\
diff --git a/arch/m68k/include/asm/page_no.h b/arch/m68k/include/asm/page_no.h
index 060e4c0e7605..4a2fe57fef5e 100644
--- a/arch/m68k/include/asm/page_no.h
+++ b/arch/m68k/include/asm/page_no.h
@@ -3,7 +3,7 @@
 #define _M68K_PAGE_NO_H

 #ifndef __ASSEMBLY__
-
+
 extern unsigned long memory_start;
 extern unsigned long memory_end;

@@ -13,8 +13,9 @@ extern unsigned long memory_end;
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

-#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
-	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
+#define vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order) \
+	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO | (gfp), \
+			order, vma, vaddr, false)

 #define __pa(vaddr)		((unsigned long)(vaddr))
 #define __va(paddr)		((void *)((unsigned long)(paddr)))
diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h
index 8a2a3b5d1e29..b749564140f1 100644
--- a/arch/s390/include/asm/page.h
+++ b/arch/s390/include/asm/page.h
@@ -73,8 +73,9 @@ static inline void copy_page(void *to, void *from)
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

-#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
-	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
+#define vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order) \
+	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO | (gfp), \
+			order, vma, vaddr, false)

 /*
  * These are used to make use of C type-checking..
diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index d18e5c332cb9..34deab1a8dae 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -34,8 +34,9 @@ static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
 	copy_page(to, from);
 }

-#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
-	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
+#define vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order) \
+	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO | (gfp), \
+			order, vma, vaddr, false)

 #ifndef __pa
 #define __pa(x)		__phys_addr((unsigned long)(x))
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 8fc10089e19e..54e68deae5ef 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -209,26 +209,29 @@ static inline void clear_user_highpage(struct page *page, unsigned long vaddr)

 #ifndef vma_alloc_zeroed_movable_folio
 /**
- * vma_alloc_zeroed_movable_folio - Allocate a zeroed page for a VMA.
- * @vma: The VMA the page is to be allocated for.
- * @vaddr: The virtual address the page will be inserted into.
- *
- * This function will allocate a page suitable for inserting into this
- * VMA at this virtual address.  It may be allocated from highmem or
+ * vma_alloc_zeroed_movable_folio - Allocate a zeroed folio for a VMA.
+ * @vma: The start VMA the folio is to be allocated for.
+ * @vaddr: The virtual address the folio will be inserted into.
+ * @gfp: Additional gfp flags to mix in or 0.
+ * @order: The order of the folio (2^order pages).
+ *
+ * This function will allocate a folio suitable for inserting into this
+ * VMA starting at this virtual address.  It may be allocated from highmem or
  * the movable zone.  An architecture may provide its own implementation.
  *
- * Return: A folio containing one allocated and zeroed page or NULL if
+ * Return: A folio containing 2^order allocated and zeroed pages or NULL if
  * we are out of memory.
  */
 static inline
 struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
-				   unsigned long vaddr)
+				   unsigned long vaddr, gfp_t gfp, int order)
 {
 	struct folio *folio;

-	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vaddr, false);
+	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE | gfp,
+					order, vma, vaddr, false);
 	if (folio)
-		clear_user_highpage(&folio->page, vaddr);
+		clear_huge_page(&folio->page, vaddr, 1U << order);

 	return folio;
 }
diff --git a/mm/memory.c b/mm/memory.c
index 3e2eee8c66a7..9d5e8be49f3b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3061,7 +3061,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		goto oom;

 	if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
-		new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+		new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address,
+									0, 0);
 		if (!new_folio)
 			goto oom;
 	} else {
@@ -4063,7 +4064,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	/* Allocate our own private page. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address, 0, 0);
 	if (!folio)
 		goto oom;

--
2.25.1




* [RFC v2 PATCH 03/17] mm: Introduce try_vma_alloc_movable_folio()
  2023-04-14 13:02 [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Ryan Roberts
  2023-04-14 13:02 ` [RFC v2 PATCH 01/17] mm: Expose clear_huge_page() unconditionally Ryan Roberts
  2023-04-14 13:02 ` [RFC v2 PATCH 02/17] mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio() Ryan Roberts
@ 2023-04-14 13:02 ` Ryan Roberts
  2023-04-17  8:49   ` Yin, Fengwei
  2023-04-14 13:02 ` [RFC v2 PATCH 04/17] mm: Implement folio_add_new_anon_rmap_range() Ryan Roberts
                   ` (16 subsequent siblings)
  19 siblings, 1 reply; 44+ messages in thread
From: Ryan Roberts @ 2023-04-14 13:02 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao, Yin, Fengwei
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

Opportunistically attempt to allocate high-order folios in highmem,
optionally zeroed. Retry with lower orders all the way to order-0, until
success. Note that order-1 allocations are skipped since a large folio
must be at least order-2 to work with the THP machinery. The user must
check what they got with folio_order().

This will be used to opportunistically allocate large folios for
anonymous memory with a sensible fallback under memory pressure.

For attempts to allocate non-0 orders, we set __GFP_NORETRY to prevent
high latency due to reclaim, instead preferring to just try for a lower
order. The same approach is used by the readahead code when allocating
large folios.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/memory.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 9d5e8be49f3b..ca32f59acef2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2989,6 +2989,39 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
 	return 0;
 }

+static inline struct folio *vma_alloc_movable_folio(struct vm_area_struct *vma,
+				unsigned long vaddr, int order, bool zeroed)
+{
+	gfp_t gfp = order > 0 ? __GFP_NORETRY | __GFP_NOWARN : 0;
+
+	if (zeroed)
+		return vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order);
+	else
+		return vma_alloc_folio(GFP_HIGHUSER_MOVABLE | gfp, order, vma,
+								vaddr, false);
+}
+
+/*
+ * Opportunistically attempt to allocate high-order folios, retrying with lower
+ * orders all the way to order-0, until success. order-1 allocations are skipped
+ * since a folio must be at least order-2 to work with the THP machinery. The
+ * user must check what they got with folio_order(). vaddr can be any virtual
+ * address that will be mapped by the allocated folio.
+ */
+static struct folio *try_vma_alloc_movable_folio(struct vm_area_struct *vma,
+				unsigned long vaddr, int order, bool zeroed)
+{
+	struct folio *folio;
+
+	for (; order > 1; order--) {
+		folio = vma_alloc_movable_folio(vma, vaddr, order, zeroed);
+		if (folio)
+			return folio;
+	}
+
+	return vma_alloc_movable_folio(vma, vaddr, 0, zeroed);
+}
+
 /*
  * Handle write page faults for pages that can be reused in the current vma
  *
--
2.25.1




* [RFC v2 PATCH 04/17] mm: Implement folio_add_new_anon_rmap_range()
  2023-04-14 13:02 [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Ryan Roberts
                   ` (2 preceding siblings ...)
  2023-04-14 13:02 ` [RFC v2 PATCH 03/17] mm: Introduce try_vma_alloc_movable_folio() Ryan Roberts
@ 2023-04-14 13:02 ` Ryan Roberts
  2023-04-14 13:02 ` [RFC v2 PATCH 05/17] mm: Routines to determine max anon folio allocation order Ryan Roberts
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ryan Roberts @ 2023-04-14 13:02 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao, Yin, Fengwei
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

Like folio_add_new_anon_rmap() but batch-rmaps a range of pages
belonging to a folio, for efficiency savings. All pages are accounted as
small pages.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/rmap.h |  2 ++
 mm/rmap.c            | 43 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b87d01660412..5c707f53d7b5 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -196,6 +196,8 @@ void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long address);
 void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
 		unsigned long address);
+void folio_add_new_anon_rmap_range(struct folio *folio, struct page *page,
+		int nr, struct vm_area_struct *vma, unsigned long address);
 void page_add_file_rmap(struct page *, struct vm_area_struct *,
 		bool compound);
 void page_remove_rmap(struct page *, struct vm_area_struct *,
diff --git a/mm/rmap.c b/mm/rmap.c
index 8632e02661ac..d563d979c005 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1302,6 +1302,49 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 	__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
 }

+/**
+ * folio_add_new_anon_rmap_range - Add mapping to a set of pages within a new
+ * anonymous potentially large folio.
+ * @folio:      The folio containing the pages to be mapped
+ * @page:       First page in the folio to be mapped
+ * @nr:         Number of pages to be mapped
+ * @vma:        the vm area in which the mapping is added
+ * @address:    the user virtual address of the first page to be mapped
+ *
+ * Like folio_add_new_anon_rmap() but batch-maps a range of pages within a folio
+ * using non-THP accounting. Like folio_add_new_anon_rmap(), the inc-and-test is
+ * bypassed and the folio does not have to be locked. All pages in the folio are
+ * individually accounted.
+ *
+ * As the folio is new, it's assumed to be mapped exclusively by a single
+ * process.
+ */
+void folio_add_new_anon_rmap_range(struct folio *folio, struct page *page,
+		int nr, struct vm_area_struct *vma, unsigned long address)
+{
+	int i;
+
+	VM_BUG_ON_VMA(address < vma->vm_start ||
+		      address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
+	__folio_set_swapbacked(folio);
+
+	if (folio_test_large(folio)) {
+		/* increment count (starts at 0) */
+		atomic_set(&folio->_nr_pages_mapped, nr);
+	}
+
+	for (i = 0; i < nr; i++) {
+		/* increment count (starts at -1) */
+		atomic_set(&page->_mapcount, 0);
+		__page_set_anon_rmap(folio, page, vma, address, 1);
+		page++;
+		address += PAGE_SIZE;
+	}
+
+	__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
+
+}
+
 /**
  * page_add_file_rmap - add pte mapping to a file page
  * @page:	the page to add the mapping to
--
2.25.1




* [RFC v2 PATCH 05/17] mm: Routines to determine max anon folio allocation order
  2023-04-14 13:02 [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Ryan Roberts
                   ` (3 preceding siblings ...)
  2023-04-14 13:02 ` [RFC v2 PATCH 04/17] mm: Implement folio_add_new_anon_rmap_range() Ryan Roberts
@ 2023-04-14 13:02 ` Ryan Roberts
  2023-04-14 14:09   ` Kirill A. Shutemov
  2023-04-14 13:02 ` [RFC v2 PATCH 06/17] mm: Allocate large folios for anonymous memory Ryan Roberts
                   ` (14 subsequent siblings)
  19 siblings, 1 reply; 44+ messages in thread
From: Ryan Roberts @ 2023-04-14 13:02 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao, Yin, Fengwei
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

For variable-order anonymous folios, we want to tune the order that we
prefer to allocate based on the vma. Add the routines to manage that
heuristic.

TODO: Currently we always use the global maximum. Add per-vma logic!

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/mm.h | 5 +++++
 mm/memory.c        | 8 ++++++++
 2 files changed, 13 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cdb8c6031d0f..cc8d0b239116 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3674,4 +3674,9 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
 }
 #endif

+/*
+ * TODO: Should this be set per-architecture?
+ */
+#define ANON_FOLIO_ORDER_MAX	4
+
 #endif /* _LINUX_MM_H */
diff --git a/mm/memory.c b/mm/memory.c
index ca32f59acef2..d7e34a8c46aa 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3022,6 +3022,14 @@ static struct folio *try_vma_alloc_movable_folio(struct vm_area_struct *vma,
 	return vma_alloc_movable_folio(vma, vaddr, 0, zeroed);
 }

+static inline int max_anon_folio_order(struct vm_area_struct *vma)
+{
+	/*
+	 * TODO: Policy for maximum folio order should likely be per-vma.
+	 */
+	return ANON_FOLIO_ORDER_MAX;
+}
+
 /*
  * Handle write page faults for pages that can be reused in the current vma
  *
--
2.25.1




* [RFC v2 PATCH 06/17] mm: Allocate large folios for anonymous memory
  2023-04-14 13:02 [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Ryan Roberts
                   ` (4 preceding siblings ...)
  2023-04-14 13:02 ` [RFC v2 PATCH 05/17] mm: Routines to determine max anon folio allocation order Ryan Roberts
@ 2023-04-14 13:02 ` Ryan Roberts
  2023-04-14 13:02 ` [RFC v2 PATCH 07/17] mm: Allow deferred splitting of arbitrary large anon folios Ryan Roberts
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ryan Roberts @ 2023-04-14 13:02 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao, Yin, Fengwei
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

Add the machinery to determine what order of folio to allocate within
do_anonymous_page() and deal with racing faults to the same region.

For now, the maximum order is set to 4. This should probably be set
per-vma based on factors such as VMA size and fault pattern, and
adjusted dynamically.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/memory.c | 154 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 138 insertions(+), 16 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index d7e34a8c46aa..f92a28064596 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3030,6 +3030,90 @@ static inline int max_anon_folio_order(struct vm_area_struct *vma)
 	return ANON_FOLIO_ORDER_MAX;
 }

+/*
+ * Returns index of first pte that is not none, or nr if all are none.
+ */
+static inline int check_ptes_none(pte_t *pte, int nr)
+{
+	int i;
+
+	for (i = 0; i < nr; i++) {
+		if (!pte_none(*pte++))
+			return i;
+	}
+
+	return nr;
+}
+
+static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
+{
+	/*
+	 * The aim here is to determine what size of folio we should allocate
+	 * for this fault. Factors include:
+	 * - Order must not be higher than `order` upon entry
+	 * - Folio must be naturally aligned within VA space
+	 * - Folio must not breach boundaries of vma
+	 * - Folio must be fully contained inside one pmd entry
+	 * - Folio must not overlap any non-none ptes
+	 *
+	 * Additionally, we do not allow order-1 since this breaks assumptions
+	 * elsewhere in the mm; THP pages must be at least order-2 (since they
+	 * store state up to the 3rd struct page subpage), and these pages must
+	 * be THP in order to correctly use pre-existing THP infrastructure such
+	 * as folio_split().
+	 *
+	 * As a consequence of relying on the THP infrastructure, if the system
+	 * does not support THP, we always fallback to order-0.
+	 *
+	 * Note that the caller may or may not choose to lock the pte. If
+	 * unlocked, the calculation should be considered an estimate that will
+	 * need to be validated under the lock.
+	 */
+
+	struct vm_area_struct *vma = vmf->vma;
+	int nr;
+	unsigned long addr;
+	pte_t *pte;
+	pte_t *first_set = NULL;
+	int ret;
+
+	if (has_transparent_hugepage()) {
+		order = min(order, PMD_SHIFT - PAGE_SHIFT);
+
+		for (; order > 1; order--) {
+			nr = 1 << order;
+			addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
+			pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
+
+			/* Check vma bounds. */
+			if (addr < vma->vm_start ||
+			    addr + (nr << PAGE_SHIFT) > vma->vm_end)
+				continue;
+
+			/* Ptes covered by order already known to be none. */
+			if (pte + nr <= first_set)
+				break;
+
+			/* Already found set pte in range covered by order. */
+			if (pte <= first_set)
+				continue;
+
+			/* Need to check if all the ptes are none. */
+			ret = check_ptes_none(pte, nr);
+			if (ret == nr)
+				break;
+
+			first_set = pte + ret;
+		}
+
+		if (order == 1)
+			order = 0;
+	} else
+		order = 0;
+
+	return order;
+}
+
 /*
  * Handle write page faults for pages that can be reused in the current vma
  *
@@ -4058,6 +4142,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	struct folio *folio;
 	vm_fault_t ret = 0;
 	pte_t entry;
+	unsigned long addr;
+	int order = max_anon_folio_order(vma);
+	int pgcount = BIT(order);

 	/* File mapping without ->vm_ops ? */
 	if (vma->vm_flags & VM_SHARED)
@@ -4099,24 +4186,42 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
 			return handle_userfault(vmf, VM_UFFD_MISSING);
 		}
-		goto setpte;
+		set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+
+		/* No need to invalidate - it was non-present before */
+		update_mmu_cache(vma, vmf->address, vmf->pte);
+		goto unlock;
 	}

-	/* Allocate our own private page. */
+retry:
+	/*
+	 * Estimate the folio order to allocate. We are not under the ptl here
+	 * so this estimate needs to be re-checked later once we have the lock.
+	 */
+	vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+	order = calc_anon_folio_order_alloc(vmf, order);
+	pte_unmap(vmf->pte);
+
+	/* Allocate our own private folio. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address, 0, 0);
+	folio = try_vma_alloc_movable_folio(vma, vmf->address, order, true);
 	if (!folio)
 		goto oom;

+	/* We may have been granted less than we asked for. */
+	order = folio_order(folio);
+	pgcount = BIT(order);
+	addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
+
 	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
 		goto oom_free_page;
-	cgroup_throttle_swaprate(&folio->page, GFP_KERNEL);
+	folio_throttle_swaprate(folio, GFP_KERNEL);

 	/*
 	 * The memory barrier inside __folio_mark_uptodate makes sure that
-	 * preceding stores to the page contents become visible before
-	 * the set_pte_at() write.
+	 * preceding stores to the folio contents become visible before
+	 * the set_ptes() write.
 	 */
 	__folio_mark_uptodate(folio);

@@ -4125,11 +4230,26 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	if (vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry));

-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
-	if (!pte_none(*vmf->pte)) {
-		update_mmu_tlb(vma, vmf->address, vmf->pte);
-		goto release;
+	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
+
+	/*
+	 * Ensure our estimate above is still correct; we could have raced with
+	 * another thread to service a fault in the region.
+	 */
+	if (unlikely(check_ptes_none(vmf->pte, pgcount) != pgcount)) {
+		pte_t *pte = vmf->pte + ((vmf->address - addr) >> PAGE_SHIFT);
+
+		/* If faulting pte was allocated by another, exit early. */
+		if (order == 0 || !pte_none(*pte)) {
+			update_mmu_tlb(vma, vmf->address, pte);
+			goto release;
+		}
+
+		/* Else try again, with a lower order. */
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		folio_put(folio);
+		order--;
+		goto retry;
 	}

 	ret = check_stable_address_space(vma->vm_mm);
@@ -4143,14 +4263,16 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		return handle_userfault(vmf, VM_UFFD_MISSING);
 	}

-	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
-	folio_add_new_anon_rmap(folio, vma, vmf->address);
+	folio_ref_add(folio, pgcount - 1);
+
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
+	folio_add_new_anon_rmap_range(folio, &folio->page, pgcount, vma, addr);
 	folio_add_lru_vma(folio, vma);
-setpte:
-	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+
+	set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);

 	/* No need to invalidate - it was non-present before */
-	update_mmu_cache(vma, vmf->address, vmf->pte);
+	update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
 unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	return ret;
--
2.25.1




* [RFC v2 PATCH 07/17] mm: Allow deferred splitting of arbitrary large anon folios
  2023-04-14 13:02 [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Ryan Roberts
                   ` (5 preceding siblings ...)
  2023-04-14 13:02 ` [RFC v2 PATCH 06/17] mm: Allocate large folios for anonymous memory Ryan Roberts
@ 2023-04-14 13:02 ` Ryan Roberts
  2023-04-14 13:02 ` [RFC v2 PATCH 08/17] mm: Implement folio_move_anon_rmap_range() Ryan Roberts
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ryan Roberts @ 2023-04-14 13:02 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao, Yin, Fengwei
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

With the introduction of large folios for anonymous memory, we would
like to be able to split them when they have unmapped subpages, in order
to free those unused pages under memory pressure. So remove the
artificial requirement that the large folio be at least PMD-sized.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/rmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index d563d979c005..5148a484f915 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1470,7 +1470,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
 		 * page of the folio is unmapped and at least one page
 		 * is still mapped.
 		 */
-		if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
+		if (folio_test_large(folio) && folio_test_anon(folio))
 			if (!compound || nr < nr_pmdmapped)
 				deferred_split_folio(folio);
 	}
--
2.25.1




* [RFC v2 PATCH 08/17] mm: Implement folio_move_anon_rmap_range()
  2023-04-14 13:02 [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Ryan Roberts
                   ` (6 preceding siblings ...)
  2023-04-14 13:02 ` [RFC v2 PATCH 07/17] mm: Allow deferred splitting of arbitrary large anon folios Ryan Roberts
@ 2023-04-14 13:02 ` Ryan Roberts
  2023-04-14 13:02 ` [RFC v2 PATCH 09/17] mm: Update wp_page_reuse() to operate on range of pages Ryan Roberts
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ryan Roberts @ 2023-04-14 13:02 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao, Yin, Fengwei
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

Similar to page_move_anon_rmap() except it can batch-move a range of
pages within a folio for increased efficiency. Will be used to enable
reusing multiple pages from a large anonymous folio in one go.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/rmap.h |  2 ++
 mm/rmap.c            | 40 ++++++++++++++++++++++++++++++----------
 2 files changed, 32 insertions(+), 10 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 5c707f53d7b5..8cb0ba48d58f 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -190,6 +190,8 @@ typedef int __bitwise rmap_t;
  * rmap interfaces called when adding or removing pte of page
  */
 void page_move_anon_rmap(struct page *, struct vm_area_struct *);
+void folio_move_anon_rmap_range(struct folio *folio, struct page *page,
+		int nr, struct vm_area_struct *vma);
 void page_add_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long address, rmap_t flags);
 void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
diff --git a/mm/rmap.c b/mm/rmap.c
index 5148a484f915..1cd8fb0b929f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1103,19 +1103,22 @@ int folio_total_mapcount(struct folio *folio)
 }

 /**
- * page_move_anon_rmap - move a page to our anon_vma
- * @page:	the page to move to our anon_vma
- * @vma:	the vma the page belongs to
+ * folio_move_anon_rmap_range - batch-move a range of pages within a folio to
+ * our anon_vma; a more efficient version of page_move_anon_rmap().
+ * @folio:      folio that owns the range of pages
+ * @page:       the first page to move to our anon_vma
+ * @nr:         number of pages to move to our anon_vma
+ * @vma:        the vma the page belongs to
  *
- * When a page belongs exclusively to one process after a COW event,
- * that page can be moved into the anon_vma that belongs to just that
- * process, so the rmap code will not search the parent or sibling
- * processes.
+ * When a range of pages belongs exclusively to one process after a COW event,
+ * those pages can be moved into the anon_vma that belongs to just that process,
+ * so the rmap code will not search the parent or sibling processes.
  */
-void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma)
+void folio_move_anon_rmap_range(struct folio *folio, struct page *page,
+					int nr, struct vm_area_struct *vma)
 {
 	void *anon_vma = vma->anon_vma;
-	struct folio *folio = page_folio(page);
+	int i;

 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_VMA(!anon_vma, vma);
@@ -1127,7 +1130,24 @@ void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma)
 	 * folio_test_anon()) will not see one without the other.
 	 */
 	WRITE_ONCE(folio->mapping, anon_vma);
-	SetPageAnonExclusive(page);
+
+	for (i = 0; i < nr; i++)
+		SetPageAnonExclusive(page++);
+}
+
+/**
+ * page_move_anon_rmap - move a page to our anon_vma
+ * @page:	the page to move to our anon_vma
+ * @vma:	the vma the page belongs to
+ *
+ * When a page belongs exclusively to one process after a COW event,
+ * that page can be moved into the anon_vma that belongs to just that
+ * process, so the rmap code will not search the parent or sibling
+ * processes.
+ */
+void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma)
+{
+	folio_move_anon_rmap_range(page_folio(page), page, 1, vma);
 }

 /**
--
2.25.1




* [RFC v2 PATCH 09/17] mm: Update wp_page_reuse() to operate on range of pages
  2023-04-14 13:02 [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Ryan Roberts
                   ` (7 preceding siblings ...)
  2023-04-14 13:02 ` [RFC v2 PATCH 08/17] mm: Implement folio_move_anon_rmap_range() Ryan Roberts
@ 2023-04-14 13:02 ` Ryan Roberts
  2023-04-14 13:02 ` [RFC v2 PATCH 10/17] mm: Reuse large folios for anonymous memory Ryan Roberts
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ryan Roberts @ 2023-04-14 13:02 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao, Yin, Fengwei
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

We will shortly be updating do_wp_page() to be able to reuse a range of
pages from a large anon folio. As an enabling step, modify
wp_page_reuse() to operate on a range of pages if a struct
anon_folio_range is passed in. This allows us to batch up the cache
maintenance and event counting for a small performance improvement.

Currently all callsites pass range=NULL, so no functional change is
intended.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/memory.c | 80 +++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 60 insertions(+), 20 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index f92a28064596..83835ff5a818 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3030,6 +3030,14 @@ static inline int max_anon_folio_order(struct vm_area_struct *vma)
 	return ANON_FOLIO_ORDER_MAX;
 }

+struct anon_folio_range {
+	unsigned long va_start;
+	pte_t *pte_start;
+	struct page *pg_start;
+	int nr;
+	bool exclusive;
+};
+
 /*
  * Returns index of first pte that is not none, or nr if all are none.
  */
@@ -3122,31 +3130,63 @@ static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
  * case, all we need to do here is to mark the page as writable and update
  * any related book-keeping.
  */
-static inline void wp_page_reuse(struct vm_fault *vmf)
+static inline void wp_page_reuse(struct vm_fault *vmf,
+					struct anon_folio_range *range)
 	__releases(vmf->ptl)
 {
 	struct vm_area_struct *vma = vmf->vma;
-	struct page *page = vmf->page;
+	unsigned long addr;
+	pte_t *pte;
+	struct page *page;
+	int nr;
 	pte_t entry;
+	int change = 0;
+	int i;

 	VM_BUG_ON(!(vmf->flags & FAULT_FLAG_WRITE));
-	VM_BUG_ON(page && PageAnon(page) && !PageAnonExclusive(page));

-	/*
-	 * Clear the pages cpupid information as the existing
-	 * information potentially belongs to a now completely
-	 * unrelated process.
-	 */
-	if (page)
-		page_cpupid_xchg_last(page, (1 << LAST_CPUPID_SHIFT) - 1);
+	if (range) {
+		addr = range->va_start;
+		pte = range->pte_start;
+		page = range->pg_start;
+		nr = range->nr;
+	} else {
+		addr = vmf->address;
+		pte = vmf->pte;
+		page = vmf->page;
+		nr = 1;
+	}
+
+	if (page) {
+		for (i = 0; i < nr; i++, page++) {
+			VM_BUG_ON(PageAnon(page) && !PageAnonExclusive(page));
+
+			/*
+			 * Clear the pages cpupid information as the existing
+			 * information potentially belongs to a now completely
+			 * unrelated process.
+			 */
+			page_cpupid_xchg_last(page,
+					(1 << LAST_CPUPID_SHIFT) - 1);
+		}
+	}
+
+	flush_cache_range(vma, addr, addr + (nr << PAGE_SHIFT));
+
+	for (i = 0; i < nr; i++) {
+		entry = pte_mkyoung(pte[i]);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		change |= ptep_set_access_flags(vma,
+					addr + (i << PAGE_SHIFT),
+					pte + i,
+					entry, 1);
+	}
+
+	if (change)
+		update_mmu_cache_range(vma, addr, pte, nr);

-	flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
-	entry = pte_mkyoung(vmf->orig_pte);
-	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-	if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
-		update_mmu_cache(vma, vmf->address, vmf->pte);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
-	count_vm_event(PGREUSE);
+	count_vm_events(PGREUSE, nr);
 }

 /*
@@ -3359,7 +3399,7 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		return VM_FAULT_NOPAGE;
 	}
-	wp_page_reuse(vmf);
+	wp_page_reuse(vmf, NULL);
 	return 0;
 }

@@ -3381,7 +3421,7 @@ static vm_fault_t wp_pfn_shared(struct vm_fault *vmf)
 			return ret;
 		return finish_mkwrite_fault(vmf);
 	}
-	wp_page_reuse(vmf);
+	wp_page_reuse(vmf, NULL);
 	return 0;
 }

@@ -3410,7 +3450,7 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf)
 			return tmp;
 		}
 	} else {
-		wp_page_reuse(vmf);
+		wp_page_reuse(vmf, NULL);
 		lock_page(vmf->page);
 	}
 	ret |= fault_dirty_shared_page(vmf);
@@ -3534,7 +3574,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
 			return 0;
 		}
-		wp_page_reuse(vmf);
+		wp_page_reuse(vmf, NULL);
 		return 0;
 	}
 copy:
--
2.25.1




* [RFC v2 PATCH 10/17] mm: Reuse large folios for anonymous memory
  2023-04-14 13:02 [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Ryan Roberts
                   ` (8 preceding siblings ...)
  2023-04-14 13:02 ` [RFC v2 PATCH 09/17] mm: Update wp_page_reuse() to operate on range of pages Ryan Roberts
@ 2023-04-14 13:02 ` Ryan Roberts
  2023-04-14 13:02 ` [RFC v2 PATCH 11/17] mm: Split __wp_page_copy_user() into 2 variants Ryan Roberts
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ryan Roberts @ 2023-04-14 13:02 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao, Yin, Fengwei
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

When taking a write fault on an anonymous page, attempt to reuse as much
of the folio as possible if it is exclusive to the process.

This avoids a problem where an exclusive, PTE-mapped THP would
previously have all of its pages except the last one CoWed, then the
last page would be reused, causing the whole original folio to hang
around as well as all the CoWed pages. This problem is exacerbated now
that we are allocating variable-order folios for anonymous memory. The
reason for this behaviour is that a PTE-mapped THP has a reference for
each PTE and the old code thought that meant it was not exclusively
mapped, and therefore could not be reused.

We now take care to find the region that intersects the underlying
folio, the VMA and the PMD entry, and treat the presence of exactly that
number of references as indicating exclusivity. Note that we are not
guaranteed that this region will cover the whole folio due to munmap and
mremap.

The aim is to reuse as much as possible in one go in order to:
- reduce memory consumption
- reduce number of CoWs
- reduce time spent in fault handler

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/memory.c | 169 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 160 insertions(+), 9 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 83835ff5a818..7e2af54fe2e0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3038,6 +3038,26 @@ struct anon_folio_range {
 	bool exclusive;
 };

+static inline unsigned long page_addr(struct page *page,
+				struct page *anchor, unsigned long anchor_addr)
+{
+	unsigned long offset;
+	unsigned long addr;
+
+	offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
+	addr = anchor_addr + offset;
+
+	if (anchor > page) {
+		if (addr > anchor_addr)
+			return 0;
+	} else {
+		if (addr < anchor_addr)
+			return ULONG_MAX;
+	}
+
+	return addr;
+}
+
 /*
  * Returns index of first pte that is not none, or nr if all are none.
  */
@@ -3122,6 +3142,122 @@ static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
 	return order;
 }

+static void calc_anon_folio_range_reuse(struct vm_fault *vmf,
+					struct folio *folio,
+					struct anon_folio_range *range_out)
+{
+	/*
+	 * The aim here is to determine the biggest range of pages that can be
+	 * reused for this CoW fault if the identified range is responsible for
+	 * all the references on the folio (i.e. it is exclusive) such that:
+	 * - All pages are contained within folio
+	 * - All pages are within VMA
+	 * - All pages are within the same pmd entry as vmf->address
+	 * - vmf->page is contained within the range
+	 * - All covered ptes must be present, physically contiguous and RO
+	 *
+	 * Note that the folio itself may not be naturally aligned in VA space
+	 * due to mremap. We take the largest range we can in order to increase
+	 * our chances of being the exclusive user of the folio, therefore
+	 * meaning we can reuse. It's possible that the folio crosses a pmd
+	 * boundary, in which case we don't follow it into the next pte because
+	 * this complicates the locking.
+	 *
+	 * Note that the caller may or may not choose to lock the pte. If
+	 * unlocked, the calculation should be considered an estimate that will
+	 * need to be validated under the lock.
+	 */
+
+	struct vm_area_struct *vma = vmf->vma;
+	struct page *page;
+	pte_t *ptep;
+	pte_t pte;
+	bool excl = true;
+	unsigned long start, end;
+	int bloops, floops;
+	int i;
+	unsigned long pfn;
+
+	/*
+	 * Iterate backwards, starting with the page immediately before the
+	 * anchor page. On exit from the loop, start is the inclusive start
+	 * virtual address of the range.
+	 */
+
+	start = page_addr(&folio->page, vmf->page, vmf->address);
+	start = max(start, vma->vm_start);
+	start = max(start, ALIGN_DOWN(vmf->address, PMD_SIZE));
+	bloops = (vmf->address - start) >> PAGE_SHIFT;
+
+	page = vmf->page - 1;
+	ptep = vmf->pte - 1;
+	pfn = page_to_pfn(vmf->page) - 1;
+
+	for (i = 0; i < bloops; i++) {
+		pte = *ptep;
+
+		if (!pte_present(pte) ||
+		    pte_write(pte) ||
+		    pte_protnone(pte) ||
+		    pte_pfn(pte) != pfn) {
+			start = vmf->address - (i << PAGE_SHIFT);
+			break;
+		}
+
+		if (excl && !PageAnonExclusive(page))
+			excl = false;
+
+		pfn--;
+		ptep--;
+		page--;
+	}
+
+	/*
+	 * Iterate forward, starting with the anchor page. On exit from the
+	 * loop, end is the exclusive end virtual address of the range.
+	 */
+
+	end = page_addr(&folio->page + folio_nr_pages(folio),
+			vmf->page, vmf->address);
+	end = min(end, vma->vm_end);
+	end = min(end, ALIGN_DOWN(vmf->address, PMD_SIZE) + PMD_SIZE);
+	floops = (end - vmf->address) >> PAGE_SHIFT;
+
+	page = vmf->page;
+	ptep = vmf->pte;
+	pfn = page_to_pfn(vmf->page);
+
+	for (i = 0; i < floops; i++) {
+		pte = *ptep;
+
+		if (!pte_present(pte) ||
+		    pte_write(pte) ||
+		    pte_protnone(pte) ||
+		    pte_pfn(pte) != pfn) {
+			end = vmf->address + (i << PAGE_SHIFT);
+			break;
+		}
+
+		if (excl && !PageAnonExclusive(page))
+			excl = false;
+
+		pfn++;
+		ptep++;
+		page++;
+	}
+
+	/*
+	 * Fixup vmf to point to the start of the range, and return number of
+	 * pages in range.
+	 */
+
+	range_out->va_start = start;
+	range_out->pg_start = vmf->page - ((vmf->address - start) >> PAGE_SHIFT);
+	range_out->pte_start = vmf->pte - ((vmf->address - start) >> PAGE_SHIFT);
+	range_out->nr = (end - start) >> PAGE_SHIFT;
+	range_out->exclusive = excl;
+}
+
 /*
  * Handle write page faults for pages that can be reused in the current vma
  *
@@ -3528,13 +3664,23 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 	/*
 	 * Private mapping: create an exclusive anonymous page copy if reuse
 	 * is impossible. We might miss VM_WRITE for FOLL_FORCE handling.
+	 * For anonymous memory, we attempt to copy/reuse in folios rather than
+	 * page-by-page. We always prefer reuse above copy, even if we can only
+	 * reuse a subset of the folio. Note that when reusing pages in a folio,
+	 * due to munmap, mremap and friends, the folio isn't guaranteed to be
+	 * naturally aligned in virtual memory space.
 	 */
 	if (folio && folio_test_anon(folio)) {
+		struct anon_folio_range range;
+		int swaprefs;
+
+		calc_anon_folio_range_reuse(vmf, folio, &range);
+
 		/*
-		 * If the page is exclusive to this process we must reuse the
-		 * page without further checks.
+		 * If the pages have already been proven to be exclusive to this
+		 * process we must reuse the pages without further checks.
 		 */
-		if (PageAnonExclusive(vmf->page))
+		if (range.exclusive)
 			goto reuse;

 		/*
@@ -3544,7 +3690,10 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 		 *
 		 * KSM doesn't necessarily raise the folio refcount.
 		 */
-		if (folio_test_ksm(folio) || folio_ref_count(folio) > 3)
+		swaprefs = folio_test_swapcache(folio) ?
+				folio_nr_pages(folio) : 0;
+		if (folio_test_ksm(folio) ||
+		    folio_ref_count(folio) > range.nr + swaprefs + 1)
 			goto copy;
 		if (!folio_test_lru(folio))
 			/*
@@ -3552,29 +3701,31 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 			 * remote LRU pagevecs or references to LRU folios.
 			 */
 			lru_add_drain();
-		if (folio_ref_count(folio) > 1 + folio_test_swapcache(folio))
+		if (folio_ref_count(folio) > range.nr + swaprefs)
 			goto copy;
 		if (!folio_trylock(folio))
 			goto copy;
 		if (folio_test_swapcache(folio))
 			folio_free_swap(folio);
-		if (folio_test_ksm(folio) || folio_ref_count(folio) != 1) {
+		if (folio_test_ksm(folio) ||
+		    folio_ref_count(folio) != range.nr) {
 			folio_unlock(folio);
 			goto copy;
 		}
 		/*
-		 * Ok, we've got the only folio reference from our mapping
+		 * Ok, we've got the only folio references from our mapping
 		 * and the folio is locked, it's dark out, and we're wearing
 		 * sunglasses. Hit it.
 		 */
-		page_move_anon_rmap(vmf->page, vma);
+		folio_move_anon_rmap_range(folio, range.pg_start,
+							range.nr, vma);
 		folio_unlock(folio);
 reuse:
 		if (unlikely(unshare)) {
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
 			return 0;
 		}
-		wp_page_reuse(vmf, NULL);
+		wp_page_reuse(vmf, &range);
 		return 0;
 	}
 copy:
--
2.25.1



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC v2 PATCH 11/17] mm: Split __wp_page_copy_user() into 2 variants
  2023-04-14 13:02 [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Ryan Roberts
                   ` (9 preceding siblings ...)
  2023-04-14 13:02 ` [RFC v2 PATCH 10/17] mm: Reuse large folios for anonymous memory Ryan Roberts
@ 2023-04-14 13:02 ` Ryan Roberts
  2023-04-14 13:02 ` [RFC v2 PATCH 12/17] mm: ptep_clear_flush_range_notify() macro for batch operation Ryan Roberts
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ryan Roberts @ 2023-04-14 13:02 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao, Yin, Fengwei
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

We will soon support CoWing large folios, so will need support for
copying a contiguous range of pages in the case where there is a source
folio. Therefore, split __wp_page_copy_user() into 2 variants:

__wp_page_copy_user_pfn() copies a single pfn to a destination page.
This is used when CoWing from a source without a folio, and is always
only a single page copy.

__wp_page_copy_user_range() copies a range of pages from source to
destination and is used when the source has an underlying folio. For now
it is only used to copy a single page, but this will change in a future
commit.

In both cases, kmsan_copy_page_meta() is moved into these helper
functions so that the caller does not need to be concerned with calling
it multiple times for the range case.

No functional changes intended.
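
A sketch of the resulting call site in wp_page_copy(), as introduced by
the hunk below (the range length is hard-coded to 1 for now and becomes
variable in a later patch):

	if (likely(old_folio))
		ret = __wp_page_copy_user_range(&new_folio->page, vmf->page,
						1, vmf->address, vma);
	else
		ret = __wp_page_copy_user_pfn(&new_folio->page, vmf);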

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/memory.c | 41 +++++++++++++++++++++++++++++------------
 1 file changed, 29 insertions(+), 12 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7e2af54fe2e0..f2b7cfb2efc0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2786,14 +2786,34 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
 	return same;
 }

+/*
+ * Return:
+ *	0:		copied succeeded
+ *	-EHWPOISON:	copy failed due to hwpoison in source page
+ */
+static inline int __wp_page_copy_user_range(struct page *dst, struct page *src,
+						int nr, unsigned long addr,
+						struct vm_area_struct *vma)
+{
+	for (; nr != 0; nr--, dst++, src++, addr += PAGE_SIZE) {
+		if (copy_mc_user_highpage(dst, src, addr, vma)) {
+			memory_failure_queue(page_to_pfn(src), 0);
+			return -EHWPOISON;
+		}
+		kmsan_copy_page_meta(dst, src);
+	}
+
+	return 0;
+}
+
 /*
  * Return:
  *	0:		copied succeeded
  *	-EHWPOISON:	copy failed due to hwpoison in source page
  *	-EAGAIN:	copied failed (some other reason)
  */
-static inline int __wp_page_copy_user(struct page *dst, struct page *src,
-				      struct vm_fault *vmf)
+static inline int __wp_page_copy_user_pfn(struct page *dst,
+						struct vm_fault *vmf)
 {
 	int ret;
 	void *kaddr;
@@ -2803,14 +2823,6 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long addr = vmf->address;

-	if (likely(src)) {
-		if (copy_mc_user_highpage(dst, src, addr, vma)) {
-			memory_failure_queue(page_to_pfn(src), 0);
-			return -EHWPOISON;
-		}
-		return 0;
-	}
-
 	/*
 	 * If the source page was a PFN mapping, we don't have
 	 * a "struct page" for it. We do a best-effort copy by
@@ -2879,6 +2891,7 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
 		}
 	}

+	kmsan_copy_page_meta(dst, NULL);
 	ret = 0;

 pte_unlock:
@@ -3372,7 +3385,12 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		if (!new_folio)
 			goto oom;

-		ret = __wp_page_copy_user(&new_folio->page, vmf->page, vmf);
+		if (likely(old_folio))
+			ret = __wp_page_copy_user_range(&new_folio->page,
+							vmf->page,
+							1, vmf->address, vma);
+		else
+			ret = __wp_page_copy_user_pfn(&new_folio->page, vmf);
 		if (ret) {
 			/*
 			 * COW failed, if the fault was solved by other,
@@ -3388,7 +3406,6 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 			delayacct_wpcopy_end();
 			return ret == -EHWPOISON ? VM_FAULT_HWPOISON : 0;
 		}
-		kmsan_copy_page_meta(&new_folio->page, vmf->page);
 	}

 	if (mem_cgroup_charge(new_folio, mm, GFP_KERNEL))
--
2.25.1



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC v2 PATCH 12/17] mm: ptep_clear_flush_range_notify() macro for batch operation
  2023-04-14 13:02 [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Ryan Roberts
                   ` (10 preceding siblings ...)
  2023-04-14 13:02 ` [RFC v2 PATCH 11/17] mm: Split __wp_page_copy_user() into 2 variants Ryan Roberts
@ 2023-04-14 13:02 ` Ryan Roberts
  2023-04-14 13:02 ` [RFC v2 PATCH 13/17] mm: Implement folio_remove_rmap_range() Ryan Roberts
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ryan Roberts @ 2023-04-14 13:02 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao, Yin, Fengwei
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

We will soon add support for CoWing large anonymous folios, so create a
ranged version of the ptep_clear_flush_notify() macro in preparation for
that. It is able to call mmu_notifier_invalidate_range() once for the
entire range, but still calls ptep_clear_flush() per page since there is
no arch support for a batched version of this API yet.

No functional change intended.
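
A sketch of the intended usage, as wired up by a later patch in this
series when CoWing a whole folio (addr/pgcount describe the naturally
aligned run of ptes being replaced):

	/* Clear and flush pgcount ptes covering [addr, addr + pgcount * PAGE_SIZE) */
	ptep_clear_flush_range_notify(vma, addr, vmf->pte, pgcount);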

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/mmu_notifier.h | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 64a3e051c3c4..527aa89959b4 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -595,6 +595,24 @@ static inline void mmu_notifier_range_init_owner(
 	___pte;								\
 })

+#define	ptep_clear_flush_range_notify(__vma, __address, __ptep, __nr)	\
+({									\
+	struct vm_area_struct *___vma = (__vma);			\
+	unsigned long ___addr = (__address) & PAGE_MASK;		\
+	pte_t *___ptep = (__ptep);					\
+	int ___nr = (__nr);						\
+	struct mm_struct *___mm = ___vma->vm_mm;			\
+	int ___i;							\
+									\
+	for (___i = 0; ___i < ___nr; ___i++)				\
+		ptep_clear_flush(___vma,				\
+				___addr + (___i << PAGE_SHIFT),		\
+				___ptep + ___i);			\
+									\
+	mmu_notifier_invalidate_range(___mm, ___addr,			\
+				___addr + (___nr << PAGE_SHIFT));	\
+})
+
 #define pmdp_huge_clear_flush_notify(__vma, __haddr, __pmd)		\
 ({									\
 	unsigned long ___haddr = __haddr & HPAGE_PMD_MASK;		\
@@ -736,6 +754,19 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
 #define ptep_clear_young_notify ptep_test_and_clear_young
 #define pmdp_clear_young_notify pmdp_test_and_clear_young
 #define	ptep_clear_flush_notify ptep_clear_flush
+#define	ptep_clear_flush_range_notify(__vma, __address, __ptep, __nr)	\
+({									\
+	struct vm_area_struct *___vma = (__vma);			\
+	unsigned long ___addr = (__address) & PAGE_MASK;		\
+	pte_t *___ptep = (__ptep);					\
+	int ___nr = (__nr);						\
+	int ___i;							\
+									\
+	for (___i = 0; ___i < ___nr; ___i++)				\
+		ptep_clear_flush(___vma,				\
+				___addr + (___i << PAGE_SHIFT),		\
+				___ptep + ___i);			\
+})
 #define pmdp_huge_clear_flush_notify pmdp_huge_clear_flush
 #define pudp_huge_clear_flush_notify pudp_huge_clear_flush
 #define set_pte_at_notify set_pte_at
--
2.25.1



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC v2 PATCH 13/17] mm: Implement folio_remove_rmap_range()
  2023-04-14 13:02 [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Ryan Roberts
                   ` (11 preceding siblings ...)
  2023-04-14 13:02 ` [RFC v2 PATCH 12/17] mm: ptep_clear_flush_range_notify() macro for batch operation Ryan Roberts
@ 2023-04-14 13:02 ` Ryan Roberts
  2023-04-14 13:03 ` [RFC v2 PATCH 14/17] mm: Copy large folios for anonymous memory Ryan Roberts
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ryan Roberts @ 2023-04-14 13:02 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao, Yin, Fengwei
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

Like page_remove_rmap() but batch-removes the rmap for a range of pages
belonging to a folio, for efficiency savings. All pages are accounted as
small pages.
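
A sketch of the intended usage, as taken up by later patches in this
series (called with the pte lock held; offset/pgcount describe the run
of pages within old_folio being unmapped):

	/* Take down the rmap for pgcount pages of old_folio in one call */
	folio_remove_rmap_range(old_folio, vmf->page - offset, pgcount, vma);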

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/rmap.h |  2 ++
 mm/rmap.c            | 62 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 8cb0ba48d58f..7daf25887049 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -204,6 +204,8 @@ void page_add_file_rmap(struct page *, struct vm_area_struct *,
 		bool compound);
 void page_remove_rmap(struct page *, struct vm_area_struct *,
 		bool compound);
+void folio_remove_rmap_range(struct folio *folio, struct page *page,
+		int nr, struct vm_area_struct *vma);

 void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long address, rmap_t flags);
diff --git a/mm/rmap.c b/mm/rmap.c
index 1cd8fb0b929f..954e44054d5c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1419,6 +1419,68 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
 	mlock_vma_folio(folio, vma, compound);
 }

+/**
+ * folio_remove_rmap_range - take down pte mappings from a range of pages
+ * belonging to a folio. All pages are accounted as small pages.
+ * @folio:	folio that all pages belong to
+ * @page:       first page in range to remove mapping from
+ * @nr:		number of pages in range to remove mapping from
+ * @vma:        the vm area from which the mapping is removed
+ *
+ * The caller needs to hold the pte lock.
+ */
+void folio_remove_rmap_range(struct folio *folio, struct page *page,
+					int nr, struct vm_area_struct *vma)
+{
+	atomic_t *mapped = &folio->_nr_pages_mapped;
+	int nr_unmapped = 0;
+	int nr_mapped;
+	bool last;
+	enum node_stat_item idx;
+
+	VM_BUG_ON_FOLIO(folio_test_hugetlb(folio), folio);
+
+	if (!folio_test_large(folio)) {
+		/* Is this the page's last map to be removed? */
+		last = atomic_add_negative(-1, &page->_mapcount);
+		nr_unmapped = last;
+	} else {
+		for (; nr != 0; nr--, page++) {
+			/* Is this the page's last map to be removed? */
+			last = atomic_add_negative(-1, &page->_mapcount);
+			if (last) {
+				/* Page still mapped if folio mapped entirely */
+				nr_mapped = atomic_dec_return_relaxed(mapped);
+				if (nr_mapped < COMPOUND_MAPPED)
+					nr_unmapped++;
+			}
+		}
+	}
+
+	if (nr_unmapped) {
+		idx = folio_test_anon(folio) ? NR_ANON_MAPPED : NR_FILE_MAPPED;
+		__lruvec_stat_mod_folio(folio, idx, -nr_unmapped);
+
+		/*
+		 * Queue anon THP for deferred split if we have just unmapped at
+		 * least 1 page, while at least 1 page remains mapped.
+		 */
+		if (folio_test_large(folio) && folio_test_anon(folio))
+			if (nr_mapped)
+				deferred_split_folio(folio);
+	}
+
+	/*
+	 * It would be tidy to reset folio_test_anon mapping when fully
+	 * unmapped, but that might overwrite a racing page_add_anon_rmap
+	 * which increments mapcount after us but sets mapping before us:
+	 * so leave the reset to free_pages_prepare, and remember that
+	 * it's only reliable while mapped.
+	 */
+
+	munlock_vma_folio(folio, vma, false);
+}
+
 /**
  * page_remove_rmap - take down pte mapping from a page
  * @page:	page to remove mapping from
--
2.25.1



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC v2 PATCH 14/17] mm: Copy large folios for anonymous memory
  2023-04-14 13:02 [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Ryan Roberts
                   ` (12 preceding siblings ...)
  2023-04-14 13:02 ` [RFC v2 PATCH 13/17] mm: Implement folio_remove_rmap_range() Ryan Roberts
@ 2023-04-14 13:03 ` Ryan Roberts
  2023-04-14 13:03 ` [RFC v2 PATCH 15/17] mm: Convert zero page to large folios on write Ryan Roberts
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ryan Roberts @ 2023-04-14 13:03 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao, Yin, Fengwei
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

When taking a write fault on an anonymous page, if we are unable to
reuse the folio (due to it being mapped by others), do CoW for the
entire folio instead of just a single page.

We assume that the size of the anonymous folio chosen at allocation time
is still a good choice and therefore it is better to copy the entire
folio rather than a single page. It does not seem wise to do this for
file-backed folios, since the folio size chosen there is related to the
system-wide usage of the file. So we continue to CoW a single page for
file-backed mappings.

There are edge cases where the original mapping has been mremapped or
partially munmapped, in which case the source folio may not be naturally
aligned in the virtual address space. When that happens, we CoW an
aligned, power-of-2 portion of the source folio. A similar effect occurs
when allocation of a high-order destination folio fails; in that case we
progressively reduce the order, falling back as far as order-0, until
allocation succeeds.
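
Since the destination folio is always naturally aligned, the copy window
is derived from the faulting address roughly as follows (a sketch of the
arithmetic used in the hunks below):

	pgcount = 1 << order;	/* order of the folio we managed to allocate */
	addr    = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
	offset  = (vmf->address - addr) >> PAGE_SHIFT;
	/* copy pgcount pages, starting at vmf->page - offset, into the new folio */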

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/memory.c | 242 ++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 207 insertions(+), 35 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index f2b7cfb2efc0..61cec97a57f3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3086,6 +3086,30 @@ static inline int check_ptes_none(pte_t *pte, int nr)
 	return nr;
 }

+/*
+ * Returns index of first pte that is not mapped RO and physically contiguously
+ * starting at pfn, or nr if all are correct.
+ */
+static inline int check_ptes_contig_ro(pte_t *pte, int nr, unsigned long pfn)
+{
+	int i;
+	pte_t entry;
+
+	for (i = 0; i < nr; i++) {
+		entry = *pte++;
+
+		if (!pte_present(entry) ||
+		    pte_write(entry) ||
+		    pte_protnone(entry) ||
+		    pte_pfn(entry) != pfn)
+			return i;
+
+		pfn++;
+	}
+
+	return nr;
+}
+
 static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
 {
 	/*
@@ -3155,6 +3179,94 @@ static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
 	return order;
 }

+static int calc_anon_folio_order_copy(struct vm_fault *vmf,
+					struct folio *old_folio, int order)
+{
+	/*
+	 * The aim here is to determine what size of folio we should allocate as
+	 * the destination for this CoW fault. Factors include:
+	 * - Order must not be higher than `order` upon entry
+	 * - Folio must be naturally aligned within VA space
+	 * - Folio must not breach boundaries of vma
+	 * - Folio must be fully contained inside one pmd entry
+	 * - All covered ptes must be present, physically contiguous and RO
+	 * - All covered ptes must be mapped to old_folio
+	 *
+	 * Additionally, we do not allow order-1 since this breaks assumptions
+	 * elsewhere in the mm; THP pages must be at least order-2 (since they
+	 * store state up to the 3rd struct page subpage), and these pages must
+	 * be THP in order to correctly use pre-existing THP infrastructure such
+	 * as folio_split().
+	 *
+	 * As a consequence of relying on the THP infrastructure, if the system
+	 * does not support THP, we always fallback to order-0.
+	 *
+	 * Note that old_folio may not be naturally aligned in VA space due to
+	 * mremap. We deliberately force alignment of the new folio to simplify
+	 * fallback, so in this unaligned case we will end up only copying a
+	 * portion of old_folio.
+	 *
+	 * Note that the caller may or may not choose to lock the pte. If
+	 * unlocked, the calculation should be considered an estimate that will
+	 * need to be validated under the lock.
+	 */
+
+	struct vm_area_struct *vma = vmf->vma;
+	int nr;
+	unsigned long addr;
+	pte_t *pte;
+	pte_t *first_bad = NULL;
+	int ret;
+	unsigned long start, end;
+	unsigned long offset;
+	unsigned long pfn;
+
+	if (has_transparent_hugepage()) {
+		order = min(order, PMD_SHIFT - PAGE_SHIFT);
+
+		start = page_addr(&old_folio->page, vmf->page, vmf->address);
+		start = max(start, vma->vm_start);
+
+		end = page_addr(&old_folio->page + folio_nr_pages(old_folio),
+			vmf->page, vmf->address);
+		end = min(end, vma->vm_end);
+
+		for (; order > 1; order--) {
+			nr = 1 << order;
+			addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
+			offset = ((vmf->address - addr) >> PAGE_SHIFT);
+			pfn = page_to_pfn(vmf->page) - offset;
+			pte = vmf->pte - offset;
+
+			/* Check vma and folio bounds. */
+			if (addr < start ||
+			    addr + (nr << PAGE_SHIFT) > end)
+				continue;
+
+			/* Ptes covered by order already known to be good. */
+			if (pte + nr <= first_bad)
+				break;
+
+			/* Already found bad pte in range covered by order. */
+			if (pte <= first_bad)
+				continue;
+
+			/* Need to check if all the ptes are good. */
+			ret = check_ptes_contig_ro(pte, nr, pfn);
+			if (ret == nr)
+				break;
+
+			first_bad = pte + ret;
+		}
+
+		if (order == 1)
+			order = 0;
+	} else
+		order = 0;
+
+	return order;
+}
+
 static void calc_anon_folio_range_reuse(struct vm_fault *vmf,
 					struct folio *folio,
 					struct anon_folio_range *range_out)
@@ -3366,6 +3478,14 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	int page_copied = 0;
 	struct mmu_notifier_range range;
 	int ret;
+	pte_t orig_pte;
+	unsigned long addr = vmf->address;
+	int order = 0;
+	int pgcount = BIT(order);
+	unsigned long offset = 0;
+	unsigned long pfn;
+	struct page *page;
+	int i;

 	delayacct_wpcopy_start();

@@ -3375,20 +3495,39 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		goto oom;

 	if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
-		new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address,
-									0, 0);
+		new_folio = vma_alloc_movable_folio(vma, vmf->address, 0, true);
 		if (!new_folio)
 			goto oom;
 	} else {
-		new_folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma,
-				vmf->address, false);
+		if (old_folio && folio_test_anon(old_folio)) {
+			order = min_t(int, folio_order(old_folio),
+						max_anon_folio_order(vma));
+retry:
+			/*
+			 * Estimate the folio order to allocate. We are not
+			 * under the ptl here so this estimate needs to be
+			 * re-checked later once we have the lock.
+			 */
+			vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+			order = calc_anon_folio_order_copy(vmf, old_folio, order);
+			pte_unmap(vmf->pte);
+		}
+
+		new_folio = try_vma_alloc_movable_folio(vma, vmf->address,
+							order, false);
 		if (!new_folio)
 			goto oom;

+		/* We may have been granted less than we asked for. */
+		order = folio_order(new_folio);
+		pgcount = BIT(order);
+		addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
+		offset = ((vmf->address - addr) >> PAGE_SHIFT);
+
 		if (likely(old_folio))
 			ret = __wp_page_copy_user_range(&new_folio->page,
-							vmf->page,
-							1, vmf->address, vma);
+							vmf->page - offset,
+							pgcount, addr, vma);
 		else
 			ret = __wp_page_copy_user_pfn(&new_folio->page, vmf);
 		if (ret) {
@@ -3410,39 +3549,31 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)

 	if (mem_cgroup_charge(new_folio, mm, GFP_KERNEL))
 		goto oom_free_new;
-	cgroup_throttle_swaprate(&new_folio->page, GFP_KERNEL);
+	folio_throttle_swaprate(new_folio, GFP_KERNEL);

 	__folio_mark_uptodate(new_folio);

 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
-				vmf->address & PAGE_MASK,
-				(vmf->address & PAGE_MASK) + PAGE_SIZE);
+				addr, addr + (pgcount << PAGE_SHIFT));
 	mmu_notifier_invalidate_range_start(&range);

 	/*
-	 * Re-check the pte - we dropped the lock
+	 * Re-check the pte(s) - we dropped the lock
 	 */
-	vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
-	if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
+	vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
+	pfn = pte_pfn(vmf->orig_pte) - offset;
+	if (likely(check_ptes_contig_ro(vmf->pte, pgcount, pfn) == pgcount)) {
 		if (old_folio) {
 			if (!folio_test_anon(old_folio)) {
+				VM_BUG_ON(order != 0);
 				dec_mm_counter(mm, mm_counter_file(&old_folio->page));
 				inc_mm_counter(mm, MM_ANONPAGES);
 			}
 		} else {
+			VM_BUG_ON(order != 0);
 			inc_mm_counter(mm, MM_ANONPAGES);
 		}
-		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
-		entry = mk_pte(&new_folio->page, vma->vm_page_prot);
-		entry = pte_sw_mkyoung(entry);
-		if (unlikely(unshare)) {
-			if (pte_soft_dirty(vmf->orig_pte))
-				entry = pte_mksoft_dirty(entry);
-			if (pte_uffd_wp(vmf->orig_pte))
-				entry = pte_mkuffd_wp(entry);
-		} else {
-			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-		}
+		flush_cache_range(vma, addr, addr + (pgcount << PAGE_SHIFT));

 		/*
 		 * Clear the pte entry and flush it first, before updating the
@@ -3451,17 +3582,40 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		 * that left a window where the new PTE could be loaded into
 		 * some TLBs while the old PTE remains in others.
 		 */
-		ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
-		folio_add_new_anon_rmap(new_folio, vma, vmf->address);
+		ptep_clear_flush_range_notify(vma, addr, vmf->pte, pgcount);
+		folio_ref_add(new_folio, pgcount - 1);
+		folio_add_new_anon_rmap_range(new_folio, &new_folio->page,
+							pgcount, vma, addr);
 		folio_add_lru_vma(new_folio, vma);
 		/*
 		 * We call the notify macro here because, when using secondary
 		 * mmu page tables (such as kvm shadow page tables), we want the
 		 * new page to be mapped directly into the secondary page table.
 		 */
-		BUG_ON(unshare && pte_write(entry));
-		set_pte_at_notify(mm, vmf->address, vmf->pte, entry);
-		update_mmu_cache(vma, vmf->address, vmf->pte);
+		page = &new_folio->page;
+		for (i = 0; i < pgcount; i++, page++) {
+			entry = mk_pte(page, vma->vm_page_prot);
+			entry = pte_sw_mkyoung(entry);
+			if (unlikely(unshare)) {
+				orig_pte = vmf->pte[i];
+				if (pte_soft_dirty(orig_pte))
+					entry = pte_mksoft_dirty(entry);
+				if (pte_uffd_wp(orig_pte))
+					entry = pte_mkuffd_wp(entry);
+			} else {
+				entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+			}
+			/*
+			 * TODO: Batch for !unshare case. Could use set_ptes(),
+			 * but currently there is no arch-agnostic way to
+			 * increment pte values by pfn so can't do the notify
+			 * part. So currently stuck creating the pte from
+			 * scratch every iteration.
+			 */
+			set_pte_at_notify(mm, addr + (i << PAGE_SHIFT),
+						vmf->pte + i, entry);
+		}
+		update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
 		if (old_folio) {
 			/*
 			 * Only after switching the pte to the new page may
@@ -3473,10 +3627,10 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 			 * threads.
 			 *
 			 * The critical issue is to order this
-			 * page_remove_rmap with the ptp_clear_flush above.
-			 * Those stores are ordered by (if nothing else,)
+			 * folio_remove_rmap_range with the ptp_clear_flush
+			 * above. Those stores are ordered by (if nothing else,)
 			 * the barrier present in the atomic_add_negative
-			 * in page_remove_rmap.
+			 * in folio_remove_rmap_range.
 			 *
 			 * Then the TLB flush in ptep_clear_flush ensures that
 			 * no process can access the old page before the
@@ -3485,14 +3639,30 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 			 * mapcount is visible. So transitively, TLBs to
 			 * old page will be flushed before it can be reused.
 			 */
-			page_remove_rmap(vmf->page, vma, false);
+			folio_remove_rmap_range(old_folio,
+						vmf->page - offset,
+						pgcount, vma);
 		}

 		/* Free the old page.. */
 		new_folio = old_folio;
 		page_copied = 1;
 	} else {
-		update_mmu_tlb(vma, vmf->address, vmf->pte);
+		pte_t *pte = vmf->pte + ((vmf->address - addr) >> PAGE_SHIFT);
+
+		/*
+		 * If faulting pte was serviced by another, exit early. Else try
+		 * again, with a lower order.
+		 */
+		if (order > 0 && pte_same(*pte, vmf->orig_pte)) {
+			pte_unmap_unlock(vmf->pte, vmf->ptl);
+			mmu_notifier_invalidate_range_only_end(&range);
+			folio_put(new_folio);
+			order--;
+			goto retry;
+		}
+
+		update_mmu_tlb(vma, vmf->address, pte);
 	}

 	if (new_folio)
@@ -3505,9 +3675,11 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	 */
 	mmu_notifier_invalidate_range_only_end(&range);
 	if (old_folio) {
-		if (page_copied)
+		if (page_copied) {
 			free_swap_cache(&old_folio->page);
-		folio_put(old_folio);
+			folio_put_refs(old_folio, pgcount);
+		} else
+			folio_put(old_folio);
 	}

 	delayacct_wpcopy_end();
--
2.25.1



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC v2 PATCH 15/17] mm: Convert zero page to large folios on write
  2023-04-14 13:02 [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Ryan Roberts
                   ` (13 preceding siblings ...)
  2023-04-14 13:03 ` [RFC v2 PATCH 14/17] mm: Copy large folios for anonymous memory Ryan Roberts
@ 2023-04-14 13:03 ` Ryan Roberts
  2023-04-14 13:03 ` [RFC v2 PATCH 16/17] mm: mmap: Align unhinted maps to highest anon folio order Ryan Roberts
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ryan Roberts @ 2023-04-14 13:03 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao, Yin, Fengwei
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

A read fault causes the zero page to be mapped read-only. A subsequent
write fault causes the zero page to be replaced with a zero-filled
private anonymous page. Change the write fault behaviour to replace the
zero page with a large anonymous folio, allocated using the same policy
as if the write fault had happened without the previous read fault.

Experimentation shows that reading multiple contiguous pages is
extremely rare without interleaved writes, so we don't bother to map a
large zero page. We just use the small zero page as a marker and expand
the allocation at the write fault.
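
The resulting policy for choosing the upper bound on the folio order at
CoW time is, in outline (a condensed sketch of the hunk below):

	if (is_zero_pfn(pte_pfn(vmf->orig_pte)))
		order = max_anon_folio_order(vma);	/* same policy as an ordinary write fault */
	else if (old_folio && folio_test_anon(old_folio))
		order = min_t(int, folio_order(old_folio), max_anon_folio_order(vma));
	else
		order = 0;				/* e.g. file-backed: CoW a single page */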

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/memory.c | 115 ++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 80 insertions(+), 35 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 61cec97a57f3..fac686e9f895 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3110,6 +3110,23 @@ static inline int check_ptes_contig_ro(pte_t *pte, int nr, unsigned long pfn)
 	return nr;
 }

+/*
+ * Checks that all ptes are none except for the pte at offset, which should be
+ * entry. Returns index of first pte that does not meet expectations, or nr if
+ * all are correct.
+ */
+static inline int check_ptes_none_or_entry(pte_t *pte, int nr,
+					pte_t entry, unsigned long offset)
+{
+	int ret;
+
+	ret = check_ptes_none(pte, offset);
+	if (ret == offset && pte_same(pte[offset], entry))
+		ret += 1 + check_ptes_none(pte + offset + 1, nr - offset - 1);
+
+	return ret;
+}
+
 static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
 {
 	/*
@@ -3141,6 +3158,7 @@ static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
 	pte_t *pte;
 	pte_t *first_set = NULL;
 	int ret;
+	unsigned long offset;

 	if (has_transparent_hugepage()) {
 		order = min(order, PMD_SHIFT - PAGE_SHIFT);
@@ -3148,7 +3166,8 @@ static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
 		for (; order > 1; order--) {
 			nr = 1 << order;
 			addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
-			pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
+			offset = ((vmf->address - addr) >> PAGE_SHIFT);
+			pte = vmf->pte - offset;

 			/* Check vma bounds. */
 			if (addr < vma->vm_start ||
@@ -3163,8 +3182,9 @@ static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
 			if (pte <= first_set)
 				continue;

-			/* Need to check if all the ptes are none. */
-			ret = check_ptes_none(pte, nr);
+			/* Need to check if all the ptes are none or entry. */
+			ret = check_ptes_none_or_entry(pte, nr,
+							vmf->orig_pte, offset);
 			if (ret == nr)
 				break;

@@ -3479,13 +3499,15 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	struct mmu_notifier_range range;
 	int ret;
 	pte_t orig_pte;
-	unsigned long addr = vmf->address;
-	int order = 0;
-	int pgcount = BIT(order);
-	unsigned long offset = 0;
+	unsigned long addr;
+	int order;
+	int pgcount;
+	unsigned long offset;
 	unsigned long pfn;
 	struct page *page;
 	int i;
+	bool zero;
+	bool anon;

 	delayacct_wpcopy_start();

@@ -3494,36 +3516,54 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;

+	/*
+	 * Set the upper bound of the folio allocation order. If we hit a zero
+	 * page, we allocate a folio with the same policy as allocation upon
+	 * write fault. If we are copying an anon folio, then limit ourself to
+	 * write fault. If we are copying an anon folio, then limit ourselves to
+	 * other cases (e.g. file-mapped) CoW a single page.
+	 */
 	if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
-		new_folio = vma_alloc_movable_folio(vma, vmf->address, 0, true);
-		if (!new_folio)
-			goto oom;
-	} else {
-		if (old_folio && folio_test_anon(old_folio)) {
-			order = min_t(int, folio_order(old_folio),
+		zero = true;
+		anon = false;
+		order = max_anon_folio_order(vma);
+	} else if (old_folio && folio_test_anon(old_folio)) {
+		zero = false;
+		anon = true;
+		order = min_t(int, folio_order(old_folio),
 						max_anon_folio_order(vma));
+	} else {
+		zero = false;
+		anon = false;
+		order = 0;
+	}
+
 retry:
-			/*
-			 * Estimate the folio order to allocate. We are not
-			 * under the ptl here so this estimate needs to be
-			 * re-checked later once we have the lock.
-			 */
-			vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
-			order = calc_anon_folio_order_copy(vmf, old_folio, order);
-			pte_unmap(vmf->pte);
-		}
+	/*
+	 * Estimate the folio order to allocate. We are not under the ptl here
+	 * so this estimate needs to be re-checked later once we have the lock.
+	 */
+	if (zero || anon) {
+		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+		order = zero ? calc_anon_folio_order_alloc(vmf, order) :
+			calc_anon_folio_order_copy(vmf, old_folio, order);
+		pte_unmap(vmf->pte);
+	}

-		new_folio = try_vma_alloc_movable_folio(vma, vmf->address,
-							order, false);
-		if (!new_folio)
-			goto oom;
+	/* Allocate the new folio. */
+	new_folio = try_vma_alloc_movable_folio(vma, vmf->address, order, zero);
+	if (!new_folio)
+		goto oom;

-		/* We may have been granted less than we asked for. */
-		order = folio_order(new_folio);
-		pgcount = BIT(order);
-		addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
-		offset = ((vmf->address - addr) >> PAGE_SHIFT);
+	/* We may have been granted less than we asked for. */
+	order = folio_order(new_folio);
+	pgcount = BIT(order);
+	addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
+	offset = ((vmf->address - addr) >> PAGE_SHIFT);
+	pfn = pte_pfn(vmf->orig_pte) - offset;

+	/* Copy contents. */
+	if (!zero) {
 		if (likely(old_folio))
 			ret = __wp_page_copy_user_range(&new_folio->page,
 							vmf->page - offset,
@@ -3561,8 +3601,14 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	 * Re-check the pte(s) - we dropped the lock
 	 */
 	vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
-	pfn = pte_pfn(vmf->orig_pte) - offset;
-	if (likely(check_ptes_contig_ro(vmf->pte, pgcount, pfn) == pgcount)) {
+
+	if (zero)
+		ret = check_ptes_none_or_entry(vmf->pte, pgcount,
+						vmf->orig_pte, offset);
+	else
+		ret = check_ptes_contig_ro(vmf->pte, pgcount, pfn);
+
+	if (likely(ret == pgcount)) {
 		if (old_folio) {
 			if (!folio_test_anon(old_folio)) {
 				VM_BUG_ON(order != 0);
@@ -3570,8 +3616,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 				inc_mm_counter(mm, MM_ANONPAGES);
 			}
 		} else {
-			VM_BUG_ON(order != 0);
-			inc_mm_counter(mm, MM_ANONPAGES);
+			add_mm_counter(mm, MM_ANONPAGES, pgcount);
 		}
 		flush_cache_range(vma, addr, addr + (pgcount << PAGE_SHIFT));

--
2.25.1



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC v2 PATCH 16/17] mm: mmap: Align unhinted maps to highest anon folio order
  2023-04-14 13:02 [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Ryan Roberts
                   ` (14 preceding siblings ...)
  2023-04-14 13:03 ` [RFC v2 PATCH 15/17] mm: Convert zero page to large folios on write Ryan Roberts
@ 2023-04-14 13:03 ` Ryan Roberts
  2023-04-17  8:25   ` Yin, Fengwei
  2023-04-14 13:03 ` [RFC v2 PATCH 17/17] mm: Batch-zap large anonymous folio PTE mappings Ryan Roberts
                   ` (3 subsequent siblings)
  19 siblings, 1 reply; 44+ messages in thread
From: Ryan Roberts @ 2023-04-14 13:03 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao, Yin, Fengwei
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

When allocating large anonymous folios, we want to maximize our chances
of being able to use the highest order we support. Since one of the
constraints is that a folio has to be mapped naturally aligned, let's
have mmap default to that alignment when user space does not provide a
hint.

With this in place, when compiling the kernel, an extra 2% of all
allocated anonymous memory belongs to a folio of the highest order.
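
For example, with 4K base pages (PAGE_SHIFT == 12) and the
ANON_FOLIO_ORDER_MAX of 4 proposed earlier in the series, the mask works
out as:

	info.align_mask = BIT(12 + 4) - 1;	/* 0xffff: 64K-align unhinted maps */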

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/mmap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index ff68a67a2a7c..e7652001a32e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1627,7 +1627,7 @@ generic_get_unmapped_area(struct file *filp, unsigned long addr,
 	info.length = len;
 	info.low_limit = mm->mmap_base;
 	info.high_limit = mmap_end;
-	info.align_mask = 0;
+	info.align_mask = BIT(PAGE_SHIFT + ANON_FOLIO_ORDER_MAX) - 1;
 	info.align_offset = 0;
 	return vm_unmapped_area(&info);
 }
@@ -1677,7 +1677,7 @@ generic_get_unmapped_area_topdown(struct file *filp, unsigned long addr,
 	info.length = len;
 	info.low_limit = max(PAGE_SIZE, mmap_min_addr);
 	info.high_limit = arch_get_mmap_base(addr, mm->mmap_base);
-	info.align_mask = 0;
+	info.align_mask = BIT(PAGE_SHIFT + ANON_FOLIO_ORDER_MAX) - 1;
 	info.align_offset = 0;
 	addr = vm_unmapped_area(&info);

--
2.25.1



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC v2 PATCH 17/17] mm: Batch-zap large anonymous folio PTE mappings
  2023-04-14 13:02 [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Ryan Roberts
                   ` (15 preceding siblings ...)
  2023-04-14 13:03 ` [RFC v2 PATCH 16/17] mm: mmap: Align unhinted maps to highest anon folio order Ryan Roberts
@ 2023-04-14 13:03 ` Ryan Roberts
  2023-04-17  8:04 ` [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Yin, Fengwei
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ryan Roberts @ 2023-04-14 13:03 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao, Yin, Fengwei
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

This allows the rmap removal to be batched with
folio_remove_rmap_range(), so in the common case we avoid spuriously
adding a partially unmapped folio to the deferred split queue, which
reduces split queue lock contention.

Previously each page was removed from the rmap individually with
page_remove_rmap(). If the first page belonged to a large folio, this
would cause page_remove_rmap() to conclude that the folio was now
partially mapped and add the folio to the deferred split queue. But
subsequent calls would cause the folio to become fully unmapped, meaning
there is no value to adding it to the split queue.
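
In outline, the per-page rmap update is hoisted out of the pte-clearing
loop and done once per contiguous run (a condensed sketch of
zap_anon_pte_range() from the hunk below, with the TLB-full and bad-pte
handling omitted):

	for (i = 0; i < pgcount; i++, page++, pte++, addr += PAGE_SIZE) {
		ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
		tlb_remove_tlb_entry(tlb, pte, addr);
		__tlb_remove_page(tlb, page, 0);
	}
	folio_remove_rmap_range(folio, page - pgcount, pgcount, vma);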

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/memory.c | 139 ++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 119 insertions(+), 20 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index fac686e9f895..e1cb4bf6fd5d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1351,6 +1351,95 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
 	pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
 }

+static inline unsigned long page_addr(struct page *page,
+				struct page *anchor, unsigned long anchor_addr)
+{
+	unsigned long offset;
+	unsigned long addr;
+
+	offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
+	addr = anchor_addr + offset;
+
+	if (anchor > page) {
+		if (addr > anchor_addr)
+			return 0;
+	} else {
+		if (addr < anchor_addr)
+			return ULONG_MAX;
+	}
+
+	return addr;
+}
+
+static int calc_anon_folio_map_pgcount(struct folio *folio,
+				       struct page *page, pte_t *pte,
+				       unsigned long addr, unsigned long end)
+{
+	pte_t ptent;
+	int floops;
+	int i;
+	unsigned long pfn;
+
+	end = min(page_addr(&folio->page + folio_nr_pages(folio), page, addr),
+		  end);
+	floops = (end - addr) >> PAGE_SHIFT;
+	pfn = page_to_pfn(page);
+	pfn++;
+	pte++;
+
+	for (i = 1; i < floops; i++) {
+		ptent = *pte;
+
+		if (!pte_present(ptent) ||
+		    pte_pfn(ptent) != pfn) {
+			return i;
+		}
+
+		pfn++;
+		pte++;
+	}
+
+	return floops;
+}
+
+static unsigned long zap_anon_pte_range(struct mmu_gather *tlb,
+					struct vm_area_struct *vma,
+					struct page *page, pte_t *pte,
+					unsigned long addr, unsigned long end,
+					bool *full_out)
+{
+	struct folio *folio = page_folio(page);
+	struct mm_struct *mm = tlb->mm;
+	pte_t ptent;
+	int pgcount;
+	int i;
+	bool full;
+
+	pgcount = calc_anon_folio_map_pgcount(folio, page, pte, addr, end);
+
+	for (i = 0; i < pgcount;) {
+		ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+		tlb_remove_tlb_entry(tlb, pte, addr);
+		full = __tlb_remove_page(tlb, page, 0);
+
+		if (unlikely(page_mapcount(page) < 1))
+			print_bad_pte(vma, addr, ptent, page);
+
+		i++;
+		page++;
+		pte++;
+		addr += PAGE_SIZE;
+
+		if (unlikely(full))
+			break;
+	}
+
+	folio_remove_rmap_range(folio, page - i, i, vma);
+
+	*full_out = full;
+	return i;
+}
+
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pmd_t *pmd,
 				unsigned long addr, unsigned long end,
@@ -1387,6 +1476,36 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			page = vm_normal_page(vma, addr, ptent);
 			if (unlikely(!should_zap_page(details, page)))
 				continue;
+
+			/*
+			 * Batch zap large anonymous folio mappings. This allows
+			 * batching the rmap removal, which means we avoid
+			 * spuriously adding a partially unmapped folio to the
+			 * deferred split queue in the common case, which
+			 * reduces split queue lock contention. Require the VMA
+			 * to be anonymous to ensure that none of the PTEs in
+			 * the range require zap_install_uffd_wp_if_needed().
+			 */
+			if (page && PageAnon(page) && vma_is_anonymous(vma)) {
+				bool full;
+				int pgcount;
+
+				pgcount = zap_anon_pte_range(tlb, vma,
+						page, pte, addr, end, &full);
+
+				rss[mm_counter(page)] -= pgcount;
+				pgcount--;
+				pte += pgcount;
+				addr += pgcount << PAGE_SHIFT;
+
+				if (unlikely(full)) {
+					force_flush = 1;
+					addr += PAGE_SIZE;
+					break;
+				}
+				continue;
+			}
+
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
 			tlb_remove_tlb_entry(tlb, pte, addr);
@@ -3051,26 +3170,6 @@ struct anon_folio_range {
 	bool exclusive;
 };

-static inline unsigned long page_addr(struct page *page,
-				struct page *anchor, unsigned long anchor_addr)
-{
-	unsigned long offset;
-	unsigned long addr;
-
-	offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
-	addr = anchor_addr + offset;
-
-	if (anchor > page) {
-		if (addr > anchor_addr)
-			return 0;
-	} else {
-		if (addr < anchor_addr)
-			return ULONG_MAX;
-	}
-
-	return addr;
-}
-
 /*
  * Returns index of first pte that is not none, or nr if all are none.
  */
--
2.25.1



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [RFC v2 PATCH 05/17] mm: Routines to determine max anon folio allocation order
  2023-04-14 13:02 ` [RFC v2 PATCH 05/17] mm: Routines to determine max anon folio allocation order Ryan Roberts
@ 2023-04-14 14:09   ` Kirill A. Shutemov
  2023-04-14 14:38     ` Ryan Roberts
  0 siblings, 1 reply; 44+ messages in thread
From: Kirill A. Shutemov @ 2023-04-14 14:09 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Yu Zhao, Yin, Fengwei, linux-mm, linux-arm-kernel

On Fri, Apr 14, 2023 at 02:02:51PM +0100, Ryan Roberts wrote:
> For variable-order anonymous folios, we want to tune the order that we
> prefer to allocate based on the vma. Add the routines to manage that
> heuristic.
> 
> TODO: Currently we always use the global maximum. Add per-vma logic!
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  include/linux/mm.h | 5 +++++
>  mm/memory.c        | 8 ++++++++
>  2 files changed, 13 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index cdb8c6031d0f..cc8d0b239116 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3674,4 +3674,9 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>  }
>  #endif
> 
> +/*
> + * TODO: Should this be set per-architecture?
> + */
> +#define ANON_FOLIO_ORDER_MAX	4
> +

I think it has to be derived from size in bytes, not directly specifies
page order. For 4K pages, order 4 is 64k and for 64k pages it is 1M.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 PATCH 05/17] mm: Routines to determine max anon folio allocation order
  2023-04-14 14:09   ` Kirill A. Shutemov
@ 2023-04-14 14:38     ` Ryan Roberts
  2023-04-14 15:37       ` Kirill A. Shutemov
  0 siblings, 1 reply; 44+ messages in thread
From: Ryan Roberts @ 2023-04-14 14:38 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Yu Zhao, Yin, Fengwei, linux-mm, linux-arm-kernel

On 14/04/2023 15:09, Kirill A. Shutemov wrote:
> On Fri, Apr 14, 2023 at 02:02:51PM +0100, Ryan Roberts wrote:
>> For variable-order anonymous folios, we want to tune the order that we
>> prefer to allocate based on the vma. Add the routines to manage that
>> heuristic.
>>
>> TODO: Currently we always use the global maximum. Add per-vma logic!
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  include/linux/mm.h | 5 +++++
>>  mm/memory.c        | 8 ++++++++
>>  2 files changed, 13 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index cdb8c6031d0f..cc8d0b239116 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3674,4 +3674,9 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>>  }
>>  #endif
>>
>> +/*
>> + * TODO: Should this be set per-architecture?
>> + */
>> +#define ANON_FOLIO_ORDER_MAX	4
>> +
> 
> I think it has to be derived from size in bytes, not directly specifies
> page order. For 4K pages, order 4 is 64k and for 64k pages it is 1M.
> 

Yes I see where you are coming from. What's your feel for what a sensible upper
bound in bytes is?

My difficulty is that I would like to be able to use this allocation mechanism
to enable using the "contiguous bit" on arm64; that's a set of contiguous PTEs
that are mapped to physically contiguous memory, and the HW can use that hint to
coalesce the TLB entries.

For 4KB pages, the contig size is 64KB (order-4), so that works nicely. But for
16KB and 64KB pages, its 2MB (order-7 and order-5 respectively). Do you think
allocating 2MB pages here is going to lead to too much memory wastage?



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 PATCH 05/17] mm: Routines to determine max anon folio allocation order
  2023-04-14 14:38     ` Ryan Roberts
@ 2023-04-14 15:37       ` Kirill A. Shutemov
  2023-04-14 16:06         ` Ryan Roberts
  0 siblings, 1 reply; 44+ messages in thread
From: Kirill A. Shutemov @ 2023-04-14 15:37 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Yu Zhao, Yin, Fengwei, linux-mm, linux-arm-kernel

On Fri, Apr 14, 2023 at 03:38:35PM +0100, Ryan Roberts wrote:
> On 14/04/2023 15:09, Kirill A. Shutemov wrote:
> > On Fri, Apr 14, 2023 at 02:02:51PM +0100, Ryan Roberts wrote:
> >> For variable-order anonymous folios, we want to tune the order that we
> >> prefer to allocate based on the vma. Add the routines to manage that
> >> heuristic.
> >>
> >> TODO: Currently we always use the global maximum. Add per-vma logic!
> >>
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >> ---
> >>  include/linux/mm.h | 5 +++++
> >>  mm/memory.c        | 8 ++++++++
> >>  2 files changed, 13 insertions(+)
> >>
> >> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >> index cdb8c6031d0f..cc8d0b239116 100644
> >> --- a/include/linux/mm.h
> >> +++ b/include/linux/mm.h
> >> @@ -3674,4 +3674,9 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> >>  }
> >>  #endif
> >>
> >> +/*
> >> + * TODO: Should this be set per-architecture?
> >> + */
> >> +#define ANON_FOLIO_ORDER_MAX	4
> >> +
> > 
> > I think it has to be derived from size in bytes, not directly specifies
> > page order. For 4K pages, order 4 is 64k and for 64k pages it is 1M.
> > 
> 
> Yes I see where you are coming from. What's your feel for what a sensible upper
> bound in bytes is?
> 
> My difficulty is that I would like to be able to use this allocation mechanism
> to enable using the "contiguous bit" on arm64; that's a set of contiguous PTEs
> that are mapped to physically contiguous memory, and the HW can use that hint to
> coalesce the TLB entries.
> 
> For 4KB pages, the contig size is 64KB (order-4), so that works nicely. But for
> 16KB and 64KB pages, its 2MB (order-7 and order-5 respectively). Do you think
> allocating 2MB pages here is going to lead to too much memory wastage?

I think it boils down to the specifics of the microarchitecture.

We can justify 2M PMD-mapped THP in many cases. But PMD-mapped THP not
only reduces TLB pressure (the contiguous bit does too, I believe), but
also saves one more memory access on the page table walk.

It may or may not matter for the processor. It has to be evaluated.

Maybe moving it to per-arch is the right way. With default in generic code
to be ilog2(SZ_64K >> PAGE_SIZE) or something.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 PATCH 05/17] mm: Routines to determine max anon folio allocation order
  2023-04-14 15:37       ` Kirill A. Shutemov
@ 2023-04-14 16:06         ` Ryan Roberts
  2023-04-14 16:18           ` Matthew Wilcox
  0 siblings, 1 reply; 44+ messages in thread
From: Ryan Roberts @ 2023-04-14 16:06 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Yu Zhao, Yin, Fengwei, linux-mm, linux-arm-kernel

On 14/04/2023 16:37, Kirill A. Shutemov wrote:
> On Fri, Apr 14, 2023 at 03:38:35PM +0100, Ryan Roberts wrote:
>> On 14/04/2023 15:09, Kirill A. Shutemov wrote:
>>> On Fri, Apr 14, 2023 at 02:02:51PM +0100, Ryan Roberts wrote:
>>>> For variable-order anonymous folios, we want to tune the order that we
>>>> prefer to allocate based on the vma. Add the routines to manage that
>>>> heuristic.
>>>>
>>>> TODO: Currently we always use the global maximum. Add per-vma logic!
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>>  include/linux/mm.h | 5 +++++
>>>>  mm/memory.c        | 8 ++++++++
>>>>  2 files changed, 13 insertions(+)
>>>>
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index cdb8c6031d0f..cc8d0b239116 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -3674,4 +3674,9 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>>>>  }
>>>>  #endif
>>>>
>>>> +/*
>>>> + * TODO: Should this be set per-architecture?
>>>> + */
>>>> +#define ANON_FOLIO_ORDER_MAX	4
>>>> +
>>>
>>> I think it has to be derived from size in bytes, not directly specifies
>>> page order. For 4K pages, order 4 is 64k and for 64k pages it is 1M.
>>>
>>
>> Yes I see where you are coming from. What's your feel for what a sensible upper
>> bound in bytes is?
>>
>> My difficulty is that I would like to be able to use this allocation mechanism
>> to enable using the "contiguous bit" on arm64; that's a set of contiguous PTEs
>> that are mapped to physically contiguous memory, and the HW can use that hint to
>> coalesce the TLB entries.
>>
>> For 4KB pages, the contig size is 64KB (order-4), so that works nicely. But for
>> 16KB and 64KB pages, its 2MB (order-7 and order-5 respectively). Do you think
>> allocating 2MB pages here is going to lead to too much memory wastage?
> 
> I think it boils down to the specifics of the microarchitecture.
> 
> We can justify 2M PMD-mapped THP in many cases. But PMD-mapped THP is not
> only reduces TLB pressure (that contiguous bit does too, I believe), but
> also saves one more memory access on page table walk.
> 
> It may or may not matter for the processor. It has to be evaluated.

I think you are saying that if the performance uplift is good, then some extra
memory wastage can be justified?

The point I'm thinking about is for 4K pages, we need to allocate 64K blocks to
use the contig bit. Roughly I guess that means going from average of 2K wastage
per anon VMA to 32K. Perhaps you can get away with that for a decent perf uplift.

But for 64K pages, we need to allocate 2M blocks to use the contig bit. So that
takes average wastage from 32K to 1M. That feels a bit harder to justify.
Perhaps here, we should make a decision based on MADV_HUGEPAGE?

So perhaps we actually want 2 values: one for if MADV_HUGEPAGE is not set on the
VMA, and one if it is? (with 64K pages I'm guessing there are many cases where
we won't PMD-map THPs - its 512MB).

> 
> Maybe moving it to per-arch is the right way. With default in generic code
> to be ilog2(SZ_64K >> PAGE_SIZE) or something.

Yes, I agree that sounds like a good starting point for the !MADV_HUGEPAGE case.



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 PATCH 05/17] mm: Routines to determine max anon folio allocation order
  2023-04-14 16:06         ` Ryan Roberts
@ 2023-04-14 16:18           ` Matthew Wilcox
  2023-04-14 16:31             ` Ryan Roberts
  0 siblings, 1 reply; 44+ messages in thread
From: Matthew Wilcox @ 2023-04-14 16:18 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Kirill A. Shutemov, Andrew Morton, Yu Zhao, Yin, Fengwei,
	linux-mm, linux-arm-kernel

On Fri, Apr 14, 2023 at 05:06:49PM +0100, Ryan Roberts wrote:
> The point I'm thinking about is for 4K pages, we need to allocate 64K blocks to
> use the contig bit. Roughly I guess that means going from average of 2K wastage
> per anon VMA to 32K. Perhaps you can get away with that for a decent perf uplift.
> 
> But for 64K pages, we need to allocate 2M blocks to use the contig bit. So that
> takes average wastage from 32K to 1M. That feels a bit harder to justify.
> Perhaps here, we should make a decision based on MADV_HUGEPAGE?
> 
> So perhaps we actually want 2 values: one for if MADV_HUGEPAGE is not set on the
> VMA, and one if it is? (with 64K pages I'm guessing there are many cases where
> we won't PMD-map THPs - its 512MB).

I'm kind of hoping that all this work takes away the benefit from
CONFIG_PAGE_SIZE_64K, and we can just use 4k pages everywhere.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 PATCH 05/17] mm: Routines to determine max anon folio allocation order
  2023-04-14 16:18           ` Matthew Wilcox
@ 2023-04-14 16:31             ` Ryan Roberts
  0 siblings, 0 replies; 44+ messages in thread
From: Ryan Roberts @ 2023-04-14 16:31 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kirill A. Shutemov, Andrew Morton, Yu Zhao, Yin, Fengwei,
	linux-mm, linux-arm-kernel

On 14/04/2023 17:18, Matthew Wilcox wrote:
> On Fri, Apr 14, 2023 at 05:06:49PM +0100, Ryan Roberts wrote:
>> The point I'm thinking about is for 4K pages, we need to allocate 64K blocks to
>> use the contig bit. Roughly I guess that means going from average of 2K wastage
>> per anon VMA to 32K. Perhaps you can get away with that for a decent perf uplift.
>>
>> But for 64K pages, we need to allocate 2M blocks to use the contig bit. So that
>> takes average wastage from 32K to 1M. That feels a bit harder to justify.
>> Perhaps here, we should make a decision based on MADV_HUGEPAGE?
>>
>> So perhaps we actually want 2 values: one for if MADV_HUGEPAGE is not set on the
>> VMA, and one if it is? (with 64K pages I'm guessing there are many cases where
>> we won't PMD-map THPs - it's 512MB).
> 
> I'm kind of hoping that all this work takes away the benefit from
> CONFIG_PAGE_SIZE_64K, and we can just use 4k pages everywhere.

That sounds great. I'm not sure I share your confidence though ;-)


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory
  2023-04-14 13:02 [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Ryan Roberts
                   ` (16 preceding siblings ...)
  2023-04-14 13:03 ` [RFC v2 PATCH 17/17] mm: Batch-zap large anonymous folio PTE mappings Ryan Roberts
@ 2023-04-17  8:04 ` Yin, Fengwei
  2023-04-17 10:19   ` Ryan Roberts
  2023-04-17  8:19 ` Yin, Fengwei
  2023-04-17 10:54 ` David Hildenbrand
  19 siblings, 1 reply; 44+ messages in thread
From: Yin, Fengwei @ 2023-04-17  8:04 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao
  Cc: linux-mm, linux-arm-kernel



On 4/14/2023 9:02 PM, Ryan Roberts wrote:
> Hi All,
> 
> This is a second RFC and my first proper attempt at implementing variable order,
> large folios for anonymous memory. The first RFC [1], was a partial
> implementation and a plea for help in debugging an issue I was hitting; thanks
> to Yin Fengwei and Matthew Wilcox for their advice in solving that!
> 
> The objective of variable order anonymous folios is to improve performance by
> allocating larger chunks of memory during anonymous page faults:
> 
>  - Since SW (the kernel) is dealing with larger chunks of memory than base
>    pages, there are efficiency savings to be had; fewer page faults, batched PTE
>    and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
>    overhead. This should benefit all architectures.
>  - Since we are now mapping physically contiguous chunks of memory, we can take
>    advantage of HW TLB compression techniques. A reduction in TLB pressure
>    speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>    TLB entries; "the contiguous bit" (architectural) and HPA (uarch) - see [2].
> 
> This patch set deals with the SW side of things only but sets us up nicely for
> taking advantage of the HW improvements in the near future.
> 
> I'm not yet benchmarking a wide variety of use cases, but those that I have
> looked at are positive; I see kernel compilation time improved by up to 10%,
> which I expect to improve further once I add in the arm64 "contiguous bit".
> Memory consumption is somewhere between 1% less and 2% more, depending on how
> its measured. More on perf and memory below.
> 
> The patches are based on v6.3-rc6 + patches 1-31 of [3] (which needed one minor
> conflict resolution). I have a tree at [4].
> 
> [1] https://lore.kernel.org/linux-mm/20230317105802.2634004-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/linux-mm/d347c5b0-0c0f-ae50-9613-2cf962d8676e@arm.com/
> [3] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
> [4] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc2
> 
> Approach
> ========
> 
> There are 4 fault paths that have been modified:
>  - write fault on unallocated address: do_anonymous_page()
>  - write fault on zero page: wp_page_copy()
>  - write fault on non-exclusive CoW page: wp_page_copy()
>  - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()
> 
> In the first 2 cases, we will determine the preferred order folio to allocate,
> limited by a max order (currently order-4; see below), VMA and PMD bounds, and
> state of neighboring PTEs. In the 3rd case, we aim to allocate the same order
> folio as the source, subject to constraints that may arise if the source has
> been mremapped or partially munmapped. And in the 4th case, we reuse as much of
> the folio as we can, subject to the same mremap/munmap constraints.
> 
> If allocation of our preferred folio order fails, we gracefully fall back to
> lower orders all the way to 0.
> 
> Note that none of this affects the behavior of traditional PMD-sized THP. If we
> take a fault in an MADV_HUGEPAGE region, you still get PMD-sized mappings.
> 
> Open Questions
> ==============
> 
> How to Move Forwards
> --------------------
> 
> While the series is a small-ish code change, it represents a big shift in the
> way things are done. So I'd appreciate any help in scaling up performance
> testing, review and general advice on how best to guide a change like this into
> the kernel.
> 
> Folio Allocation Order Policy
> -----------------------------
> 
> The current code is hardcoded to use a maximum order of 4. This was chosen for a
> couple of reasons:
>  - From the SW performance perspective, I see a knee around here where
>    increasing it doesn't lead to much more performance gain.
>  - Intuitively I assume that higher orders become increasingly difficult to
>    allocate.
>  - From the HW performance perspective, arm64's HPA works on order-2 blocks and
>    "the contiguous bit" works on order-4 for 4KB base pages (although it's
>    order-7 for 16KB and order-5 for 64KB), so there is no HW benefit to going
>    any higher.
> 
> I suggest that ultimately setting the max order should be left to the
> architecture. arm64 would take advantage of this and set it to the order
> required for the contiguous bit for the configured base page size.
> 
> However, I also have a (mild) concern about increased memory consumption. If an
> app has a pathological fault pattern (e.g. sparsely touches memory every 64KB)
> we would end up allocating 16x as much memory as we used to. One potential
> approach I see here is to track fault addresses per-VMA, and increase a per-VMA
> max allocation order for consecutive faults that extend a contiguous range, and
> decrement when discontiguous. Alternatively/additionally, we could use the VMA
> size as an indicator. I'd be interested in your thoughts/opinions.
> 
> Deferred Split Queue Lock Contention
> ------------------------------------
> 
> The results below show that we are spending a much greater proportion of time in
> the kernel when doing a kernel compile using 160 CPUs vs 8 CPUs.
> 
> I think this is (at least partially) related to contention on the deferred
> split queue lock. This is a per-memcg spinlock, which means a single spinlock
> shared among all 160 CPUs. I've solved part of the problem with the last patch
> in the series (which cuts down the need to take the lock), but at folio free
> time (free_transhuge_page()), the lock is still taken and I think this could be
> a problem. Now that most anonymous pages are large folios, this lock is taken a
> lot more.
> 
> I think we could probably avoid taking the lock unless !list_empty(), but I
> haven't convinced myself it's definitely safe, so haven't applied it yet.
Yes. It's safe. We also identified other lock contention with large folios
for anonymous mappings, such as the lru lock and the zone lock. My
understanding is that anonymous pages have a much higher alloc/free frequency
than page cache pages, so this lock contention was not exposed by large
folios for the page cache.
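
For reference, a sketch of the change being discussed, against the v6.3 code
(this is not the actual patch; the unlocked list_empty() check is exactly the
part in question above):

void free_transhuge_page(struct page *page)
{
	struct deferred_split *ds_queue = get_deferred_split_queue(page);
	unsigned long flags;

	/*
	 * Only take the per-memcg split_queue_lock if the page is actually
	 * on the deferred split queue; re-check under the lock because the
	 * shrinker can race with us.
	 */
	if (!list_empty(page_deferred_list(page))) {
		spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
		if (!list_empty(page_deferred_list(page))) {
			ds_queue->split_queue_len--;
			list_del(page_deferred_list(page));
		}
		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
	}

	free_compound_page(page);
}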


I posted the related patch to:
https://lore.kernel.org/linux-mm/20230417075643.3287513-1-fengwei.yin@intel.com/T/#t


Regards
Yin, Fengwei


> 
> Roadmap
> =======
> 
> Beyond scaling up perf testing, I'm planning to enable use of the "contiguous
> bit" on arm64 to validate predictions about HW speedups.
> 
> I also think there are some opportunities with madvise to split folios to non-0
> orders, which might improve performance in some cases. madvise is also mistaking
> exclusive large folios for non-exclusive ones at the moment (due to the "small
> pages" mapcount scheme), so that needs to be fixed so that MADV_FREE correctly
> frees the folio.
> 
> Results
> =======
> 
> Performance
> -----------
> 
> Test: Kernel Compilation, on Ampere Altra (160 CPU machine), with 8 jobs and
> with 160 jobs. First run discarded, next 3 runs averaged. Git repo cleaned
> before each run.
> 
> make defconfig && time make -jN Image
> 
> First with -j8:
> 
> |           | baseline time  | anonfolio time | percent change |
> |           | to compile (s) | to compile (s) | SMALLER=better |
> |-----------|---------------:|---------------:|---------------:|
> | real-time |          373.0 |          342.8 |          -8.1% |
> | user-time |         2333.9 |         2275.3 |          -2.5% |
> | sys-time  |          510.7 |          340.9 |         -33.3% |
> 
> Above shows 8.1% improvement in real time execution, and 33.3% saving in kernel
> execution. The next 2 tables show a breakdown of the cycles spent in the kernel
> for the 8 job config:
> 
> |                      | baseline | anonfolio | percent change |
> |                      | (cycles) | (cycles)  | SMALLER=better |
> |----------------------|---------:|----------:|---------------:|
> | data abort           |     683B |      316B |         -53.8% |
> | instruction abort    |      93B |       76B |         -18.4% |
> | syscall              |     887B |      767B |         -13.6% |
> 
> |                      | baseline | anonfolio | percent change |
> |                      | (cycles) | (cycles)  | SMALLER=better |
> |----------------------|---------:|----------:|---------------:|
> | arm64_sys_openat     |     194B |      188B |          -3.3% |
> | arm64_sys_exit_group |     192B |      124B |         -35.7% |
> | arm64_sys_read       |     124B |      108B |         -12.7% |
> | arm64_sys_execve     |      75B |       67B |         -11.0% |
> | arm64_sys_mmap       |      51B |       50B |          -3.0% |
> | arm64_sys_mprotect   |      15B |       13B |         -12.0% |
> | arm64_sys_write      |      43B |       42B |          -2.9% |
> | arm64_sys_munmap     |      15B |       12B |         -17.0% |
> | arm64_sys_newfstatat |      46B |       41B |          -9.7% |
> | arm64_sys_clone      |      26B |       24B |         -10.0% |
> 
> And now with -j160:
> 
> |           | baseline time  | anonfolio time | percent change |
> |           | to compile (s) | to compile (s) | SMALLER=better |
> |-----------|---------------:|---------------:|---------------:|
> | real-time |           53.7 |           48.2 |         -10.2% |
> | user-time |         2705.8 |         2842.1 |           5.0% |
> | sys-time  |         1370.4 |         1064.3 |         -22.3% |
> 
> Above shows a 10.2% improvement in real time execution. But ~3x more time is
> spent in the kernel than for the -j8 config. I think this is related to the lock
> contention issue I highlighted above, but haven't bottomed it out yet. It's also
> not yet clear to me why user-time increases by 5%.
> 
> I've also run all the will-it-scale microbenchmarks for a single task, using the
> process mode. Results for multiple runs on the same kernel are noisy - I see ~5%
> fluctuation. So I'm just calling out tests with results that have gt 5%
> improvement or lt -5% regression. Results are average of 3 runs. Only 2 tests
> are regressed:
> 
> | benchmark            | baseline | anonfolio | percent change |
> |                      | ops/s    | ops/s     | BIGGER=better  |
> | ---------------------|---------:|----------:|---------------:|
> | context_switch1.csv  |   328744 |    351150 |          6.8%  |
> | malloc1.csv          |    96214 |     50890 |        -47.1%  |
> | mmap1.csv            |   410253 |    375746 |         -8.4%  |
> | page_fault1.csv      |   624061 |   3185678 |        410.5%  |
> | page_fault2.csv      |   416483 |    557448 |         33.8%  |
> | page_fault3.csv      |   724566 |   1152726 |         59.1%  |
> | read1.csv            |  1806908 |   1905752 |          5.5%  |
> | read2.csv            |   587722 |   1942062 |        230.4%  |
> | tlb_flush1.csv       |   143910 |    152097 |          5.7%  |
> | tlb_flush2.csv       |   266763 |    322320 |         20.8%  |
> 
> I believe malloc1 is an unrealistic test, since it does malloc/free for 128M
> object in a loop and never touches the allocated memory. I think the malloc
> implementation is maintaining a header just before the allocated object, which
> causes a single page fault. Previously that page fault allocated 1 page. Now it
> is allocating 16 pages. This cost would be repaid if the test code wrote to the
> allocated object. Alternatively the folio allocation order policy described
> above would also solve this.
> 
> It is not clear to me why mmap1 has slowed down. This remains a todo.
> 
> Memory
> ------
> 
> I measured memory consumption while doing a kernel compile with 8 jobs on a
> system limited to 4GB RAM. I polled /proc/meminfo every 0.5 seconds during the
> workload, then calculated "memory used" high and low watermarks using both
> MemFree and MemAvailable. If there is a better way of measuring system memory
> consumption, please let me know!
> 
> mem-used = 4GB - /proc/meminfo:MemFree
> 
> |                      | baseline | anonfolio | percent change |
> |                      | (MB)     | (MB)      | SMALLER=better |
> |----------------------|---------:|----------:|---------------:|
> | mem-used-low         |      825 |       842 |           2.1% |
> | mem-used-high        |     2697 |      2672 |          -0.9% |
> 
> mem-used = 4GB - /proc/meminfo:MemAvailable
> 
> |                      | baseline | anonfolio | percent change |
> |                      | (MB)     | (MB)      | SMALLER=better |
> |----------------------|---------:|----------:|---------------:|
> | mem-used-low         |      518 |       530 |           2.3% |
> | mem-used-high        |     1522 |      1537 |           1.0% |
> 
> For the high watermark, the methods disagree; we are either saving 1% or using
> 1% more. For the low watermark, both methods agree that we are using about 2%
> more. I plan to investigate whether the proposed folio allocation order policy
> can reduce this to zero.
> 
> Thanks for making it this far!
> Ryan
> 
> 
> Ryan Roberts (17):
>   mm: Expose clear_huge_page() unconditionally
>   mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
>   mm: Introduce try_vma_alloc_movable_folio()
>   mm: Implement folio_add_new_anon_rmap_range()
>   mm: Routines to determine max anon folio allocation order
>   mm: Allocate large folios for anonymous memory
>   mm: Allow deferred splitting of arbitrary large anon folios
>   mm: Implement folio_move_anon_rmap_range()
>   mm: Update wp_page_reuse() to operate on range of pages
>   mm: Reuse large folios for anonymous memory
>   mm: Split __wp_page_copy_user() into 2 variants
>   mm: ptep_clear_flush_range_notify() macro for batch operation
>   mm: Implement folio_remove_rmap_range()
>   mm: Copy large folios for anonymous memory
>   mm: Convert zero page to large folios on write
>   mm: mmap: Align unhinted maps to highest anon folio order
>   mm: Batch-zap large anonymous folio PTE mappings
> 
>  arch/alpha/include/asm/page.h   |   5 +-
>  arch/arm64/include/asm/page.h   |   3 +-
>  arch/arm64/mm/fault.c           |   7 +-
>  arch/ia64/include/asm/page.h    |   5 +-
>  arch/m68k/include/asm/page_no.h |   7 +-
>  arch/s390/include/asm/page.h    |   5 +-
>  arch/x86/include/asm/page.h     |   5 +-
>  include/linux/highmem.h         |  23 +-
>  include/linux/mm.h              |   8 +-
>  include/linux/mmu_notifier.h    |  31 ++
>  include/linux/rmap.h            |   6 +
>  mm/memory.c                     | 877 ++++++++++++++++++++++++++++----
>  mm/mmap.c                       |   4 +-
>  mm/rmap.c                       | 147 +++++-
>  14 files changed, 1000 insertions(+), 133 deletions(-)
> 
> --
> 2.25.1
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory
  2023-04-14 13:02 [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Ryan Roberts
                   ` (17 preceding siblings ...)
  2023-04-17  8:04 ` [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Yin, Fengwei
@ 2023-04-17  8:19 ` Yin, Fengwei
  2023-04-17 10:28   ` Ryan Roberts
  2023-04-17 10:54 ` David Hildenbrand
  19 siblings, 1 reply; 44+ messages in thread
From: Yin, Fengwei @ 2023-04-17  8:19 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao
  Cc: linux-mm, linux-arm-kernel



On 4/14/2023 9:02 PM, Ryan Roberts wrote:
> Hi All,
> 
> This is a second RFC and my first proper attempt at implementing variable order,
> large folios for anonymous memory. The first RFC [1], was a partial
> implementation and a plea for help in debugging an issue I was hitting; thanks
> to Yin Fengwei and Matthew Wilcox for their advice in solving that!
> 
> The objective of variable order anonymous folios is to improve performance by
> allocating larger chunks of memory during anonymous page faults:
> 
>  - Since SW (the kernel) is dealing with larger chunks of memory than base
>    pages, there are efficiency savings to be had; fewer page faults, batched PTE
>    and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
>    overhead. This should benefit all architectures.
>  - Since we are now mapping physically contiguous chunks of memory, we can take
>    advantage of HW TLB compression techniques. A reduction in TLB pressure
>    speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>    TLB entries; "the contiguous bit" (architectural) and HPA (uarch) - see [2].
> 
> This patch set deals with the SW side of things only but sets us up nicely for
> taking advantage of the HW improvements in the near future.
> 
> I'm not yet benchmarking a wide variety of use cases, but those that I have
> looked at are positive; I see kernel compilation time improved by up to 10%,
> which I expect to improve further once I add in the arm64 "contiguous bit".
> Memory consumption is somewhere between 1% less and 2% more, depending on how
> its measured. More on perf and memory below.
> 
> The patches are based on v6.3-rc6 + patches 1-31 of [3] (which needed one minor
> conflict resolution). I have a tree at [4].
> 
> [1] https://lore.kernel.org/linux-mm/20230317105802.2634004-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/linux-mm/d347c5b0-0c0f-ae50-9613-2cf962d8676e@arm.com/
> [3] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
> [4] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc2
> 
> Approach
> ========
> 
> There are 4 fault paths that have been modified:
>  - write fault on unallocated address: do_anonymous_page()
>  - write fault on zero page: wp_page_copy()
>  - write fault on non-exclusive CoW page: wp_page_copy()
>  - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()
> 
> In the first 2 cases, we will determine the preferred order folio to allocate,
> limited by a max order (currently order-4; see below), VMA and PMD bounds, and
> state of neighboring PTEs. In the 3rd case, we aim to allocate the same order
> folio as the source, subject to constraints that may arise if the source has
> been mremapped or partially munmapped. And in the 4th case, we reuse as much of
> the folio as we can, subject to the same mremap/munmap constraints.
> 
> If allocation of our preferred folio order fails, we gracefully fall back to
> lower orders all the way to 0.
> 
> Note that none of this affects the behavior of traditional PMD-sized THP. If we
> take a fault in an MADV_HUGEPAGE region, you still get PMD-sized mappings.
> 
> Open Questions
> ==============
> 
> How to Move Forwards
> --------------------
> 
> While the series is a small-ish code change, it represents a big shift in the
> way things are done. So I'd appreciate any help in scaling up performance
> testing, review and general advice on how best to guide a change like this into
> the kernel.
> 
> Folio Allocation Order Policy
> -----------------------------
> 
> The current code is hardcoded to use a maximum order of 4. This was chosen for a
> couple of reasons:
>  - From the SW performance perspective, I see a knee around here where
>    increasing it doesn't lead to much more performance gain.
>  - Intuitively I assume that higher orders become increasingly difficult to
>    allocate.
>  - From the HW performance perspective, arm64's HPA works on order-2 blocks and
>    "the contiguous bit" works on order-4 for 4KB base pages (although it's
>    order-7 for 16KB and order-5 for 64KB), so there is no HW benefit to going
>    any higher.
> 
> I suggest that ultimately setting the max order should be left to the
> architecture. arm64 would take advantage of this and set it to the order
> required for the contiguous bit for the configured base page size.
> 
> However, I also have a (mild) concern about increased memory consumption. If an
> app has a pathological fault pattern (e.g. sparsely touches memory every 64KB)
> we would end up allocating 16x as much memory as we used to. One potential
> approach I see here is to track fault addresses per-VMA, and increase a per-VMA
> max allocation order for consecutive faults that extend a contiguous range, and
> decrement when discontiguous. Alternatively/additionally, we could use the VMA
> size as an indicator. I'd be interested in your thoughts/opinions.
> 
> Deferred Split Queue Lock Contention
> ------------------------------------
> 
> The results below show that we are spending a much greater proportion of time in
> the kernel when doing a kernel compile using 160 CPUs vs 8 CPUs.
> 
> I think this is (at least partially) related to contention on the deferred
> split queue lock. This is a per-memcg spinlock, which means a single spinlock
> shared among all 160 CPUs. I've solved part of the problem with the last patch
> in the series (which cuts down the need to take the lock), but at folio free
> time (free_transhuge_page()), the lock is still taken and I think this could be
> a problem. Now that most anonymous pages are large folios, this lock is taken a
> lot more.
> 
> I think we could probably avoid taking the lock unless !list_empty(), but I
> haven't convinced myself it's definitely safe, so haven't applied it yet.
> 
> Roadmap
> =======
> 
> Beyond scaling up perf testing, I'm planning to enable use of the "contiguous
> bit" on arm64 to validate predictions about HW speedups.
> 
> I also think there are some opportunities with madvise to split folios to non-0
> orders, which might improve performance in some cases. madvise is also mistaking
> exclusive large folios for non-exclusive ones at the moment (due to the "small
> pages" mapcount scheme), so that needs to be fixed so that MADV_FREE correctly
> frees the folio.
> 
> Results
> =======
> 
> Performance
> -----------
> 
> Test: Kernel Compilation, on Ampere Altra (160 CPU machine), with 8 jobs and
> with 160 jobs. First run discarded, next 3 runs averaged. Git repo cleaned
> before each run.
> 
> make defconfig && time make -jN Image
> 
> First with -j8:
> 
> |           | baseline time  | anonfolio time | percent change |
> |           | to compile (s) | to compile (s) | SMALLER=better |
> |-----------|---------------:|---------------:|---------------:|
> | real-time |          373.0 |          342.8 |          -8.1% |
> | user-time |         2333.9 |         2275.3 |          -2.5% |
> | sys-time  |          510.7 |          340.9 |         -33.3% |
> 
> Above shows 8.1% improvement in real time execution, and 33.3% saving in kernel
> execution. The next 2 tables show a breakdown of the cycles spent in the kernel
> for the 8 job config:
> 
> |                      | baseline | anonfolio | percent change |
> |                      | (cycles) | (cycles)  | SMALLER=better |
> |----------------------|---------:|----------:|---------------:|
> | data abort           |     683B |      316B |         -53.8% |
> | instruction abort    |      93B |       76B |         -18.4% |
> | syscall              |     887B |      767B |         -13.6% |
> 
> |                      | baseline | anonfolio | percent change |
> |                      | (cycles) | (cycles)  | SMALLER=better |
> |----------------------|---------:|----------:|---------------:|
> | arm64_sys_openat     |     194B |      188B |          -3.3% |
> | arm64_sys_exit_group |     192B |      124B |         -35.7% |
> | arm64_sys_read       |     124B |      108B |         -12.7% |
> | arm64_sys_execve     |      75B |       67B |         -11.0% |
> | arm64_sys_mmap       |      51B |       50B |          -3.0% |
> | arm64_sys_mprotect   |      15B |       13B |         -12.0% |
> | arm64_sys_write      |      43B |       42B |          -2.9% |
> | arm64_sys_munmap     |      15B |       12B |         -17.0% |
> | arm64_sys_newfstatat |      46B |       41B |          -9.7% |
> | arm64_sys_clone      |      26B |       24B |         -10.0% |
> 
> And now with -j160:
> 
> |           | baseline time  | anonfolio time | percent change |
> |           | to compile (s) | to compile (s) | SMALLER=better |
> |-----------|---------------:|---------------:|---------------:|
> | real-time |           53.7 |           48.2 |         -10.2% |
> | user-time |         2705.8 |         2842.1 |           5.0% |
> | sys-time  |         1370.4 |         1064.3 |         -22.3% |
> 
> Above shows a 10.2% improvement in real time execution. But ~3x more time is
> spent in the kernel than for the -j8 config. I think this is related to the lock
> contention issue I highlighted above, but haven't bottomed it out yet. It's also
> not yet clear to me why user-time increases by 5%.
> 
> I've also run all the will-it-scale microbenchmarks for a single task, using the
> process mode. Results for multiple runs on the same kernel are noisy - I see ~5%
> fluctuation. So I'm just calling out tests with results that have gt 5%
> improvement or lt -5% regression. Results are average of 3 runs. Only 2 tests
> are regressed:
> 
> | benchmark            | baseline | anonfolio | percent change |
> |                      | ops/s    | ops/s     | BIGGER=better  |
> | ---------------------|---------:|----------:|---------------:|
> | context_switch1.csv  |   328744 |    351150 |          6.8%  |
> | malloc1.csv          |    96214 |     50890 |        -47.1%  |
> | mmap1.csv            |   410253 |    375746 |         -8.4%  |
> | page_fault1.csv      |   624061 |   3185678 |        410.5%  |
> | page_fault2.csv      |   416483 |    557448 |         33.8%  |
> | page_fault3.csv      |   724566 |   1152726 |         59.1%  |
> | read1.csv            |  1806908 |   1905752 |          5.5%  |
> | read2.csv            |   587722 |   1942062 |        230.4%  |
> | tlb_flush1.csv       |   143910 |    152097 |          5.7%  |
> | tlb_flush2.csv       |   266763 |    322320 |         20.8%  |
> 
> I believe malloc1 is an unrealistic test, since it does malloc/free for 128M
> object in a loop and never touches the allocated memory. I think the malloc
> implementation is maintaining a header just before the allocated object, which
> causes a single page fault. Previously that page fault allocated 1 page. Now it
> is allocating 16 pages. This cost would be repaid if the test code wrote to the
> allocated object. Alternatively the folio allocation order policy described
> above would also solve this.
> 
> It is not clear to me why mmap1 has slowed down. This remains a todo.
> 
> Memory
> ------
> 
> I measured memory consumption while doing a kernel compile with 8 jobs on a
> system limited to 4GB RAM. I polled /proc/meminfo every 0.5 seconds during the
> workload, then calculated "memory used" high and low watermarks using both
> MemFree and MemAvailable. If there is a better way of measuring system memory
> consumption, please let me know!
> 
> mem-used = 4GB - /proc/meminfo:MemFree
> 
> |                      | baseline | anonfolio | percent change |
> |                      | (MB)     | (MB)      | SMALLER=better |
> |----------------------|---------:|----------:|---------------:|
> | mem-used-low         |      825 |       842 |           2.1% |
> | mem-used-high        |     2697 |      2672 |          -0.9% |
> 
> mem-used = 4GB - /proc/meminfo:MemAvailable
> 
> |                      | baseline | anonfolio | percent change |
> |                      | (MB)     | (MB)      | SMALLER=better |
> |----------------------|---------:|----------:|---------------:|
> | mem-used-low         |      518 |       530 |           2.3% |
> | mem-used-high        |     1522 |      1537 |           1.0% |
> 
> For the high watermark, the methods disagree; we are either saving 1% or using
> 1% more. For the low watermark, both methods agree that we are using about 2%
> more. I plan to investigate whether the proposed folio allocation order policy
> can reduce this to zero.

Besides the memory consumption, large folios could also impact the tail latency
of page allocation (extra memory zeroing, more operations on the slow path of
page allocation).

Again, my understanding is that the tail latency of page allocation matters more
for anonymous pages than for page cache pages, because anonymous pages are
allocated/freed more frequently.


Regards
Yin, Fengwei

> 
> Thanks for making it this far!
> Ryan
> 
> 
> Ryan Roberts (17):
>   mm: Expose clear_huge_page() unconditionally
>   mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
>   mm: Introduce try_vma_alloc_movable_folio()
>   mm: Implement folio_add_new_anon_rmap_range()
>   mm: Routines to determine max anon folio allocation order
>   mm: Allocate large folios for anonymous memory
>   mm: Allow deferred splitting of arbitrary large anon folios
>   mm: Implement folio_move_anon_rmap_range()
>   mm: Update wp_page_reuse() to operate on range of pages
>   mm: Reuse large folios for anonymous memory
>   mm: Split __wp_page_copy_user() into 2 variants
>   mm: ptep_clear_flush_range_notify() macro for batch operation
>   mm: Implement folio_remove_rmap_range()
>   mm: Copy large folios for anonymous memory
>   mm: Convert zero page to large folios on write
>   mm: mmap: Align unhinted maps to highest anon folio order
>   mm: Batch-zap large anonymous folio PTE mappings
> 
>  arch/alpha/include/asm/page.h   |   5 +-
>  arch/arm64/include/asm/page.h   |   3 +-
>  arch/arm64/mm/fault.c           |   7 +-
>  arch/ia64/include/asm/page.h    |   5 +-
>  arch/m68k/include/asm/page_no.h |   7 +-
>  arch/s390/include/asm/page.h    |   5 +-
>  arch/x86/include/asm/page.h     |   5 +-
>  include/linux/highmem.h         |  23 +-
>  include/linux/mm.h              |   8 +-
>  include/linux/mmu_notifier.h    |  31 ++
>  include/linux/rmap.h            |   6 +
>  mm/memory.c                     | 877 ++++++++++++++++++++++++++++----
>  mm/mmap.c                       |   4 +-
>  mm/rmap.c                       | 147 +++++-
>  14 files changed, 1000 insertions(+), 133 deletions(-)
> 
> --
> 2.25.1
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 PATCH 16/17] mm: mmap: Align unhinted maps to highest anon folio order
  2023-04-14 13:03 ` [RFC v2 PATCH 16/17] mm: mmap: Align unhinted maps to highest anon folio order Ryan Roberts
@ 2023-04-17  8:25   ` Yin, Fengwei
  2023-04-17 10:13     ` Ryan Roberts
  0 siblings, 1 reply; 44+ messages in thread
From: Yin, Fengwei @ 2023-04-17  8:25 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao
  Cc: linux-mm, linux-arm-kernel



On 4/14/2023 9:03 PM, Ryan Roberts wrote:
> When allocating large anonymous folios, we want to maximize our chances
> of being able to use the highest order we support. Since one of the
> constraints is that a folio has to be mapped naturally aligned, let's
> have mmap default to that alignment when user space does not provide a
> hint.
> 
> With this in place, an extra 2% of all allocated anonymous memory
> belongs to a folio of the highest order, when compiling the kernel.
This change has a side effect: it reduces the chance of VMA merging. That
could benefit the per-VMA lock, but VMA lookup may need to search more VMAs.


Regards
Yin, Fengwei

> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  mm/mmap.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index ff68a67a2a7c..e7652001a32e 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1627,7 +1627,7 @@ generic_get_unmapped_area(struct file *filp, unsigned long addr,
>  	info.length = len;
>  	info.low_limit = mm->mmap_base;
>  	info.high_limit = mmap_end;
> -	info.align_mask = 0;
> +	info.align_mask = BIT(PAGE_SHIFT + ANON_FOLIO_ORDER_MAX) - 1;
>  	info.align_offset = 0;
>  	return vm_unmapped_area(&info);
>  }
> @@ -1677,7 +1677,7 @@ generic_get_unmapped_area_topdown(struct file *filp, unsigned long addr,
>  	info.length = len;
>  	info.low_limit = max(PAGE_SIZE, mmap_min_addr);
>  	info.high_limit = arch_get_mmap_base(addr, mm->mmap_base);
> -	info.align_mask = 0;
> +	info.align_mask = BIT(PAGE_SHIFT + ANON_FOLIO_ORDER_MAX) - 1;
>  	info.align_offset = 0;
>  	addr = vm_unmapped_area(&info);
> 
> --
> 2.25.1
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 PATCH 03/17] mm: Introduce try_vma_alloc_movable_folio()
  2023-04-14 13:02 ` [RFC v2 PATCH 03/17] mm: Introduce try_vma_alloc_movable_folio() Ryan Roberts
@ 2023-04-17  8:49   ` Yin, Fengwei
  2023-04-17 10:11     ` Ryan Roberts
  0 siblings, 1 reply; 44+ messages in thread
From: Yin, Fengwei @ 2023-04-17  8:49 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao
  Cc: linux-mm, linux-arm-kernel



On 4/14/2023 9:02 PM, Ryan Roberts wrote:
> Opportunistically attempt to allocate high-order folios in highmem,
> optionally zeroed. Retry with lower orders all the way to order-0, until
> success. Although, of note, order-1 allocations are skipped since a
> large folio must be at least order-2 to work with the THP machinery. The
> user must check what they got with folio_order().
> 
> This will be used to opportunistically allocate large folios for
> anonymous memory with a sensible fallback under memory pressure.
> 
> For attempts to allocate non-0 orders, we set __GFP_NORETRY to prevent
> high latency due to reclaim, instead preferring to just try for a lower
> order. The same approach is used by the readahead code when allocating
> large folios.
I am not sure whether anonymous pages can share the same approach as the page
cache. The latency of a new page cache page is dominated by IO, so it may not
be a big deal to retry with different orders a few times.

Retrying too many times could add latency to anonymous page allocation.

Regards
Yin, Fengwei

> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  mm/memory.c | 33 +++++++++++++++++++++++++++++++++
>  1 file changed, 33 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 9d5e8be49f3b..ca32f59acef2 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2989,6 +2989,39 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
>  	return 0;
>  }
> 
> +static inline struct folio *vma_alloc_movable_folio(struct vm_area_struct *vma,
> +				unsigned long vaddr, int order, bool zeroed)
> +{
> +	gfp_t gfp = order > 0 ? __GFP_NORETRY | __GFP_NOWARN : 0;
> +
> +	if (zeroed)
> +		return vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order);
> +	else
> +		return vma_alloc_folio(GFP_HIGHUSER_MOVABLE | gfp, order, vma,
> +								vaddr, false);
> +}
> +
> +/*
> + * Opportunistically attempt to allocate high-order folios, retrying with lower
> + * orders all the way to order-0, until success. order-1 allocations are skipped
> + * since a folio must be at least order-2 to work with the THP machinery. The
> + * user must check what they got with folio_order(). vaddr can be any virtual
> + * address that will be mapped by the allocated folio.
> + */
> +static struct folio *try_vma_alloc_movable_folio(struct vm_area_struct *vma,
> +				unsigned long vaddr, int order, bool zeroed)
> +{
> +	struct folio *folio;
> +
> +	for (; order > 1; order--) {
> +		folio = vma_alloc_movable_folio(vma, vaddr, order, zeroed);
> +		if (folio)
> +			return folio;
> +	}
> +
> +	return vma_alloc_movable_folio(vma, vaddr, 0, zeroed);
> +}
> +
>  /*
>   * Handle write page faults for pages that can be reused in the current vma
>   *
> --
> 2.25.1
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 PATCH 03/17] mm: Introduce try_vma_alloc_movable_folio()
  2023-04-17  8:49   ` Yin, Fengwei
@ 2023-04-17 10:11     ` Ryan Roberts
  0 siblings, 0 replies; 44+ messages in thread
From: Ryan Roberts @ 2023-04-17 10:11 UTC (permalink / raw)
  To: Yin, Fengwei, Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao
  Cc: linux-mm, linux-arm-kernel

On 17/04/2023 09:49, Yin, Fengwei wrote:
> 
> 
> On 4/14/2023 9:02 PM, Ryan Roberts wrote:
>> Opportunistically attempt to allocate high-order folios in highmem,
>> optionally zeroed. Retry with lower orders all the way to order-0, until
>> success. Although, of note, order-1 allocations are skipped since a
>> large folio must be at least order-2 to work with the THP machinery. The
>> user must check what they got with folio_order().
>>
>> This will be used to opportunistically allocate large folios for
>> anonymous memory with a sensible fallback under memory pressure.
>>
>> For attempts to allocate non-0 orders, we set __GFP_NORETRY to prevent
>> high latency due to reclaim, instead preferring to just try for a lower
>> order. The same approach is used by the readahead code when allocating
>> large folios.
> I am not sure whether anonymous pages can share the same approach as the page
> cache. The latency of a new page cache page is dominated by IO, so it may not
> be a big deal to retry with different orders a few times.
> 
> Retrying too many times could add latency to anonymous page allocation.

Perhaps I'm better off just using vma_thp_gfp_mask(), or at least taking
inspiration from it?
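
For illustration, "taking inspiration from it" might look something like the
sketch below; anon_folio_gfp() is a hypothetical helper and the exact flag
choices are an assumption, not code from the series (vma_thp_gfp_mask() itself
is currently static to mm/huge_memory.c):

static inline gfp_t anon_folio_gfp(struct vm_area_struct *vma, int order)
{
	gfp_t gfp = GFP_HIGHUSER_MOVABLE;

	if (order == 0)
		return gfp;

	/* Never warn for opportunistic high-order attempts. */
	gfp |= __GFP_NOWARN;

	/*
	 * In the spirit of vma_thp_gfp_mask(): only pay for direct reclaim
	 * on a high-order attempt when the VMA was madvised for huge pages;
	 * otherwise fail fast and fall back to a lower order.
	 */
	if (vma->vm_flags & VM_HUGEPAGE)
		return gfp | __GFP_NORETRY;

	return gfp & ~__GFP_DIRECT_RECLAIM;
}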

> 
> Regards
> Yin, Fengwei
> 
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  mm/memory.c | 33 +++++++++++++++++++++++++++++++++
>>  1 file changed, 33 insertions(+)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 9d5e8be49f3b..ca32f59acef2 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -2989,6 +2989,39 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
>>  	return 0;
>>  }
>>
>> +static inline struct folio *vma_alloc_movable_folio(struct vm_area_struct *vma,
>> +				unsigned long vaddr, int order, bool zeroed)
>> +{
>> +	gfp_t gfp = order > 0 ? __GFP_NORETRY | __GFP_NOWARN : 0;
>> +
>> +	if (zeroed)
>> +		return vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order);
>> +	else
>> +		return vma_alloc_folio(GFP_HIGHUSER_MOVABLE | gfp, order, vma,
>> +								vaddr, false);
>> +}
>> +
>> +/*
>> + * Opportunistically attempt to allocate high-order folios, retrying with lower
>> + * orders all the way to order-0, until success. order-1 allocations are skipped
>> + * since a folio must be at least order-2 to work with the THP machinery. The
>> + * user must check what they got with folio_order(). vaddr can be any virtual
>> + * address that will be mapped by the allocated folio.
>> + */
>> +static struct folio *try_vma_alloc_movable_folio(struct vm_area_struct *vma,
>> +				unsigned long vaddr, int order, bool zeroed)
>> +{
>> +	struct folio *folio;
>> +
>> +	for (; order > 1; order--) {
>> +		folio = vma_alloc_movable_folio(vma, vaddr, order, zeroed);
>> +		if (folio)
>> +			return folio;
>> +	}
>> +
>> +	return vma_alloc_movable_folio(vma, vaddr, 0, zeroed);
>> +}
>> +
>>  /*
>>   * Handle write page faults for pages that can be reused in the current vma
>>   *
>> --
>> 2.25.1
>>



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 PATCH 16/17] mm: mmap: Align unhinted maps to highest anon folio order
  2023-04-17  8:25   ` Yin, Fengwei
@ 2023-04-17 10:13     ` Ryan Roberts
  0 siblings, 0 replies; 44+ messages in thread
From: Ryan Roberts @ 2023-04-17 10:13 UTC (permalink / raw)
  To: Yin, Fengwei, Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao
  Cc: linux-mm, linux-arm-kernel

On 17/04/2023 09:25, Yin, Fengwei wrote:
> 
> 
> On 4/14/2023 9:03 PM, Ryan Roberts wrote:
>> When allocating large anonymous folios, we want to maximize our chances
>> of being able to use the highest order we support. Since one of the
>> constraints is that a folio has to be mapped naturally aligned, let's
>> have mmap default to that alignment when user space does not provide a
>> hint.
>>
>> With this in place, an extra 2% of all allocated anonymous memory
>> belongs to a folio of the highest order, when compiling the kernel.
> This change has a side effect: it reduces the chance of VMA merging. That
> could benefit the per-VMA lock, but VMA lookup may need to search more VMAs.

Good point. This change brings only a very marginal benefit anyway, so I think I
might just drop it from the series to avoid any unexpected issues.

> 
> 
> Regards
> Yin, Fengwei
> 
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  mm/mmap.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index ff68a67a2a7c..e7652001a32e 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -1627,7 +1627,7 @@ generic_get_unmapped_area(struct file *filp, unsigned long addr,
>>  	info.length = len;
>>  	info.low_limit = mm->mmap_base;
>>  	info.high_limit = mmap_end;
>> -	info.align_mask = 0;
>> +	info.align_mask = BIT(PAGE_SHIFT + ANON_FOLIO_ORDER_MAX) - 1;
>>  	info.align_offset = 0;
>>  	return vm_unmapped_area(&info);
>>  }
>> @@ -1677,7 +1677,7 @@ generic_get_unmapped_area_topdown(struct file *filp, unsigned long addr,
>>  	info.length = len;
>>  	info.low_limit = max(PAGE_SIZE, mmap_min_addr);
>>  	info.high_limit = arch_get_mmap_base(addr, mm->mmap_base);
>> -	info.align_mask = 0;
>> +	info.align_mask = BIT(PAGE_SHIFT + ANON_FOLIO_ORDER_MAX) - 1;
>>  	info.align_offset = 0;
>>  	addr = vm_unmapped_area(&info);
>>
>> --
>> 2.25.1
>>



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory
  2023-04-17  8:04 ` [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Yin, Fengwei
@ 2023-04-17 10:19   ` Ryan Roberts
  0 siblings, 0 replies; 44+ messages in thread
From: Ryan Roberts @ 2023-04-17 10:19 UTC (permalink / raw)
  To: Yin, Fengwei, Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao
  Cc: linux-mm, linux-arm-kernel

On 17/04/2023 09:04, Yin, Fengwei wrote:
> 
> 
> On 4/14/2023 9:02 PM, Ryan Roberts wrote:
>> Hi All,
>>
>> This is a second RFC and my first proper attempt at implementing variable order,
>> large folios for anonymous memory. The first RFC [1], was a partial
>> implementation and a plea for help in debugging an issue I was hitting; thanks
>> to Yin Fengwei and Matthew Wilcox for their advice in solving that!
>>
>> The objective of variable order anonymous folios is to improve performance by
>> allocating larger chunks of memory during anonymous page faults:
>>
>>  - Since SW (the kernel) is dealing with larger chunks of memory than base
>>    pages, there are efficiency savings to be had; fewer page faults, batched PTE
>>    and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
>>    overhead. This should benefit all architectures.
>>  - Since we are now mapping physically contiguous chunks of memory, we can take
>>    advantage of HW TLB compression techniques. A reduction in TLB pressure
>>    speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>>    TLB entries; "the contiguous bit" (architectural) and HPA (uarch) - see [2].
>>
>> This patch set deals with the SW side of things only but sets us up nicely for
>> taking advantage of the HW improvements in the near future.
>>
>> I'm not yet benchmarking a wide variety of use cases, but those that I have
>> looked at are positive; I see kernel compilation time improved by up to 10%,
>> which I expect to improve further once I add in the arm64 "contiguous bit".
>> Memory consumption is somewhere between 1% less and 2% more, depending on how
>> its measured. More on perf and memory below.
>>
>> The patches are based on v6.3-rc6 + patches 1-31 of [3] (which needed one minor
>> conflict resolution). I have a tree at [4].
>>
>> [1] https://lore.kernel.org/linux-mm/20230317105802.2634004-1-ryan.roberts@arm.com/
>> [2] https://lore.kernel.org/linux-mm/d347c5b0-0c0f-ae50-9613-2cf962d8676e@arm.com/
>> [3] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
>> [4] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc2
>>
>> Approach
>> ========
>>
>> There are 4 fault paths that have been modified:
>>  - write fault on unallocated address: do_anonymous_page()
>>  - write fault on zero page: wp_page_copy()
>>  - write fault on non-exclusive CoW page: wp_page_copy()
>>  - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()
>>
>> In the first 2 cases, we will determine the preferred order folio to allocate,
>> limited by a max order (currently order-4; see below), VMA and PMD bounds, and
>> state of neighboring PTEs. In the 3rd case, we aim to allocate the same order
>> folio as the source, subject to constraints that may arise if the source has
>> been mremapped or partially munmapped. And in the 4th case, we reuse as much of
>> the folio as we can, subject to the same mremap/munmap constraints.
>>
>> If allocation of our preferred folio order fails, we gracefully fall back to
>> lower orders all the way to 0.
>>
>> Note that none of this affects the behavior of traditional PMD-sized THP. If we
>> take a fault in an MADV_HUGEPAGE region, you still get PMD-sized mappings.
>>
>> Open Questions
>> ==============
>>
>> How to Move Forwards
>> --------------------
>>
>> While the series is a small-ish code change, it represents a big shift in the
>> way things are done. So I'd appreciate any help in scaling up performance
>> testing, review and general advice on how best to guide a change like this into
>> the kernel.
>>
>> Folio Allocation Order Policy
>> -----------------------------
>>
>> The current code is hardcoded to use a maximum order of 4. This was chosen for a
>> couple of reasons:
>>  - From the SW performance perspective, I see a knee around here where
>>    increasing it doesn't lead to much more performance gain.
>>  - Intuitively I assume that higher orders become increasingly difficult to
>>    allocate.
>>  - From the HW performance perspective, arm64's HPA works on order-2 blocks and
>>    "the contiguous bit" works on order-4 for 4KB base pages (although it's
>>    order-7 for 16KB and order-5 for 64KB), so there is no HW benefit to going
>>    any higher.
>>
>> I suggest that ultimately setting the max order should be left to the
>> architecture. arm64 would take advantage of this and set it to the order
>> required for the contiguous bit for the configured base page size.
>>
>> However, I also have a (mild) concern about increased memory consumption. If an
>> app has a pathological fault pattern (e.g. sparsely touches memory every 64KB)
>> we would end up allocating 16x as much memory as we used to. One potential
>> approach I see here is to track fault addresses per-VMA, and increase a per-VMA
>> max allocation order for consecutive faults that extend a contiguous range, and
>> decrement when discontiguous. Alternatively/additionally, we could use the VMA
>> size as an indicator. I'd be interested in your thoughts/opinions.
>>
>> Deferred Split Queue Lock Contention
>> ------------------------------------
>>
>> The results below show that we are spending a much greater proportion of time in
>> the kernel when doing a kernel compile using 160 CPUs vs 8 CPUs.
>>
>> I think this is (at least partially) related to contention on the deferred
>> split queue lock. This is a per-memcg spinlock, which means a single spinlock
>> shared among all 160 CPUs. I've solved part of the problem with the last patch
>> in the series (which cuts down the need to take the lock), but at folio free
>> time (free_transhuge_page()), the lock is still taken and I think this could be
>> a problem. Now that most anonymous pages are large folios, this lock is taken a
>> lot more.
>>
>> I think we could probably avoid taking the lock unless !list_empty(), but I
>> haven't convinced myself it's definitely safe, so haven't applied it yet.
> Yes. It's safe. We also identified other lock contention with large folios
> for anonymous mappings, such as the lru lock and the zone lock. My
> understanding is that anonymous pages have a much higher alloc/free frequency
> than page cache pages, so this lock contention was not exposed by large
> folios for the page cache.
> 
> 
> I posted the related patch to:
> https://lore.kernel.org/linux-mm/20230417075643.3287513-1-fengwei.yin@intel.com/T/#t

Thanks for posting! I added these patches on top of mine and reran my perf test
for -j160. Results as follows:

|           | baseline time  | anonfolio time | percent change |
|           | to compile (s) | to compile (s) | SMALLER=better |
|-----------|---------------:|---------------:|---------------:|
| real-time |           53.7 |           49.1 |          -8.4% |
| user-time |         2705.8 |         2939.1 |           8.6% |
| sys-time  |         1370.4 |          993.4 |         -27.5% |

So comparing to the original table below, it's definitely reduced the time spent
in the kernel, but time in user space has increased, which is unexpected! I
think this is something I will have to dig into more deeply, but I don't have
access to the HW for the next couple of weeks so will come back to it then.



> 
> 
> Regards
> Yin, Fengwei
> 
> 
>>
>> Roadmap
>> =======
>>
>> Beyond scaling up perf testing, I'm planning to enable use of the "contiguous
>> bit" on arm64 to validate predictions about HW speedups.
>>
>> I also think there are some opportunities with madvise to split folios to non-0
>> orders, which might improve performance in some cases. madvise is also mistaking
>> exclusive large folios for non-exclusive ones at the moment (due to the "small
>> pages" mapcount scheme), so that needs to be fixed so that MADV_FREE correctly
>> frees the folio.
>>
>> Results
>> =======
>>
>> Performance
>> -----------
>>
>> Test: Kernel Compilation, on Ampere Altra (160 CPU machine), with 8 jobs and
>> with 160 jobs. First run discarded, next 3 runs averaged. Git repo cleaned
>> before each run.
>>
>> make defconfig && time make -jN Image
>>
>> First with -j8:
>>
>> |           | baseline time  | anonfolio time | percent change |
>> |           | to compile (s) | to compile (s) | SMALLER=better |
>> |-----------|---------------:|---------------:|---------------:|
>> | real-time |          373.0 |          342.8 |          -8.1% |
>> | user-time |         2333.9 |         2275.3 |          -2.5% |
>> | sys-time  |          510.7 |          340.9 |         -33.3% |
>>
>> Above shows 8.1% improvement in real time execution, and 33.3% saving in kernel
>> execution. The next 2 tables show a breakdown of the cycles spent in the kernel
>> for the 8 job config:
>>
>> |                      | baseline | anonfolio | percent change |
>> |                      | (cycles) | (cycles)  | SMALLER=better |
>> |----------------------|---------:|----------:|---------------:|
>> | data abort           |     683B |      316B |         -53.8% |
>> | instruction abort    |      93B |       76B |         -18.4% |
>> | syscall              |     887B |      767B |         -13.6% |
>>
>> |                      | baseline | anonfolio | percent change |
>> |                      | (cycles) | (cycles)  | SMALLER=better |
>> |----------------------|---------:|----------:|---------------:|
>> | arm64_sys_openat     |     194B |      188B |          -3.3% |
>> | arm64_sys_exit_group |     192B |      124B |         -35.7% |
>> | arm64_sys_read       |     124B |      108B |         -12.7% |
>> | arm64_sys_execve     |      75B |       67B |         -11.0% |
>> | arm64_sys_mmap       |      51B |       50B |          -3.0% |
>> | arm64_sys_mprotect   |      15B |       13B |         -12.0% |
>> | arm64_sys_write      |      43B |       42B |          -2.9% |
>> | arm64_sys_munmap     |      15B |       12B |         -17.0% |
>> | arm64_sys_newfstatat |      46B |       41B |          -9.7% |
>> | arm64_sys_clone      |      26B |       24B |         -10.0% |
>>
>> And now with -j160:
>>
>> |           | baseline time  | anonfolio time | percent change |
>> |           | to compile (s) | to compile (s) | SMALLER=better |
>> |-----------|---------------:|---------------:|---------------:|
>> | real-time |           53.7 |           48.2 |         -10.2% |
>> | user-time |         2705.8 |         2842.1 |           5.0% |
>> | sys-time  |         1370.4 |         1064.3 |         -22.3% |
>>
>> Above shows a 10.2% improvement in real time execution. But ~3x more time is
>> spent in the kernel than for the -j8 config. I think this is related to the lock
>> contention issue I highlighted above, but haven't bottomed it out yet. It's also
>> not yet clear to me why user-time increases by 5%.
>>
>> I've also run all the will-it-scale microbenchmarks for a single task, using the
>> process mode. Results for multiple runs on the same kernel are noisy - I see ~5%
>> fluctuation. So I'm just calling out tests with results that have gt 5%
>> improvement or lt -5% regression. Results are average of 3 runs. Only 2 tests
>> are regressed:
>>
>> | benchmark            | baseline | anonfolio | percent change |
>> |                      | ops/s    | ops/s     | BIGGER=better  |
>> | ---------------------|---------:|----------:|---------------:|
>> | context_switch1.csv  |   328744 |    351150 |          6.8%  |
>> | malloc1.csv          |    96214 |     50890 |        -47.1%  |
>> | mmap1.csv            |   410253 |    375746 |         -8.4%  |
>> | page_fault1.csv      |   624061 |   3185678 |        410.5%  |
>> | page_fault2.csv      |   416483 |    557448 |         33.8%  |
>> | page_fault3.csv      |   724566 |   1152726 |         59.1%  |
>> | read1.csv            |  1806908 |   1905752 |          5.5%  |
>> | read2.csv            |   587722 |   1942062 |        230.4%  |
>> | tlb_flush1.csv       |   143910 |    152097 |          5.7%  |
>> | tlb_flush2.csv       |   266763 |    322320 |         20.8%  |
>>
>> I believe malloc1 is an unrealistic test, since it does malloc/free for 128M
>> object in a loop and never touches the allocated memory. I think the malloc
>> implementation is maintaining a header just before the allocated object, which
>> causes a single page fault. Previously that page fault allocated 1 page. Now it
>> is allocating 16 pages. This cost would be repaid if the test code wrote to the
>> allocated object. Alternatively the folio allocation order policy described
>> above would also solve this.
>>
>> It is not clear to me why mmap1 has slowed down. This remains a todo.
>>
>> Memory
>> ------
>>
>> I measured memory consumption while doing a kernel compile with 8 jobs on a
>> system limited to 4GB RAM. I polled /proc/meminfo every 0.5 seconds during the
>> workload, then calculated "memory used" high and low watermarks using both
>> MemFree and MemAvailable. If there is a better way of measuring system memory
>> consumption, please let me know!
>>
>> mem-used = 4GB - /proc/meminfo:MemFree
>>
>> |                      | baseline | anonfolio | percent change |
>> |                      | (MB)     | (MB)      | SMALLER=better |
>> |----------------------|---------:|----------:|---------------:|
>> | mem-used-low         |      825 |       842 |           2.1% |
>> | mem-used-high        |     2697 |      2672 |          -0.9% |
>>
>> mem-used = 4GB - /proc/meminfo:MemAvailable
>>
>> |                      | baseline | anonfolio | percent change |
>> |                      | (MB)     | (MB)      | SMALLER=better |
>> |----------------------|---------:|----------:|---------------:|
>> | mem-used-low         |      518 |       530 |           2.3% |
>> | mem-used-high        |     1522 |      1537 |           1.0% |
>>
>> For the high watermark, the methods disagree; we are either saving 1% or using
>> 1% more. For the low watermark, both methods agree that we are using about 2%
>> more. I plan to investigate whether the proposed folio allocation order policy
>> can reduce this to zero.
>>
>> Thanks for making it this far!
>> Ryan
>>
>>
>> Ryan Roberts (17):
>>   mm: Expose clear_huge_page() unconditionally
>>   mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
>>   mm: Introduce try_vma_alloc_movable_folio()
>>   mm: Implement folio_add_new_anon_rmap_range()
>>   mm: Routines to determine max anon folio allocation order
>>   mm: Allocate large folios for anonymous memory
>>   mm: Allow deferred splitting of arbitrary large anon folios
>>   mm: Implement folio_move_anon_rmap_range()
>>   mm: Update wp_page_reuse() to operate on range of pages
>>   mm: Reuse large folios for anonymous memory
>>   mm: Split __wp_page_copy_user() into 2 variants
>>   mm: ptep_clear_flush_range_notify() macro for batch operation
>>   mm: Implement folio_remove_rmap_range()
>>   mm: Copy large folios for anonymous memory
>>   mm: Convert zero page to large folios on write
>>   mm: mmap: Align unhinted maps to highest anon folio order
>>   mm: Batch-zap large anonymous folio PTE mappings
>>
>>  arch/alpha/include/asm/page.h   |   5 +-
>>  arch/arm64/include/asm/page.h   |   3 +-
>>  arch/arm64/mm/fault.c           |   7 +-
>>  arch/ia64/include/asm/page.h    |   5 +-
>>  arch/m68k/include/asm/page_no.h |   7 +-
>>  arch/s390/include/asm/page.h    |   5 +-
>>  arch/x86/include/asm/page.h     |   5 +-
>>  include/linux/highmem.h         |  23 +-
>>  include/linux/mm.h              |   8 +-
>>  include/linux/mmu_notifier.h    |  31 ++
>>  include/linux/rmap.h            |   6 +
>>  mm/memory.c                     | 877 ++++++++++++++++++++++++++++----
>>  mm/mmap.c                       |   4 +-
>>  mm/rmap.c                       | 147 +++++-
>>  14 files changed, 1000 insertions(+), 133 deletions(-)
>>
>> --
>> 2.25.1
>>



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory
  2023-04-17  8:19 ` Yin, Fengwei
@ 2023-04-17 10:28   ` Ryan Roberts
  0 siblings, 0 replies; 44+ messages in thread
From: Ryan Roberts @ 2023-04-17 10:28 UTC (permalink / raw)
  To: Yin, Fengwei, Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao
  Cc: linux-mm, linux-arm-kernel

On 17/04/2023 09:19, Yin, Fengwei wrote:
> 
> 
> On 4/14/2023 9:02 PM, Ryan Roberts wrote:
>> Hi All,
>>
>> This is a second RFC and my first proper attempt at implementing variable order,
>> large folios for anonymous memory. The first RFC [1], was a partial
>> implementation and a plea for help in debugging an issue I was hitting; thanks
>> to Yin Fengwei and Matthew Wilcox for their advice in solving that!
>>
>> The objective of variable order anonymous folios is to improve performance by
>> allocating larger chunks of memory during anonymous page faults:
>>
>>  - Since SW (the kernel) is dealing with larger chunks of memory than base
>>    pages, there are efficiency savings to be had; fewer page faults, batched PTE
>>    and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
>>    overhead. This should benefit all architectures.
>>  - Since we are now mapping physically contiguous chunks of memory, we can take
>>    advantage of HW TLB compression techniques. A reduction in TLB pressure
>>    speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>>    TLB entries; "the contiguous bit" (architectural) and HPA (uarch) - see [2].
>>
>> This patch set deals with the SW side of things only but sets us up nicely for
>> taking advantage of the HW improvements in the near future.
>>
>> I'm not yet benchmarking a wide variety of use cases, but those that I have
>> looked at are positive; I see kernel compilation time improved by up to 10%,
>> which I expect to improve further once I add in the arm64 "contiguous bit".
>> Memory consumption is somewhere between 1% less and 2% more, depending on how
>> its measured. More on perf and memory below.
>>
>> The patches are based on v6.3-rc6 + patches 1-31 of [3] (which needed one minor
>> conflict resolution). I have a tree at [4].
>>
>> [1] https://lore.kernel.org/linux-mm/20230317105802.2634004-1-ryan.roberts@arm.com/
>> [2] https://lore.kernel.org/linux-mm/d347c5b0-0c0f-ae50-9613-2cf962d8676e@arm.com/
>> [3] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
>> [4] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc2
>>
>> Approach
>> ========
>>
>> There are 4 fault paths that have been modified:
>>  - write fault on unallocated address: do_anonymous_page()
>>  - write fault on zero page: wp_page_copy()
>>  - write fault on non-exclusive CoW page: wp_page_copy()
>>  - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()
>>
>> In the first 2 cases, we will determine the preferred order folio to allocate,
>> limited by a max order (currently order-4; see below), VMA and PMD bounds, and
>> state of neighboring PTEs. In the 3rd case, we aim to allocate the same order
>> folio as the source, subject to constraints that may arise if the source has
>> been mremapped or partially munmapped. And in the 4th case, we reuse as much of
>> the folio as we can, subject to the same mremap/munmap constraints.
>>
>> If allocation of our preferred folio order fails, we gracefully fall back to
>> lower orders all the way to 0.
>>
>> Note that none of this affects the behavior of traditional PMD-sized THP. If we
>> take a fault in an MADV_HUGEPAGE region, you still get PMD-sized mappings.
>>
>> Open Questions
>> ==============
>>
>> How to Move Forwards
>> --------------------
>>
>> While the series is a small-ish code change, it represents a big shift in the
>> way things are done. So I'd appreciate any help in scaling up performance
>> testing, review and general advice on how best to guide a change like this into
>> the kernel.
>>
>> Folio Allocation Order Policy
>> -----------------------------
>>
>> The current code is hardcoded to use a maximum order of 4. This was chosen for a
>> couple of reasons:
>>  - From the SW performance perspective, I see a knee around here where
>>    increasing it doesn't lead to much more performance gain.
>>  - Intuitively I assume that higher orders become increasingly difficult to
>>    allocate.
>>  - From the HW performance perspective, arm64's HPA works on order-2 blocks and
>>    "the contiguous bit" works on order-4 for 4KB base pages (although it's
>>    order-7 for 16KB and order-5 for 64KB), so there is no HW benefit to going
>>    any higher.
>>
>> I suggest that ultimately setting the max order should be left to the
>> architecture. arm64 would take advantage of this and set it to the order
>> required for the contiguous bit for the configured base page size.
>>
>> However, I also have a (mild) concern about increased memory consumption. If an
>> app has a pathological fault pattern (e.g. sparsely touches memory every 64KB)
>> we would end up allocating 16x as much memory as we used to. One potential
>> approach I see here is to track fault addresses per-VMA, and increase a per-VMA
>> max allocation order for consecutive faults that extend a contiguous range, and
>> decrement when discontiguous. Alternatively/additionally, we could use the VMA
>> size as an indicator. I'd be interested in your thoughts/opinions.
>>
>> Deferred Split Queue Lock Contention
>> ------------------------------------
>>
>> The results below show that we are spending a much greater proportion of time in
>> the kernel when doing a kernel compile using 160 CPUs vs 8 CPUs.
>>
>> I think this is (at least partially) related for contention on the deferred
>> split queue lock. This is a per-memcg spinlock, which means a single spinlock
>> shared among all 160 CPUs. I've solved part of the problem with the last patch
>> in the series (which cuts down the need to take the lock), but at folio free
>> time (free_transhuge_page()), the lock is still taken and I think this could be
>> a problem. Now that most anonymous pages are large folios, this lock is taken a
>> lot more.
>>
>> I think we could probably avoid taking the lock unless !list_empty(), but I
>> haven't convinced myself its definitely safe, so haven't applied it yet.
>>
>> Roadmap
>> =======
>>
>> Beyond scaling up perf testing, I'm planning to enable use of the "contiguous
>> bit" on arm64 to validate predictions about HW speedups.
>>
>> I also think there are some opportunities with madvise to split folios to non-0
>> orders, which might improve performance in some cases. madvise is also mistaking
>> exclusive large folios for non-exclusive ones at the moment (due to the "small
>> pages" mapcount scheme), so that needs to be fixed so that MADV_FREE correctly
>> frees the folio.
>>
>> Results
>> =======
>>
>> Performance
>> -----------
>>
>> Test: Kernel Compilation, on Ampere Altra (160 CPU machine), with 8 jobs and
>> with 160 jobs. First run discarded, next 3 runs averaged. Git repo cleaned
>> before each run.
>>
>> make defconfig && time make -jN Image
>>
>> First with -j8:
>>
>> |           | baseline time  | anonfolio time | percent change |
>> |           | to compile (s) | to compile (s) | SMALLER=better |
>> |-----------|---------------:|---------------:|---------------:|
>> | real-time |          373.0 |          342.8 |          -8.1% |
>> | user-time |         2333.9 |         2275.3 |          -2.5% |
>> | sys-time  |          510.7 |          340.9 |         -33.3% |
>>
>> Above shows 8.1% improvement in real time execution, and 33.3% saving in kernel
>> execution. The next 2 tables show a breakdown of the cycles spent in the kernel
>> for the 8 job config:
>>
>> |                      | baseline | anonfolio | percent change |
>> |                      | (cycles) | (cycles)  | SMALLER=better |
>> |----------------------|---------:|----------:|---------------:|
>> | data abort           |     683B |      316B |         -53.8% |
>> | instruction abort    |      93B |       76B |         -18.4% |
>> | syscall              |     887B |      767B |         -13.6% |
>>
>> |                      | baseline | anonfolio | percent change |
>> |                      | (cycles) | (cycles)  | SMALLER=better |
>> |----------------------|---------:|----------:|---------------:|
>> | arm64_sys_openat     |     194B |      188B |          -3.3% |
>> | arm64_sys_exit_group |     192B |      124B |         -35.7% |
>> | arm64_sys_read       |     124B |      108B |         -12.7% |
>> | arm64_sys_execve     |      75B |       67B |         -11.0% |
>> | arm64_sys_mmap       |      51B |       50B |          -3.0% |
>> | arm64_sys_mprotect   |      15B |       13B |         -12.0% |
>> | arm64_sys_write      |      43B |       42B |          -2.9% |
>> | arm64_sys_munmap     |      15B |       12B |         -17.0% |
>> | arm64_sys_newfstatat |      46B |       41B |          -9.7% |
>> | arm64_sys_clone      |      26B |       24B |         -10.0% |
>>
>> And now with -j160:
>>
>> |           | baseline time  | anonfolio time | percent change |
>> |           | to compile (s) | to compile (s) | SMALLER=better |
>> |-----------|---------------:|---------------:|---------------:|
>> | real-time |           53.7 |           48.2 |         -10.2% |
>> | user-time |         2705.8 |         2842.1 |           5.0% |
>> | sys-time  |         1370.4 |         1064.3 |         -22.3% |
>>
>> Above shows a 10.2% improvement in real time execution. But ~3x more time is
>> spent in the kernel than for the -j8 config. I think this is related to the lock
>> contention issue I highlighted above, but haven't bottomed it out yet. It's also
>> not yet clear to me why user-time increases by 5%.
>>
>> I've also run all the will-it-scale microbenchmarks for a single task, using the
>> process mode. Results for multiple runs on the same kernel are noisy - I see ~5%
>> fluctuation. So I'm just calling out tests with results that have gt 5%
>> improvement or lt -5% regression. Results are average of 3 runs. Only 2 tests
>> are regressed:
>>
>> | benchmark            | baseline | anonfolio | percent change |
>> |                      | ops/s    | ops/s     | BIGGER=better  |
>> | ---------------------|---------:|----------:|---------------:|
>> | context_switch1.csv  |   328744 |    351150 |          6.8%  |
>> | malloc1.csv          |    96214 |     50890 |        -47.1%  |
>> | mmap1.csv            |   410253 |    375746 |         -8.4%  |
>> | page_fault1.csv      |   624061 |   3185678 |        410.5%  |
>> | page_fault2.csv      |   416483 |    557448 |         33.8%  |
>> | page_fault3.csv      |   724566 |   1152726 |         59.1%  |
>> | read1.csv            |  1806908 |   1905752 |          5.5%  |
>> | read2.csv            |   587722 |   1942062 |        230.4%  |
>> | tlb_flush1.csv       |   143910 |    152097 |          5.7%  |
>> | tlb_flush2.csv       |   266763 |    322320 |         20.8%  |
>>
>> I believe malloc1 is an unrealistic test, since it does malloc/free for 128M
>> object in a loop and never touches the allocated memory. I think the malloc
>> implementation is maintaining a header just before the allocated object, which
>> causes a single page fault. Previously that page fault allocated 1 page. Now it
>> is allocating 16 pages. This cost would be repaid if the test code wrote to the
>> allocated object. Alternatively the folio allocation order policy described
>> above would also solve this.
>>
>> It is not clear to me why mmap1 has slowed down. This remains a todo.
>>
>> Memory
>> ------
>>
>> I measured memory consumption while doing a kernel compile with 8 jobs on a
>> system limited to 4GB RAM. I polled /proc/meminfo every 0.5 seconds during the
>> workload, then calculated "memory used" high and low watermarks using both
>> MemFree and MemAvailable. If there is a better way of measuring system memory
>> consumption, please let me know!
>>
>> mem-used = 4GB - /proc/meminfo:MemFree
>>
>> |                      | baseline | anonfolio | percent change |
>> |                      | (MB)     | (MB)      | SMALLER=better |
>> |----------------------|---------:|----------:|---------------:|
>> | mem-used-low         |      825 |       842 |           2.1% |
>> | mem-used-high        |     2697 |      2672 |          -0.9% |
>>
>> mem-used = 4GB - /proc/meminfo:MemAvailable
>>
>> |                      | baseline | anonfolio | percent change |
>> |                      | (MB)     | (MB)      | SMALLER=better |
>> |----------------------|---------:|----------:|---------------:|
>> | mem-used-low         |      518 |       530 |           2.3% |
>> | mem-used-high        |     1522 |      1537 |           1.0% |
>>
>> For the high watermark, the methods disagree; we are either saving 1% or using
>> 1% more. For the low watermark, both methods agree that we are using about 2%
>> more. I plan to investigate whether the proposed folio allocation order policy
>> can reduce this to zero.
> 
> Besides the memory consumption, the large folio could impact the tail latency
> of page allocation also (extra zeroing memory, more operations on slow path of
> page allocation).

I agree; this series could be thought of as trading latency for throughput.
There are a couple of mitigations that try to limit the extra latency; on the
Altra at least, we are taking advantage of the CPU's streaming mode for the
memory zeroing - it's >2x faster to zero a 64K block than it is to zero 16x
4K. And currently I'm using __GFP_NORETRY for high-order folio allocations,
which I understood to mean we shouldn't suffer high allocation latency due to
reclaim, etc. Although, I know we are discussing whether this is correct in
the thread for patch 3.
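
For illustration, a minimal sketch of that allocation fallback, assuming the
gfp/order-taking vma_alloc_zeroed_movable_folio() from patch 2 (the helper
name, argument order and exact flag plumbing here are assumptions for the
sketch, not the code in the series):

/*
 * Sketch only: try the preferred high order with __GFP_NORETRY |
 * __GFP_NOWARN so that a failed large allocation does not trigger
 * heavy reclaim/compaction latency, then step down towards order-0,
 * which keeps the usual allocation behaviour.
 */
static struct folio *try_alloc_anon_folio(struct vm_area_struct *vma,
					  unsigned long addr, int order)
{
	struct folio *folio;

	for (; order > 0; order--) {
		folio = vma_alloc_zeroed_movable_folio(vma, addr,
					__GFP_NORETRY | __GFP_NOWARN, order);
		if (folio)
			return folio;
	}

	/* Order-0 fallback: no extra flags, normal reclaim behaviour. */
	return vma_alloc_zeroed_movable_folio(vma, addr, 0, 0);
}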

> 
> Again, my understanding is the tail latency of page allocation is more important
> to anonymous page than page cache page because anonymous page is allocated/freed
> more frequently.

I had assumed that any serious application that needs to guarantee no latency
due to page faults would preallocate and lock all memory during init? Perhaps
that's wishful thinking?
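
(For reference, the usual init-time pattern would be something like the
following - plain userspace C, nothing specific to this series:)

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

/* Pre-fault and lock a working buffer so the hot path never faults. */
static void *prealloc_locked(size_t size)
{
	void *buf = aligned_alloc(4096, size);

	if (!buf)
		return NULL;
	memset(buf, 0, size);			/* touch every page up front */
	mlockall(MCL_CURRENT | MCL_FUTURE);	/* keep everything resident */
	return buf;
}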

> 
> 
> Regards
> Yin, Fengwei
> 
>>
>> Thanks for making it this far!
>> Ryan
>>
>>
>> Ryan Roberts (17):
>>   mm: Expose clear_huge_page() unconditionally
>>   mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
>>   mm: Introduce try_vma_alloc_movable_folio()
>>   mm: Implement folio_add_new_anon_rmap_range()
>>   mm: Routines to determine max anon folio allocation order
>>   mm: Allocate large folios for anonymous memory
>>   mm: Allow deferred splitting of arbitrary large anon folios
>>   mm: Implement folio_move_anon_rmap_range()
>>   mm: Update wp_page_reuse() to operate on range of pages
>>   mm: Reuse large folios for anonymous memory
>>   mm: Split __wp_page_copy_user() into 2 variants
>>   mm: ptep_clear_flush_range_notify() macro for batch operation
>>   mm: Implement folio_remove_rmap_range()
>>   mm: Copy large folios for anonymous memory
>>   mm: Convert zero page to large folios on write
>>   mm: mmap: Align unhinted maps to highest anon folio order
>>   mm: Batch-zap large anonymous folio PTE mappings
>>
>>  arch/alpha/include/asm/page.h   |   5 +-
>>  arch/arm64/include/asm/page.h   |   3 +-
>>  arch/arm64/mm/fault.c           |   7 +-
>>  arch/ia64/include/asm/page.h    |   5 +-
>>  arch/m68k/include/asm/page_no.h |   7 +-
>>  arch/s390/include/asm/page.h    |   5 +-
>>  arch/x86/include/asm/page.h     |   5 +-
>>  include/linux/highmem.h         |  23 +-
>>  include/linux/mm.h              |   8 +-
>>  include/linux/mmu_notifier.h    |  31 ++
>>  include/linux/rmap.h            |   6 +
>>  mm/memory.c                     | 877 ++++++++++++++++++++++++++++----
>>  mm/mmap.c                       |   4 +-
>>  mm/rmap.c                       | 147 +++++-
>>  14 files changed, 1000 insertions(+), 133 deletions(-)
>>
>> --
>> 2.25.1
>>



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory
  2023-04-14 13:02 [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Ryan Roberts
                   ` (18 preceding siblings ...)
  2023-04-17  8:19 ` Yin, Fengwei
@ 2023-04-17 10:54 ` David Hildenbrand
  2023-04-17 11:43   ` Ryan Roberts
  19 siblings, 1 reply; 44+ messages in thread
From: David Hildenbrand @ 2023-04-17 10:54 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
	Yu Zhao, Yin, Fengwei
  Cc: linux-mm, linux-arm-kernel

On 14.04.23 15:02, Ryan Roberts wrote:
> Hi All,
> 
> This is a second RFC and my first proper attempt at implementing variable order,
> large folios for anonymous memory. The first RFC [1], was a partial
> implementation and a plea for help in debugging an issue I was hitting; thanks
> to Yin Fengwei and Matthew Wilcox for their advice in solving that!
> 
> The objective of variable order anonymous folios is to improve performance by
> allocating larger chunks of memory during anonymous page faults:
> 
>   - Since SW (the kernel) is dealing with larger chunks of memory than base
>     pages, there are efficiency savings to be had; fewer page faults, batched PTE
>     and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
>     overhead. This should benefit all architectures.
>   - Since we are now mapping physically contiguous chunks of memory, we can take
>     advantage of HW TLB compression techniques. A reduction in TLB pressure
>     speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>     TLB entries; "the contiguous bit" (architectural) and HPA (uarch) - see [2].
> 
> This patch set deals with the SW side of things only but sets us up nicely for
> taking advantage of the HW improvements in the near future.
> 
> I'm not yet benchmarking a wide variety of use cases, but those that I have
> looked at are positive; I see kernel compilation time improved by up to 10%,
> which I expect to improve further once I add in the arm64 "contiguous bit".
> Memory consumption is somewhere between 1% less and 2% more, depending on how
> its measured. More on perf and memory below.
> 
> The patches are based on v6.3-rc6 + patches 1-31 of [3] (which needed one minor
> conflict resolution). I have a tree at [4].
> 
> [1] https://lore.kernel.org/linux-mm/20230317105802.2634004-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/linux-mm/d347c5b0-0c0f-ae50-9613-2cf962d8676e@arm.com/
> [3] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
> [4] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc2
> 
> Approach
> ========
> 
> There are 4 fault paths that have been modified:
>   - write fault on unallocated address: do_anonymous_page()
>   - write fault on zero page: wp_page_copy()
>   - write fault on non-exclusive CoW page: wp_page_copy()
>   - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()
> 
> In the first 2 cases, we will determine the preferred order folio to allocate,
> limited by a max order (currently order-4; see below), VMA and PMD bounds, and
> state of neighboring PTEs. In the 3rd case, we aim to allocate the same order
> folio as the source, subject to constraints that may arise if the source has
> been mremapped or partially munmapped. And in the 4th case, we reuse as much of
> the folio as we can, subject to the same mremap/munmap constraints.

Just a note (that you maybe already know) that we have to be a bit 
careful in the wp_copy path with replacing sub-pages that are marked 
exclusive.

Currently, we always only replace a single shared anon (sub)page by a 
fresh exclusive base-page during a write-fault/unsharing. As the 
sub-page is already marked "maybe shared", it cannot get pinned 
concurrently and everybody is happy.

If you now decide to replace more subpages, you have to be careful that 
none of them are still exclusive -- because they could get pinned 
concurrently and replacing them would result in memory corruptions.

There are scenarios (most prominently: MADV_WIPEONFORK), but also failed 
partial fork(), that could result in something like that.

Further, we have to be a bit careful regarding replacing ranges that are 
backed by different anon pages (for example, due to fork() deciding to 
copy some sub-pages of a PTE-mapped folio instead of sharing all sub-pages).


So what should be safe is replacing all sub-pages of a folio that are 
marked "maybe shared" by a new folio under PT lock. However, I wonder if 
it's really worth the complexity. For THP we were happy so far to *not* 
optimize this, implying that maybe we shouldn't worry about optimizing 
the fork() case for now that heavily.


One optimization one could think of instead (that I raised previously 
in other context) is the detection of exclusivity after fork()+exit in 
the child (IOW, only the parent continues to exist). Once 
PG_anon_exclusive was cleared for all sub-pages of the THP-mapped folio 
during fork(), we'd always decide to copy instead of reuse (because 
page_count() > 1, as the folio is PTE mapped). Scanning the surrounding 
page table if it makes sense (e.g., page_count() <= folio_nr_pages()), 
to test if all page references are from the current process would allow 
for reusing the folio (setting PG_anon_exclusive) for the sub-pages. The 
smaller the folio order, the cheaper this "scan surrounding PTEs" scan 
is. For THP, which are usually PMD-mapped even after fork()+exit, we 
didn't add this optimization.
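
Roughly, that scan could have the shape sketched below (the helper name and
parameters are assumptions, and locking, swap entries and transient references
are all ignored; illustrative only, not code from this series):

/*
 * Count how many of the nr_ptes PTEs map a sub-page of this folio; if
 * that accounts for every reference the folio has, all references come
 * from this process and the sub-pages could be marked exclusive again.
 */
static bool folio_refs_all_from_this_mm(struct folio *folio, pte_t *ptep,
					unsigned int nr_ptes)
{
	unsigned int i, mapped = 0;

	if (folio_ref_count(folio) > folio_nr_pages(folio))
		return false;		/* cheap filter, as described above */

	for (i = 0; i < nr_ptes; i++) {
		pte_t pte = ptep_get(ptep + i);

		if (pte_present(pte) &&
		    pte_pfn(pte) - folio_pfn(folio) < folio_nr_pages(folio))
			mapped++;
	}
	return mapped == folio_ref_count(folio);
}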

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory
  2023-04-17 10:54 ` David Hildenbrand
@ 2023-04-17 11:43   ` Ryan Roberts
  2023-04-17 14:05     ` David Hildenbrand
  0 siblings, 1 reply; 44+ messages in thread
From: Ryan Roberts @ 2023-04-17 11:43 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Matthew Wilcox (Oracle),
	Yu Zhao, Yin, Fengwei
  Cc: linux-mm, linux-arm-kernel

On 17/04/2023 11:54, David Hildenbrand wrote:
> On 14.04.23 15:02, Ryan Roberts wrote:
>> Hi All,
>>
>> This is a second RFC and my first proper attempt at implementing variable order,
>> large folios for anonymous memory. The first RFC [1], was a partial
>> implementation and a plea for help in debugging an issue I was hitting; thanks
>> to Yin Fengwei and Matthew Wilcox for their advice in solving that!
>>
>> The objective of variable order anonymous folios is to improve performance by
>> allocating larger chunks of memory during anonymous page faults:
>>
>>   - Since SW (the kernel) is dealing with larger chunks of memory than base
>>     pages, there are efficiency savings to be had; fewer page faults, batched PTE
>>     and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
>>     overhead. This should benefit all architectures.
>>   - Since we are now mapping physically contiguous chunks of memory, we can take
>>     advantage of HW TLB compression techniques. A reduction in TLB pressure
>>     speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>>     TLB entries; "the contiguous bit" (architectural) and HPA (uarch) - see [2].
>>
>> This patch set deals with the SW side of things only but sets us up nicely for
>> taking advantage of the HW improvements in the near future.
>>
>> I'm not yet benchmarking a wide variety of use cases, but those that I have
>> looked at are positive; I see kernel compilation time improved by up to 10%,
>> which I expect to improve further once I add in the arm64 "contiguous bit".
>> Memory consumption is somewhere between 1% less and 2% more, depending on how
>> its measured. More on perf and memory below.
>>
>> The patches are based on v6.3-rc6 + patches 1-31 of [3] (which needed one minor
>> conflict resolution). I have a tree at [4].
>>
>> [1]
>> https://lore.kernel.org/linux-mm/20230317105802.2634004-1-ryan.roberts@arm.com/
>> [2]
>> https://lore.kernel.org/linux-mm/d347c5b0-0c0f-ae50-9613-2cf962d8676e@arm.com/
>> [3]
>> https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
>> [4]
>> https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc2
>>
>> Approach
>> ========
>>
>> There are 4 fault paths that have been modified:
>>   - write fault on unallocated address: do_anonymous_page()
>>   - write fault on zero page: wp_page_copy()
>>   - write fault on non-exclusive CoW page: wp_page_copy()
>>   - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()
>>
>> In the first 2 cases, we will determine the preferred order folio to allocate,
>> limited by a max order (currently order-4; see below), VMA and PMD bounds, and
>> state of neighboring PTEs. In the 3rd case, we aim to allocate the same order
>> folio as the source, subject to constraints that may arise if the source has
>> been mremapped or partially munmapped. And in the 4th case, we reuse as much of
>> the folio as we can, subject to the same mremap/munmap constraints.
> 
> Just a note (that you maybe already know) that we have to be a bit careful in
> the wp_copy path with replacing sub-pages that are marked exclusive.

Ahh, no I wasn't aware of this - thanks for taking the time to explain it. I
think I have a bug.

(I'm guessing the GUP fast path assumes that if it sees an exclusive page then
that page can't go away? And if I then wp_copy it, I'm breaking the assumption?
But surely user space can munmap it at any time and the consequences are
similar? It's probably clear that I don't know much about the GUP implementation
details...)

My current patch always prefers reuse over copy, and expands the size of the
reuse to the biggest set of pages that are all exclusive (determined either by
the presence of the anon_exclusive flag or from the refcount), and covered by
the same folio (and a few other bounds constraints - see
calc_anon_folio_range_reuse()).

If I determine I must copy (because the "anchor" page is not exclusive), then I
determine the size of the copy region based on a few constraints (see
calc_anon_folio_order_copy()). But I think you are saying that no pages in that
region are allowed to have the anon_exclusive flag set? In which case, this
could be fixed by adding that check in the function.

> 
> Currently, we always only replace a single shared anon (sub)page by a fresh
> exclusive base-page during a write-fault/unsharing. As the sub-page is already
> marked "maybe shared", it cannot get pinned concurrently and everybody is happy.

When you say "maybe shared", is that determined by the absence of the
"exclusive" flag?

> 
> If you now decide to replace more subpages, you have to be careful that none of
> them are still exclusive -- because they could get pinned concurrently and
> replacing them would result in memory corruptions.
> 
> There are scenarios (most prominently: MADV_WIPEONFORK), but also failed partial
> fork() that could result in something like that.

Are there any test cases that stress the kernel in this area that I could use to
validate my fix?

> 
> Further, we have to be a bit careful regarding replacing ranges that are backed
> by different anon pages (for example, due to fork() deciding to copy some
> sub-pages of a PTE-mapped folio instead of sharing all sub-pages).

I don't understand this statement; do you mean "different anon _folios_"? I am
scanning the page table to expand the region that I reuse/copy and as part of
that scan, make sure that I only cover a single folio. So I think I conform here
- the scan would give up once it gets to the hole.

> 
> 
> So what should be safe is replacing all sub-pages of a folio that are marked
> "maybe shared" by a new folio under PT lock. However, I wonder if it's really
> worth the complexity. For THP we were happy so far to *not* optimize this,
> implying that maybe we shouldn't worry about optimizing the fork() case for now
> that heavily.

I don't have the exact numbers to hand, but I'm pretty sure I remember that
enabling large copies was contributing a measurable amount to the performance
improvement. (Certainly, the zero-page copy case is definitely a big
contributor.) I don't have access to the HW at the moment but can rerun later
with and without to double check.

> 
> 
> One optimization once could think of instead (that I raised previously in other
> context) is the detection of exclusivity after fork()+exit in the child (IOW,
> only the parent continues to exist). Once PG_anon_exclusive was cleared for all
> sub-pages of the THP-mapped folio during fork(), we'd always decide to copy
> instead of reuse (because page_count() > 1, as the folio is PTE mapped).
> Scanning the surrounding page table if it makes sense (e.g., page_count() <=
> folio_nr_pages()), to test if all page references are from the current process
> would allow for reusing the folio (setting PG_anon_exclusive) for the sub-pages.
> The smaller the folio order, the cheaper this "scan surrounding PTEs" scan is.
> For THP, which are usually PMD-mapped even after fork()+exit, we didn't add this
> optimization.

Yes, I have already implemented this in my series; see patch 10.

Thanks,
Ryan



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory
  2023-04-17 11:43   ` Ryan Roberts
@ 2023-04-17 14:05     ` David Hildenbrand
  2023-04-17 15:38       ` Ryan Roberts
  2023-04-19 10:12       ` Ryan Roberts
  0 siblings, 2 replies; 44+ messages in thread
From: David Hildenbrand @ 2023-04-17 14:05 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
	Yu Zhao, Yin, Fengwei
  Cc: linux-mm, linux-arm-kernel

[...]

>> Just a note (that you maybe already know) that we have to be a bit careful in
>> the wp_copy path with replacing sub-pages that are marked exclusive.
> 
> Ahh, no I wasn't aware of this - thanks for taking the time to explain it. I
> think I have a bug.
> 
> (I'm guessing the GUP fast path assumes that if it sees an exclusive page then
> that page can't go away? And if I then wp_copy it, I'm breaking the assumption?
> But surely user space can munmap it at any time and the consequences are
> similar? It's probably clear that I don't know much about the GUP implementation
> details...)

If GUP finds a read-only PTE pointing at an exclusive subpage, it 
assumes that this page cannot randomly be replaced by core MM due to 
COW. See gup_must_unshare(). So it can go ahead and pin the page. As 
long as user space doesn't do something stupid with the mapping 
(MADV_DONTNEED, munmap()) the pinned page must correspond to the mapped 
page.

If GUP finds a writeable PTE, it assumes that this page cannot randomly 
be replaced by core MM due to COW -- because writable implies exclusive. 
See, for example the VM_BUG_ON_PAGE() in follow_page_pte(). So, 
similarly, GUP can simply go ahead and pin the page.

GUP-fast runs lockless, not even taking the PT locks. It syncs against 
concurrent fork() using a special seqlock, and essentially unpins 
whatever it temporarily pinned when it detects that fork() was running 
concurrently. But it might result in some pages temporarily being 
flagged as "maybe pinned".

In other cases (!fork()), GUP-fast synchronizes against concurrent 
sharing (KSM) or unmapping (migration, swapout) that implies clearing of 
the PG_anon_exclusive flag of the subpage by first unmapping the PTE and 
conditionally remapping it. See mm/ksm.c:write_protect_page() as an 
example for the sharing side (especially: if page_try_share_anon_rmap() 
fails because the page may be pinned).

Long story short: replacing a r-o "maybe shared" (!exclusive) PTE is 
easy. Replacing an exclusive PTE (including writable PTEs) requires some 
work to sync with GUP-fast and goes rather in the "maybe just don't 
bother" terriroty.

> 
> My current patch always prefers reuse over copy, and expands the size of the
> reuse to the biggest set of pages that are all exclusive (determined either by
> the presence of the anon_exclusive flag or from the refcount), and covered by
> the same folio (and a few other bounds constraints - see
> calc_anon_folio_range_reuse()).
> 
> If I determine I must copy (because the "anchor" page is not exclusive), then I
> determine the size of the copy region based on a few constraints (see
> calc_anon_folio_order_copy()). But I think you are saying that no pages in that
> region are allowed to have the anon_exclusive flag set? In which case, this
> could be fixed by adding that check in the function.

Yes, changing a PTE that points at an anonymous subpage that has the 
"exclusive" flag set requires more care.

> 
>>
>> Currently, we always only replace a single shared anon (sub)page by a fresh
>> exclusive base-page during a write-fault/unsharing. As the sub-page is already
>> marked "maybe shared", it cannot get pinned concurrently and everybody is happy.
> 
> When you say, "maybe shared" is that determined by the absence of the
> "exclusive" flag?

Yes. Semantics of PG_anon_exclusive are "exclusive" vs. "maybe shared". 
Once "maybe shared", we must only go back to "exclusive (set the flag) 
if we are sure that there are no other references to the page.

> 
>>
>> If you now decide to replace more subpages, you have to be careful that none of
>> them are still exclusive -- because they could get pinned concurrently and
>> replacing them would result in memory corruptions.
>>
>> There are scenarios (most prominently: MADV_WIPEONFORK), but also failed partial
>> fork() that could result in something like that.
> 
> Are there any test cases that stress the kernel in this area that I could use to
> validate my fix?

tools/testing/selftests/mm/cow.c does extensive tests (including some 
MADV_DONTFORK -- that's what I actually meant --  and partial mremap 
tests), but mostly focuses on ordinary base pages (order-0), THP, and 
hugetlb.

We don't have any "GUP-fast racing with fork()" tests or similar yet 
(tests that rely on races are not a good candidate for selftests).

We might want to extend tools/testing/selftests/mm/cow.c to test for 
some of the cases you extend.

We may also change the detection of THP (I think, by luck, it would 
currently also test your patches to some degree set the way it tests for 
THP)

if (!pagemap_is_populated(pagemap_fd, mem + pagesize)) {
	ksft_test_result_skip("Did not get a THP populated\n");
	goto munmap;
}

Would have to be, for example,

if (!pagemap_is_populated(pagemap_fd, mem + thpsize - pagesize)) {
	ksft_test_result_skip("Did not get a THP populated\n");
	goto munmap;
}

Because we touch the first PTE in a PMD and want to test if core-mm gave 
us a full THP (last PTE also populated).


Extending the tests to cover other anon THP sizes could work by aligning 
a VMA to THP/2 size (so we're sure we don't get a full THP), and then 
testing if we get more PTEs populated -> your code active.
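
A rough userspace sketch of that setup (pagemap_fd/pagemap_is_populated() are
the existing cow.c helpers quoted above; the alignment trick itself is just an
illustration):

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Reserve 2 * thpsize, then use a region that starts half-way into a
 * PMD, so core-mm cannot hand us a full PMD-sized THP. After writing
 * the first byte, check how many neighbouring PTEs got populated.
 */
static char *alloc_half_pmd_region(size_t thpsize)
{
	char *map, *mem;

	map = mmap(NULL, 2 * thpsize, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED)
		return NULL;

	/* round up to a PMD boundary, then offset by half a THP */
	mem = (char *)(((uintptr_t)map + thpsize - 1) &
		       ~(uintptr_t)(thpsize - 1));
	mem += thpsize / 2;

	mem[0] = 1;	/* write fault: variable-order anon folio path */

	/* e.g. then check pagemap_is_populated(pagemap_fd, mem + pagesize) */
	return mem;
}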

> 
>>
>> Further, we have to be a bit careful regarding replacing ranges that are backed
>> by different anon pages (for example, due to fork() deciding to copy some
>> sub-pages of a PTE-mapped folio instead of sharing all sub-pages).
> 
> I don't understand this statement; do you mean "different anon _folios_"? I am
> scanning the page table to expand the region that I reuse/copy and as part of
> that scan, make sure that I only cover a single folio. So I think I conform here
> - the scan would give up once it gets to the hole.

During fork(), what could happen (temporary detection of pinned page 
resulting in a copy) is something weird like:

PTE 0: subpage0 of anon page #1 (maybe shared)
PTE 1: subpage1 of anon page #1 (maybe shared)
PTE 2: anon page #2 (exclusive)
PTE 3: subpage2 of anon page #1 (maybe shared)

Of course, any combination of above.

Further, with mremap() we might get completely crazy layouts, randomly 
mapping sub-pages of anon pages, mixed with other sub-pages or base-page 
folios.

Maybe it's all handled already by your code, just pointing out which 
kind of mess we might get :)

> 
>>
>>
>> So what should be safe is replacing all sub-pages of a folio that are marked
>> "maybe shared" by a new folio under PT lock. However, I wonder if it's really
>> worth the complexity. For THP we were happy so far to *not* optimize this,
>> implying that maybe we shouldn't worry about optimizing the fork() case for now
>> that heavily.
> 
> I don't have the exact numbers to hand, but I'm pretty sure I remember enabling
> large copies was contributing a measurable amount to the performance
> improvement. (Certainly, the zero-page copy case, is definitely a big
> contributer). I don't have access to the HW at the moment but can rerun later
> with and without to double check.

In which test exactly? Some micro-benchmark?

> 
>>
>>
>> One optimization once could think of instead (that I raised previously in other
>> context) is the detection of exclusivity after fork()+exit in the child (IOW,
>> only the parent continues to exist). Once PG_anon_exclusive was cleared for all
>> sub-pages of the THP-mapped folio during fork(), we'd always decide to copy
>> instead of reuse (because page_count() > 1, as the folio is PTE mapped).
>> Scanning the surrounding page table if it makes sense (e.g., page_count() <=
>> folio_nr_pages()), to test if all page references are from the current process
>> would allow for reusing the folio (setting PG_anon_exclusive) for the sub-pages.
>> The smaller the folio order, the cheaper this "scan surrounding PTEs" scan is.
>> For THP, which are usually PMD-mapped even after fork()+exit, we didn't add this
>> optimization.
> 
> Yes, I have already implemented this in my series; see patch 10.

Oh, good! That's the most important part.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory
  2023-04-17 14:05     ` David Hildenbrand
@ 2023-04-17 15:38       ` Ryan Roberts
  2023-04-17 15:44         ` David Hildenbrand
  2023-04-19 10:12       ` Ryan Roberts
  1 sibling, 1 reply; 44+ messages in thread
From: Ryan Roberts @ 2023-04-17 15:38 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Matthew Wilcox (Oracle),
	Yu Zhao, Yin, Fengwei
  Cc: linux-mm, linux-arm-kernel

On 17/04/2023 15:05, David Hildenbrand wrote:
> [...]
> 
>>> Just a note (that you maybe already know) that we have to be a bit careful in
>>> the wp_copy path with replacing sub-pages that are marked exclusive.
>>
>> Ahh, no I wasn't aware of this - thanks for taking the time to explain it. I
>> think I have a bug.
>>
>> (I'm guessing the GUP fast path assumes that if it sees an exclusive page then
>> that page can't go away? And if I then wp_copy it, I'm breaking the assumption?
>> But surely user space can munmap it at any time and the consequences are
>> similar? It's probably clear that I don't know much about the GUP implementation
>> details...)
> 
> If GUP finds a read-only PTE pointing at an exclusive subpage, it assumes that
> this page cannot randomly be replaced by core MM due to COW. See
> gup_must_unshare(). So it can go ahead and pin the page. As long as user space
> doesn't do something stupid with the mapping (MADV_DONTNEED, munmap()) the
> pinned page must correspond to the mapped page.
> 
> If GUP finds a writeable PTE, it assumes that this page cannot randomly be
> replaced by core MM due to COW -- because writable implies exclusive. See, for
> example the VM_BUG_ON_PAGE() in follow_page_pte(). So, similarly, GUP can simply
> go ahead and pin the page.
> 
> GUP-fast runs lockless, not even taking the PT locks. It syncs against
> concurrent fork() using a special seqlock, and essentially unpins whatever it
> temporarily pinned when it detects that fork() was running concurrently. But it
> might result in some pages temporarily being flagged as "maybe pinned".
> 
> In other cases (!fork()), GUP-fast synchronizes against concurrent sharing (KSM)
> or unmapping (migration, swapout) that implies clearing of the PG_anon_flag of
> the subpage by first unmapping the PTE and conditionally remapping it. See
> mm/ksm.c:write_protect_page() as an example for the sharing side (especially: if
> page_try_share_anon_rmap() fails because the page may be pinned).
> 
> Long story short: replacing a r-o "maybe shared" (!exclusive) PTE is easy.
> Replacing an exclusive PTE (including writable PTEs) requires some work to sync
> with GUP-fast and goes rather in the "maybe just don't bother" terriroty.

Yep agreed. I'll plan to fix this by adding the constraint that all pages of the
copy range (calc_anon_folio_order_copy()) must be "maybe shared".
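
A minimal sketch of that constraint, assuming the copy range is already known
to lie within a single folio and the PT lock is held (the helper name and
parameters are illustrative only, not the final code):

/*
 * Refuse to batch-copy any range that still contains an exclusive
 * sub-page; such PTEs must keep going through the existing
 * single-page wp_page_copy() handling.
 */
static bool anon_range_maybe_shared(struct page *first_subpage, int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		if (PageAnonExclusive(first_subpage + i))
			return false;
	}
	return true;
}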

> 
>>
>> My current patch always prefers reuse over copy, and expands the size of the
>> reuse to the biggest set of pages that are all exclusive (determined either by
>> the presence of the anon_exclusive flag or from the refcount), and covered by
>> the same folio (and a few other bounds constraints - see
>> calc_anon_folio_range_reuse()).
>>
>> If I determine I must copy (because the "anchor" page is not exclusive), then I
>> determine the size of the copy region based on a few constraints (see
>> calc_anon_folio_order_copy()). But I think you are saying that no pages in that
>> region are allowed to have the anon_exclusive flag set? In which case, this
>> could be fixed by adding that check in the function.
> 
> Yes, changing a PTE that points at an anonymous subpage that has the "exclusive"
> flag set requires more care.
> 
>>
>>>
>>> Currently, we always only replace a single shared anon (sub)page by a fresh
>>> exclusive base-page during a write-fault/unsharing. As the sub-page is already
>>> marked "maybe shared", it cannot get pinned concurrently and everybody is happy.
>>
>> When you say, "maybe shared" is that determined by the absence of the
>> "exclusive" flag?
> 
> Yes. Semantics of PG_anon_exclusive are "exclusive" vs. "maybe shared". Once
> "maybe shared", we must only go back to "exclusive (set the flag) if we are sure
> that there are no other references to the page.
> 
>>
>>>
>>> If you now decide to replace more subpages, you have to be careful that none of
>>> them are still exclusive -- because they could get pinned concurrently and
>>> replacing them would result in memory corruptions.
>>>
>>> There are scenarios (most prominently: MADV_WIPEONFORK), but also failed partial
>>> fork() that could result in something like that.
>>
>> Are there any test cases that stress the kernel in this area that I could use to
>> validate my fix?
> 
> tools/testing/selftests/mm/cow.c does excessive tests (including some
> MADV_DONTFORK -- that's what I actually meant --  and partial mremap tests), but
> mostly focuses on ordinary base pages (order-0), THP, and hugetlb.
> 
> We don't have any "GUP-fast racing with fork()" tests or similar yet (tests that
> rely on races are not a good candidate for selftests).
> 
> We might want to extend tools/testing/selftests/mm/cow.c to test for some of the
> cases you extend.
> 
> We may also change the detection of THP (I think, by luck, it would currently
> also test your patches to some degree set the way it tests for THP)
> 
> if (!pagemap_is_populated(pagemap_fd, mem + pagesize)) {
>     ksft_test_result_skip("Did not get a THP populated\n");
>     goto munmap;
> }
> 
> Would have to be, for example,
> 
> if (!pagemap_is_populated(pagemap_fd, mem + thpsize - pagesize)) {
>     ksft_test_result_skip("Did not get a THP populated\n");
>     goto munmap;
> }
> 
> Because we touch the first PTE in a PMD and want to test if core-mm gave us a
> full THP (last PTE also populated).
> 
> 
> Extending the tests to cover other anon THP sizes could work by aligning a VMA
> to THP/2 size (so we're sure we don't get a full THP), and then testing if we
> get more PTEs populated -> your code active.

Thanks. I'll run all these and make sure they pass and look at adding new
variants for the next rev.

> 
>>
>>>
>>> Further, we have to be a bit careful regarding replacing ranges that are backed
>>> by different anon pages (for example, due to fork() deciding to copy some
>>> sub-pages of a PTE-mapped folio instead of sharing all sub-pages).
>>
>> I don't understand this statement; do you mean "different anon _folios_"? I am
>> scanning the page table to expand the region that I reuse/copy and as part of
>> that scan, make sure that I only cover a single folio. So I think I conform here
>> - the scan would give up once it gets to the hole.
> 
> During fork(), what could happen (temporary detection of pinned page resulting
> in a copy) is something weird like:
> 
> PTE 0: subpage0 of anon page #1 (maybe shared)
> PTE 1: subpage1 of anon page #1 (maybe shared
> PTE 2: anon page #2 (exclusive)
> PTE 3: subpage2 of anon page #1 (maybe shared

Hmm... I can see how this could happen if you mremap PTE2 to PTE3, then mmap
something new in PTE2. But I don't see how it happens at fork. For PTE3, did you
mean subpage _3_?

> 
> Of course, any combination of above.
> 
> Further, with mremap() we might get completely crazy layouts, randomly mapping
> sub-pages of anon pages, mixed with other sub-pages or base-page folios.
> 
> Maybe it's all handled already by your code, just pointing out which kind of
> mess we might get :)

Yep, this is already handled; the scan to expand the range ensures that all the
PTEs map to the expected contiguous pages in the same folio.
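
For reference, the shape of that scan is roughly the following (a sketch with
assumed parameters; the real code also applies the VMA/PMD and reuse/copy
bounds discussed in the cover letter):

/*
 * Extend the range only while each PTE is present and maps the next
 * physically contiguous page of the same folio; stop at the first
 * hole or foreign page.
 */
static int anon_folio_contig_ptes(struct folio *folio, pte_t *ptep,
				  unsigned long first_pfn, int max_nr)
{
	int i;

	for (i = 0; i < max_nr; i++) {
		pte_t pte = ptep_get(ptep + i);

		if (!pte_present(pte) || pte_pfn(pte) != first_pfn + i)
			break;
		if (pte_pfn(pte) - folio_pfn(folio) >= folio_nr_pages(folio))
			break;
	}
	return i;
}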

> 
>>
>>>
>>>
>>> So what should be safe is replacing all sub-pages of a folio that are marked
>>> "maybe shared" by a new folio under PT lock. However, I wonder if it's really
>>> worth the complexity. For THP we were happy so far to *not* optimize this,
>>> implying that maybe we shouldn't worry about optimizing the fork() case for now
>>> that heavily.
>>
>> I don't have the exact numbers to hand, but I'm pretty sure I remember enabling
>> large copies was contributing a measurable amount to the performance
>> improvement. (Certainly, the zero-page copy case, is definitely a big
>> contributer). I don't have access to the HW at the moment but can rerun later
>> with and without to double check.
> 
> In which test exactly? Some micro-benchmark?

The kernel compile benchmark that I quoted numbers for in the cover letter. I
have some trace points (not part of the submitted series) that tell me how many
mappings of each order we get for each code path. I'm pretty sure I remember all
of these 4 code paths contributing non-negligible amounts.

> 
>>
>>>
>>>
>>> One optimization once could think of instead (that I raised previously in other
>>> context) is the detection of exclusivity after fork()+exit in the child (IOW,
>>> only the parent continues to exist). Once PG_anon_exclusive was cleared for all
>>> sub-pages of the THP-mapped folio during fork(), we'd always decide to copy
>>> instead of reuse (because page_count() > 1, as the folio is PTE mapped).
>>> Scanning the surrounding page table if it makes sense (e.g., page_count() <=
>>> folio_nr_pages()), to test if all page references are from the current process
>>> would allow for reusing the folio (setting PG_anon_exclusive) for the sub-pages.
>>> The smaller the folio order, the cheaper this "scan surrounding PTEs" scan is.
>>> For THP, which are usually PMD-mapped even after fork()+exit, we didn't add this
>>> optimization.
>>
>> Yes, I have already implemented this in my series; see patch 10.
> 
> Oh, good! That's the most important part.
> 



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory
  2023-04-17 15:38       ` Ryan Roberts
@ 2023-04-17 15:44         ` David Hildenbrand
  2023-04-17 16:15           ` Ryan Roberts
  2023-04-26 10:41           ` Ryan Roberts
  0 siblings, 2 replies; 44+ messages in thread
From: David Hildenbrand @ 2023-04-17 15:44 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
	Yu Zhao, Yin, Fengwei
  Cc: linux-mm, linux-arm-kernel

>>>
>>>>
>>>> Further, we have to be a bit careful regarding replacing ranges that are backed
>>>> by different anon pages (for example, due to fork() deciding to copy some
>>>> sub-pages of a PTE-mapped folio instead of sharing all sub-pages).
>>>
>>> I don't understand this statement; do you mean "different anon _folios_"? I am
>>> scanning the page table to expand the region that I reuse/copy and as part of
>>> that scan, make sure that I only cover a single folio. So I think I conform here
>>> - the scan would give up once it gets to the hole.
>>
>> During fork(), what could happen (temporary detection of pinned page resulting
>> in a copy) is something weird like:
>>
>> PTE 0: subpage0 of anon page #1 (maybe shared)
>> PTE 1: subpage1 of anon page #1 (maybe shared
>> PTE 2: anon page #2 (exclusive)
>> PTE 3: subpage2 of anon page #1 (maybe shared
> 
> Hmm... I can see how this could happen if you mremap PTE2 to PTE3, then mmap
> something new in PTE2. But I don't see how it happens at fork. For PTE3, did you
> mean subpage _3_?
>

Yes, fat fingers :) Thanks for paying attention!

The above could be optimized by processing all consecutive PTEs at once: 
meaning, we check only once whether the page may be pinned, and then either 
copy all PTEs or share all PTEs. It's unlikely to happen in practice, I 
guess, though.


>>
>> Of course, any combination of above.
>>
>> Further, with mremap() we might get completely crazy layouts, randomly mapping
>> sub-pages of anon pages, mixed with other sub-pages or base-page folios.
>>
>> Maybe it's all handled already by your code, just pointing out which kind of
>> mess we might get :)
> 
> Yep, this is already handled; the scan to expand the range ensures that all the
> PTEs map to the expected contiguous pages in the same folio.

Okay, great.

> 
>>
>>>
>>>>
>>>>
>>>> So what should be safe is replacing all sub-pages of a folio that are marked
>>>> "maybe shared" by a new folio under PT lock. However, I wonder if it's really
>>>> worth the complexity. For THP we were happy so far to *not* optimize this,
>>>> implying that maybe we shouldn't worry about optimizing the fork() case for now
>>>> that heavily.
>>>
>>> I don't have the exact numbers to hand, but I'm pretty sure I remember enabling
>>> large copies was contributing a measurable amount to the performance
>>> improvement. (Certainly, the zero-page copy case, is definitely a big
>>> contributer). I don't have access to the HW at the moment but can rerun later
>>> with and without to double check.
>>
>> In which test exactly? Some micro-benchmark?
> 
> The kernel compile benchmark that I quoted numbers for in the cover letter. I
> have some trace points (not part of the submitted series) that tell me how many
> mappings of each order we get for each code path. I'm pretty sure I remember all
> of these 4 code paths contributing non-negligible amounts.

Interesting! It would be great to see if there is an actual difference 
after patch #10 was applied without the other COW replacement.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory
  2023-04-17 15:44         ` David Hildenbrand
@ 2023-04-17 16:15           ` Ryan Roberts
  2023-04-26 10:41           ` Ryan Roberts
  1 sibling, 0 replies; 44+ messages in thread
From: Ryan Roberts @ 2023-04-17 16:15 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Matthew Wilcox (Oracle),
	Yu Zhao, Yin, Fengwei
  Cc: linux-mm, linux-arm-kernel

On 17/04/2023 16:44, David Hildenbrand wrote:
>>>>
>>>>>
>>>>>
>>>>> So what should be safe is replacing all sub-pages of a folio that are marked
>>>>> "maybe shared" by a new folio under PT lock. However, I wonder if it's really
>>>>> worth the complexity. For THP we were happy so far to *not* optimize this,
>>>>> implying that maybe we shouldn't worry about optimizing the fork() case for
>>>>> now
>>>>> that heavily.
>>>>
>>>> I don't have the exact numbers to hand, but I'm pretty sure I remember enabling
>>>> large copies was contributing a measurable amount to the performance
>>>> improvement. (Certainly, the zero-page copy case, is definitely a big
>>>> contributer). I don't have access to the HW at the moment but can rerun later
>>>> with and without to double check.
>>>
>>> In which test exactly? Some micro-benchmark?
>>
>> The kernel compile benchmark that I quoted numbers for in the cover letter. I
>> have some trace points (not part of the submitted series) that tell me how many
>> mappings of each order we get for each code path. I'm pretty sure I remember all
>> of these 4 code paths contributing non-negligible amounts.
> 
> Interesting! It would be great to see if there is an actual difference after
> patch #10 was applied without the other COW replacement.
> 

I'll aim to get some formal numbers when I next have access to the HW.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory
  2023-04-17 14:05     ` David Hildenbrand
  2023-04-17 15:38       ` Ryan Roberts
@ 2023-04-19 10:12       ` Ryan Roberts
  2023-04-19 10:51         ` David Hildenbrand
  1 sibling, 1 reply; 44+ messages in thread
From: Ryan Roberts @ 2023-04-19 10:12 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Matthew Wilcox (Oracle),
	Yu Zhao, Yin, Fengwei
  Cc: linux-mm, linux-arm-kernel

Hi David,

On 17/04/2023 15:05, David Hildenbrand wrote:
> [...]
> 
>>> Just a note (that you maybe already know) that we have to be a bit careful in
>>> the wp_copy path with replacing sub-pages that are marked exclusive.
>>
>> Ahh, no I wasn't aware of this - thanks for taking the time to explain it. I
>> think I have a bug.
>>
>> (I'm guessing the GUP fast path assumes that if it sees an exclusive page then
>> that page can't go away? And if I then wp_copy it, I'm breaking the assumption?
>> But surely user space can munmap it at any time and the consequences are
>> similar? It's probably clear that I don't know much about the GUP implementation
>> details...)
> 
> If GUP finds a read-only PTE pointing at an exclusive subpage, it assumes that
> this page cannot randomly be replaced by core MM due to COW. See
> gup_must_unshare(). So it can go ahead and pin the page. As long as user space
> doesn't do something stupid with the mapping (MADV_DONTNEED, munmap()) the
> pinned page must correspond to the mapped page.
> 
> If GUP finds a writeable PTE, it assumes that this page cannot randomly be
> replaced by core MM due to COW -- because writable implies exclusive. See, for
> example the VM_BUG_ON_PAGE() in follow_page_pte(). So, similarly, GUP can simply
> go ahead and pin the page.
> 
> GUP-fast runs lockless, not even taking the PT locks. It syncs against
> concurrent fork() using a special seqlock, and essentially unpins whatever it
> temporarily pinned when it detects that fork() was running concurrently. But it
> might result in some pages temporarily being flagged as "maybe pinned".
> 
> In other cases (!fork()), GUP-fast synchronizes against concurrent sharing (KSM)
> or unmapping (migration, swapout) that implies clearing of the PG_anon_exclusive flag of
> the subpage by first unmapping the PTE and conditionally remapping it. See
> mm/ksm.c:write_protect_page() as an example for the sharing side (especially: if
> page_try_share_anon_rmap() fails because the page may be pinned).
> 
> Long story short: replacing a r-o "maybe shared" (!exclusive) PTE is easy.
> Replacing an exclusive PTE (including writable PTEs) requires some work to sync
> with GUP-fast and goes rather in the "maybe just don't bother" territory.
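
Restating the rule above as a sketch, just to check my understanding (the
helper name below is made up; this is not code from the series or from
mainline, it only encodes the decision described above):

#include <linux/mm.h>		/* struct page */
#include <linux/page-flags.h>	/* PageAnonExclusive() */

/*
 * Illustrative only: when GUP(-fast) finds a R/O PTE mapping an anon
 * subpage, pinning through that mapping is only safe if the subpage is
 * marked exclusive. A "maybe shared" subpage could legitimately be
 * replaced by COW, so GUP has to request unsharing (or, for GUP-fast,
 * fall back to the slow path) instead of pinning it.
 */
static bool may_pin_ro_anon_subpage(struct page *page)
{
	if (!PageAnonExclusive(page))
		return false;	/* maybe shared: must not pin */
	return true;		/* exclusive: won't be replaced via COW */
}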

I'm looking to fix this problem in my code, but am struggling to see how the
current code is safe. I'm thinking about the following scenario:

 - A page is CoW mapped into processes A and B.
 - The page takes a fault in process A, and do_wp_page() determines that it is
   "maybe-shared" and therefore must copy. So drops the PTL and calls
   wp_page_copy().
 - Process B exits.
 - Another thread in process A faults on the page. This time do_wp_page()
   determines that the page is exclusive (due to the ref count), and reuses it,
   marking it exclusive along the way.
 - wp_page_copy() from the original thread in process A retakes the PTL and
   copies the _now exclusive_ page.

Having typed it up, I guess this can't happen, because wp_page_copy() will only
do the copy if the PTE hasn't changed and it will have changed because it is now
writable? So this is safe?

To make things more convoluted, what happens if the second thread does an
mprotect() to make the page RO after its write fault was handled? I think
mprotect() will serialize on the mmap write lock so this is safe too?

Sorry if this is a bit rambly, just trying to make sure I've understood
everything correctly.

Thanks,
Ryan



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory
  2023-04-19 10:12       ` Ryan Roberts
@ 2023-04-19 10:51         ` David Hildenbrand
  2023-04-19 11:13           ` Ryan Roberts
  0 siblings, 1 reply; 44+ messages in thread
From: David Hildenbrand @ 2023-04-19 10:51 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
	Yu Zhao, Yin, Fengwei
  Cc: linux-mm, linux-arm-kernel

> I'm looking to fix this problem in my code, but am struggling to see how the
> current code is safe. I'm thinking about the following scenario:
> 

Let's see :)

>   - A page is CoW mapped into processes A and B.
>   - The page takes a fault in process A, and do_wp_page() determines that it is
>     "maybe-shared" and therefore must copy. So drops the PTL and calls
>     wp_page_copy().

Note that before calling wp_page_copy(), we do a folio_get(folio); 
Further, the page table reference is only dropped once we actually 
replace the page in the page table. So while in wp_page_copy(), the 
folio should have at least 2 references if the page is still mapped.
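
In other words (a deliberately simplified sketch, not the real reuse code,
which also has to cope with the swapcache, KSM, LRU batches and so on):
the COW-reuse test is conceptually "am I the only reference holder?", and
the extra reference taken before wp_page_copy() makes that test fail:

#include <linux/mm.h>	/* struct folio, folio_ref_count() */

/* Sketch only: why the concurrent fault cannot reuse the page here. */
static bool cow_can_reuse(struct folio *folio)
{
	return folio_ref_count(folio) == 1;
}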

>   - Process B exits.
>   - Another thread in process A faults on the page. This time do_wp_page()
>     determines that the page is exclusive (due to the ref count), and reuses it,
>     marking it exclusive along the way.

The refcount should not be 1 (other reference from the wp_page_copy() 
caller), so A won't be able to reuse it, and ...

>   - wp_page_copy() from the original thread in process A retakes the PTL and
>     copies the _now exclusive_ page.
> 
> Having typed it up, I guess this can't happen, because wp_page_copy() will only
> do the copy if the PTE hasn't changed and it will have changed because it is now
> writable? So this is safe?

this applies as well. If the pte changed (when reusing due to a write 
fault it's now writable, or someone else broke COW), we back off. For 
FAULT_FLAG_UNSHARE, however, the PTE may not change. But the additional 
reference should make it work.

I think it works as intended. It would be clearer if we'd also recheck 
in wp_page_copy() whether we still don't have an exclusive anon page 
under PT lock --  and if we would, back off.
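
Roughly along these lines (sketch only; the plumbing is simplified, the
helper name is made up, and exactly where such a check would sit inside
wp_page_copy() is left out):

#include <linux/mm.h>

/*
 * Sketch: recheck, under the PT lock and just before committing the
 * copy, that it is still OK to replace @page. Assumes the caller has
 * mapped vmf->pte and holds vmf->ptl.
 */
static bool can_still_replace(struct vm_fault *vmf, struct page *page)
{
	/* PTE changed (COW already broken or the page reused): back off. */
	if (!pte_same(ptep_get(vmf->pte), vmf->orig_pte))
		return false;
	/* The page became exclusive meanwhile (may be pinned): back off. */
	if (PageAnon(page) && PageAnonExclusive(page))
		return false;
	return true;
}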

> 
> To make things more convoluted, what happens if the second thread does an
> mprotect() to make the page RO after its write fault was handled? I think
> mprotect() will serialize on the mmap write lock so this is safe too?

Yes, mprotect() synchronizes that. There are other mechanisms to 
write-protect a page, though, under mmap lock in read mode (uffd-wp). So 
it's a valid concern.

In all of these cases, reuse should be prevented due to the additional 
reference on the folio when entering wp_page_copy() right from the 
start, not turning the page exclusive but instead replacing it by a 
copy. An additional sanity check sounds like the right thing to do.

> 
> Sorry if this is a bit rambly, just trying to make sure I've understood
> everything correctly.

It's a very interesting corner case, thanks for bringing that up. I 
think the old mapcount based approach could have suffered from this 
theoretical issue, but I might be wrong.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory
  2023-04-19 10:51         ` David Hildenbrand
@ 2023-04-19 11:13           ` Ryan Roberts
  0 siblings, 0 replies; 44+ messages in thread
From: Ryan Roberts @ 2023-04-19 11:13 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Matthew Wilcox (Oracle),
	Yu Zhao, Yin, Fengwei
  Cc: linux-mm, linux-arm-kernel

On 19/04/2023 11:51, David Hildenbrand wrote:
>> I'm looking to fix this problem in my code, but am struggling to see how the
>> current code is safe. I'm thinking about the following scenario:
>>
> 
> Let's see :)
> 
>>   - A page is CoW mapped into processes A and B.
>>   - The page takes a fault in process A, and do_wp_page() determines that it is
>>     "maybe-shared" and therefore must copy. So drops the PTL and calls
>>     wp_page_copy().
> 
> Note that before calling wp_page_copy(), we do a folio_get(folio); Further, the
> page table reference is only dropped once we actually replace the page in the
> page table. So while in wp_page_copy(), the folio should have at least 2
> references if the page is still mapped.

Ahh, yes, of course!

> 
>>   - Process B exits.
>>   - Another thread in process A faults on the page. This time do_wp_page()
>>     determines that the page is exclusive (due to the ref count), and reuses it,
>>     marking it exclusive along the way.
> 
> The refcount should not be 1 (other reference from the wp_page_copy() caller),
> so A won't be able to reuse it, and ...
> 
>>   - wp_page_copy() from the original thread in process A retakes the PTL and
>>     copies the _now exclusive_ page.
>>
>> Having typed it up, I guess this can't happen, because wp_page_copy() will only
>> do the copy if the PTE hasn't changed and it will have changed because it is now
>> writable? So this is safe?
> 
> this applies as well. If the pte changed (when reusing due to a write fault it's
> now writable, or someone else broke COW), we back off. For FAULT_FLAG_UNSHARE,
> however, the PTE may not change. But the additional reference should make it work.
> 
> I think it works as intended. It would be clearer if we'd also recheck in
> wp_page_copy() whether we still don't have an exclusive anon page under PT lock
> --  and if we would, back off.

Yes, agreed. I now have a fix for my code that adds a check during the PTE scan
that each page is non-exclusive. And the scan is (already) done once without the
PTL to determine the size of the destination folio, and then again with the lock to
ensure there was no race. I'm not seeing any change in the relative counts of
folio orders (which is expected due to the case being rare).
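
The shape of that check is roughly the following (illustrative helper, not
the exact code I'll post; it walks the candidate PTE range and refuses any
subpage that is mapped exclusive):

#include <linux/mm.h>

/*
 * Sketch only: count how many consecutive subpages, starting at
 * @first_page / @ptep, are still mapped as expected and are not marked
 * exclusive, i.e. are safe to replace by a copy. Run once without the
 * PTL to size the destination folio, then again under the PTL to make
 * sure nothing raced with us.
 */
static int count_replaceable_subpages(pte_t *ptep, struct page *first_page,
				      int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		pte_t pte = ptep_get(ptep + i);
		struct page *page = first_page + i;

		if (!pte_present(pte) || pte_pfn(pte) != page_to_pfn(page))
			break;
		/* Never replace an exclusive subpage; GUP may rely on it. */
		if (PageAnonExclusive(page))
			break;
	}
	return i;
}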

Thanks!


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory
  2023-04-17 15:44         ` David Hildenbrand
  2023-04-17 16:15           ` Ryan Roberts
@ 2023-04-26 10:41           ` Ryan Roberts
  2023-05-17 13:58             ` David Hildenbrand
  1 sibling, 1 reply; 44+ messages in thread
From: Ryan Roberts @ 2023-04-26 10:41 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Matthew Wilcox (Oracle),
	Yu Zhao, Yin, Fengwei
  Cc: linux-mm, linux-arm-kernel

Hi David,

On 17/04/2023 16:44, David Hildenbrand wrote:

>>>>>
>>>>> So what should be safe is replacing all sub-pages of a folio that are marked
>>>>> "maybe shared" by a new folio under PT lock. However, I wonder if it's really
>>>>> worth the complexity. For THP we were happy so far to *not* optimize this,
>>>>> implying that maybe we shouldn't worry about optimizing the fork() case for
>>>>> now
>>>>> that heavily.
>>>>
>>>> I don't have the exact numbers to hand, but I'm pretty sure I remember enabling
>>>> large copies was contributing a measurable amount to the performance
>>>> improvement. (Certainly, the zero-page copy case is definitely a big
>>>> contributor). I don't have access to the HW at the moment but can rerun later
>>>> with and without to double check.
>>>
>>> In which test exactly? Some micro-benchmark?
>>
>> The kernel compile benchmark that I quoted numbers for in the cover letter. I
>> have some trace points (not part of the submitted series) that tell me how many
>> mappings of each order we get for each code path. I'm pretty sure I remember all
>> of these 4 code paths contributing non-negligible amounts.
> 
> Interesting! It would be great to see if there is an actual difference after
> patch #10 was applied without the other COW replacement.
> 

Sorry about the delay. I now have some numbers for this...

I rearranged the patch order so that all the "utility" stuff (new rmap
functions, etc) are first (1, 2, 3, 4, 5, 8, 9, 11, 12, 13), followed by a
couple of general improvements (7, 17), which should be dormant until we have
the final patches, then finally (6, 10, 14, 15), which implement large anon
folios for the allocate, reuse, copy-non-zero and copy-zero paths respectively. I've
dropped patch 16 and fixed the copy-exclusive bug you spotted (by ensuring we
never replace an exclusive page).

I've measured performance at the following locations in the patch set:

- baseline: none of my patches applied
- utility: has utility and general improvement patches applied
- alloc: utility + 6
- reuse: utility + 6 + 10
- copy: utility + 6 + 10 + 14
- zero-alloc: utility + 6 + 10 + 14 + 15

The test is `make defconfig && time make -jN Image` for a clean checkout of
v6.3-rc3. The first result is thrown away, and the next 3 are kept. I saw some
per-boot variance (probably down to kaslr, etc). So have booted each kernel 7
times for a total of 3x7=21 samples per kernel. Then I've taken the mean:


jobs=8:

| label      |   real |   user |   kernel |
|:-----------|-------:|-------:|---------:|
| baseline   |   0.0% |   0.0% |     0.0% |
| utility    |  -2.7% |  -2.8% |    -3.1% |
| alloc      |  -6.0% |  -2.3% |   -24.1% |
| reuse      |  -9.5% |  -5.8% |   -28.5% |
| copy       | -10.6% |  -6.9% |   -29.4% |
| zero-alloc |  -9.2% |  -5.1% |   -29.8% |


jobs=160:

| label      |   real |   user |   kernel |
|:-----------|-------:|-------:|---------:|
| baseline   |   0.0% |   0.0% |     0.0% |
| utility    |  -1.8% |  -0.0% |    -7.7% |
| alloc      |  -6.0% |   1.8% |   -20.9% |
| reuse      |  -7.8% |  -1.6% |   -24.1% |
| copy       |  -7.8% |  -2.5% |   -26.8% |
| zero-alloc |  -7.7% |   1.5% |   -29.4% |


So it looks like patch 10 (reuse) is making a difference, but copy and
zero-alloc are not adding a huge amount, as you hypothesized. Personally I would
prefer not to drop those patches though, as it will all help towards utilization
of contiguous PTEs on arm64, which is the second part of the change that I'm now
working on.


For the final config ("zero-alloc") I also collected stats on how many
operations each of the 4 paths was performing, using ftrace and histograms.
"pnr" is the number of pages allocated/reused/copied, and "fnr" is the number of
pages in the source folio:


do_anonymous_page:

{ pnr:          1 } hitcount:    2749722
{ pnr:          4 } hitcount:     387832
{ pnr:          8 } hitcount:     409628
{ pnr:         16 } hitcount:    4296115

pages: 76315914
faults: 7843297
pages per fault: 9.7
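
(In each of these blocks, "pages" is the pnr-weighted sum of the hitcounts,
"faults" is the total hitcount, and "pages per fault" is their ratio; e.g.
for do_anonymous_page: 1*2749722 + 4*387832 + 8*409628 + 16*4296115 =
76315914 pages over 7843297 faults, which is ~9.7 pages per fault.)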


wp_page_reuse (anon):

{ pnr:          1, fnr:          1 } hitcount:      47887
{ pnr:          3, fnr:          4 } hitcount:          2
{ pnr:          4, fnr:          4 } hitcount:       6131
{ pnr:          6, fnr:          8 } hitcount:          1
{ pnr:          7, fnr:          8 } hitcount:         10
{ pnr:          8, fnr:          8 } hitcount:       3794
{ pnr:          1, fnr:         16 } hitcount:         36
{ pnr:          2, fnr:         16 } hitcount:         23
{ pnr:          3, fnr:         16 } hitcount:          5
{ pnr:          4, fnr:         16 } hitcount:          9
{ pnr:          5, fnr:         16 } hitcount:          8
{ pnr:          6, fnr:         16 } hitcount:          9
{ pnr:          7, fnr:         16 } hitcount:          3
{ pnr:          8, fnr:         16 } hitcount:         24
{ pnr:          9, fnr:         16 } hitcount:          2
{ pnr:         10, fnr:         16 } hitcount:          1
{ pnr:         11, fnr:         16 } hitcount:          9
{ pnr:         12, fnr:         16 } hitcount:          2
{ pnr:         13, fnr:         16 } hitcount:         27
{ pnr:         14, fnr:         16 } hitcount:          2
{ pnr:         15, fnr:         16 } hitcount:         54
{ pnr:         16, fnr:         16 } hitcount:       6673

pages: 211393
faults: 64712
pages per fault: 3.3


wp_page_copy (anon):

{ pnr:          1, fnr:          1 } hitcount:      81242
{ pnr:          4, fnr:          4 } hitcount:       5974
{ pnr:          1, fnr:          8 } hitcount:          1
{ pnr:          4, fnr:          8 } hitcount:          1
{ pnr:          8, fnr:          8 } hitcount:      12933
{ pnr:          1, fnr:         16 } hitcount:         19
{ pnr:          4, fnr:         16 } hitcount:          3
{ pnr:          8, fnr:         16 } hitcount:          7
{ pnr:         16, fnr:         16 } hitcount:       4106

pages: 274390
faults: 104286
pages per fault: 2.6


wp_page_copy (zero):

{ pnr:          1 } hitcount:     178699
{ pnr:          4 } hitcount:      14498
{ pnr:          8 } hitcount:      23644
{ pnr:         16 } hitcount:     257940

pages: 4552883
faults: 474781
pages per fault: 9.6


Thanks,
Ryan



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory
  2023-04-26 10:41           ` Ryan Roberts
@ 2023-05-17 13:58             ` David Hildenbrand
  2023-05-18 11:23               ` Ryan Roberts
  0 siblings, 1 reply; 44+ messages in thread
From: David Hildenbrand @ 2023-05-17 13:58 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
	Yu Zhao, Yin, Fengwei
  Cc: linux-mm, linux-arm-kernel

On 26.04.23 12:41, Ryan Roberts wrote:
> Hi David,
> 
> On 17/04/2023 16:44, David Hildenbrand wrote:
> 
>>>>>>
>>>>>> So what should be safe is replacing all sub-pages of a folio that are marked
>>>>>> "maybe shared" by a new folio under PT lock. However, I wonder if it's really
>>>>>> worth the complexity. For THP we were happy so far to *not* optimize this,
>>>>>> implying that maybe we shouldn't worry about optimizing the fork() case for
>>>>>> now
>>>>>> that heavily.
>>>>>
>>>>> I don't have the exact numbers to hand, but I'm pretty sure I remember enabling
>>>>> large copies was contributing a measurable amount to the performance
>>>>> improvement. (Certainly, the zero-page copy case is definitely a big
>>>>> contributor). I don't have access to the HW at the moment but can rerun later
>>>>> with and without to double check.
>>>>
>>>> In which test exactly? Some micro-benchmark?
>>>
>>> The kernel compile benchmark that I quoted numbers for in the cover letter. I
>>> have some trace points (not part of the submitted series) that tell me how many
>>> mappings of each order we get for each code path. I'm pretty sure I remember all
>>> of these 4 code paths contributing non-negligible amounts.
>>
>> Interesting! It would be great to see if there is an actual difference after
>> patch #10 was applied without the other COW replacement.
>>
> 
> Sorry about the delay. I now have some numbers for this...
> 

Ditto, I'm swamped :) Thanks for running these benchmarks!

As LSF/MM reminded me again of this topic ...

> I rearranged the patch order so that all the "utility" stuff (new rmap
> functions, etc) are first (1, 2, 3, 4, 5, 8, 9, 11, 12, 13), followed by a
> couple of general improvements (7, 17), which should be dormant until we have
> the final patches, then finally (6, 10, 14, 15), which implement large anon
> folios for the allocate, reuse, copy-non-zero and copy-zero paths respectively. I've
> dropped patch 16 and fixed the copy-exclusive bug you spotted (by ensuring we
> never replace an exclusive page).
> 
> I've measured performance at the following locations in the patch set:
> 
> - baseline: none of my patches applied
> - utility: has utility and general improvement patches applied
> - alloc: utility + 6
> - reuse: utility + 6 + 10
> - copy: utility + 6 + 10 + 14
> - zero-alloc: utility + 6 + 10 + 14 + 15
> 
> The test is `make defconfig && time make -jN Image` for a clean checkout of
> v6.3-rc3. The first result is thrown away, and the next 3 are kept. I saw some
> per-boot variance (probably down to kaslr, etc). So have booted each kernel 7
> times for a total of 3x7=21 samples per kernel. Then I've taken the mean:
> 
> 
> jobs=8:
> 
> | label      |   real |   user |   kernel |
> |:-----------|-------:|-------:|---------:|
> | baseline   |   0.0% |   0.0% |     0.0% |
> | utility    |  -2.7% |  -2.8% |    -3.1% |
> | alloc      |  -6.0% |  -2.3% |   -24.1% |
> | reuse      |  -9.5% |  -5.8% |   -28.5% |
> | copy       | -10.6% |  -6.9% |   -29.4% |
> | zero-alloc |  -9.2% |  -5.1% |   -29.8% |
> 
> 
> jobs=160:
> 
> | label      |   real |   user |   kernel |
> |:-----------|-------:|-------:|---------:|
> | baseline   |   0.0% |   0.0% |     0.0% |
> | utility    |  -1.8% |  -0.0% |    -7.7% |
> | alloc      |  -6.0% |   1.8% |   -20.9% |
> | reuse      |  -7.8% |  -1.6% |   -24.1% |
> | copy       |  -7.8% |  -2.5% |   -26.8% |
> | zero-alloc |  -7.7% |   1.5% |   -29.4% |
> 
> 
> So it looks like patch 10 (reuse) is making a difference, but copy and
> zero-alloc are not adding a huge amount, as you hypothesized. Personally I would
> prefer not to drop those patches though, as it will all help towards utilization
> of contiguous PTEs on arm64, which is the second part of the change that I'm now
> working on.

Yes, pretty much what I expected :) I can only suggest to

(1) Make the initial support as simple and minimal as possible. That
     means, strip anything that is not absolutely required. That is,
     exclude *at least* copy and zero-alloc. We can always add selected
     optimizations on top later.

     You'll do yourself a favor to get as much review coverage, faster
     review for inclusion, and fewer chances for nasty BUGs.

(2) Keep the COW logic simple. We've had too many issues
     in that area for my taste already. As 09854ba94c6a ("mm:
     do_wp_page() simplification") from Linus puts it: "Simplify,
     simplify, simplify.". If it doesn't add significant benefit, rather
     keep it simple.

> 
> 
> For the final config ("zero-alloc") I also collected stats on how many
> operations each of the 4 paths was performing, using ftrace and histograms.
> "pnr" is the number of pages allocated/reused/copied, and "fnr" is the number of
> pages in the source folio:
> 
> 
> do_anonymous_page:
> 
> { pnr:          1 } hitcount:    2749722
> { pnr:          4 } hitcount:     387832
> { pnr:          8 } hitcount:     409628
> { pnr:         16 } hitcount:    4296115
> 
> pages: 76315914
> faults: 7843297
> pages per fault: 9.7
> 
> 
> wp_page_reuse (anon):
> 
> { pnr:          1, fnr:          1 } hitcount:      47887
> { pnr:          3, fnr:          4 } hitcount:          2
> { pnr:          4, fnr:          4 } hitcount:       6131
> { pnr:          6, fnr:          8 } hitcount:          1
> { pnr:          7, fnr:          8 } hitcount:         10
> { pnr:          8, fnr:          8 } hitcount:       3794
> { pnr:          1, fnr:         16 } hitcount:         36
> { pnr:          2, fnr:         16 } hitcount:         23
> { pnr:          3, fnr:         16 } hitcount:          5
> { pnr:          4, fnr:         16 } hitcount:          9
> { pnr:          5, fnr:         16 } hitcount:          8
> { pnr:          6, fnr:         16 } hitcount:          9
> { pnr:          7, fnr:         16 } hitcount:          3
> { pnr:          8, fnr:         16 } hitcount:         24
> { pnr:          9, fnr:         16 } hitcount:          2
> { pnr:         10, fnr:         16 } hitcount:          1
> { pnr:         11, fnr:         16 } hitcount:          9
> { pnr:         12, fnr:         16 } hitcount:          2
> { pnr:         13, fnr:         16 } hitcount:         27
> { pnr:         14, fnr:         16 } hitcount:          2
> { pnr:         15, fnr:         16 } hitcount:         54
> { pnr:         16, fnr:         16 } hitcount:       6673
> 
> pages: 211393
> faults: 64712
> pages per fault: 3.3
> 
> 
> wp_page_copy (anon):
> 
> { pnr:          1, fnr:          1 } hitcount:      81242
> { pnr:          4, fnr:          4 } hitcount:       5974
> { pnr:          1, fnr:          8 } hitcount:          1
> { pnr:          4, fnr:          8 } hitcount:          1
> { pnr:          8, fnr:          8 } hitcount:      12933
> { pnr:          1, fnr:         16 } hitcount:         19
> { pnr:          4, fnr:         16 } hitcount:          3
> { pnr:          8, fnr:         16 } hitcount:          7
> { pnr:         16, fnr:         16 } hitcount:       4106
> 
> pages: 274390
> faults: 104286
> pages per fault: 2.6
> 
> 
> wp_page_copy (zero):
> 
> { pnr:          1 } hitcount:     178699
> { pnr:          4 } hitcount:      14498
> { pnr:          8 } hitcount:      23644
> { pnr:         16 } hitcount:     257940
> 
> pages: 4552883
> faults: 474781
> pages per fault: 9.6

I'll have to set aside more time to digest these values :)

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory
  2023-05-17 13:58             ` David Hildenbrand
@ 2023-05-18 11:23               ` Ryan Roberts
  0 siblings, 0 replies; 44+ messages in thread
From: Ryan Roberts @ 2023-05-18 11:23 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Matthew Wilcox (Oracle),
	Yu Zhao, Yin, Fengwei
  Cc: linux-mm, linux-arm-kernel

On 17/05/2023 14:58, David Hildenbrand wrote:
> On 26.04.23 12:41, Ryan Roberts wrote:
>> Hi David,
>>
>> On 17/04/2023 16:44, David Hildenbrand wrote:
>>
>>>>>>>
>>>>>>> So what should be safe is replacing all sub-pages of a folio that are marked
>>>>>>> "maybe shared" by a new folio under PT lock. However, I wonder if it's
>>>>>>> really
>>>>>>> worth the complexity. For THP we were happy so far to *not* optimize this,
>>>>>>> implying that maybe we shouldn't worry about optimizing the fork() case for
>>>>>>> now
>>>>>>> that heavily.
>>>>>>
>>>>>> I don't have the exact numbers to hand, but I'm pretty sure I remember
>>>>>> enabling
>>>>>> large copies was contributing a measurable amount to the performance
>>>>>> improvement. (Certainly, the zero-page copy case is definitely a big
>>>>>> contributor). I don't have access to the HW at the moment but can rerun later
>>>>>> with and without to double check.
>>>>>
>>>>> In which test exactly? Some micro-benchmark?
>>>>
>>>> The kernel compile benchmark that I quoted numbers for in the cover letter. I
>>>> have some trace points (not part of the submitted series) that tell me how many
>>>> mappings of each order we get for each code path. I'm pretty sure I remember
>>>> all
>>>> of these 4 code paths contributing non-negligible amounts.
>>>
>>> Interesting! It would be great to see if there is an actual difference after
>>> patch #10 was applied without the other COW replacement.
>>>
>>
>> Sorry about the delay. I now have some numbers for this...
>>
> 
> Ditto, I'm swamped :) Thanks for running these benchmarks!
> 
> As LSF/MM reminded me again of this topic ...
> 
>> I rearranged the patch order so that all the "utility" stuff (new rmap
>> functions, etc) are first (1, 2, 3, 4, 5, 8, 9, 11, 12, 13), followed by a
>> couple of general improvements (7, 17), which should be dormant until we have
>> the final patches, then finally (6, 10, 14, 15), which implement large anon
>> folios for the allocate, reuse, copy-non-zero and copy-zero paths respectively. I've
>> dropped patch 16 and fixed the copy-exclusive bug you spotted (by ensuring we
>> never replace an exclusive page).
>>
>> I've measured performance at the following locations in the patch set:
>>
>> - baseline: none of my patches applied
>> - utility: has utility and general improvement patches applied
>> - alloc: utility + 6
>> - reuse: utility + 6 + 10
>> - copy: utility + 6 + 10 + 14
>> - zero-alloc: utility + 6 + 10 + 14 + 15
>>
>> The test is `make defconfig && time make -jN Image` for a clean checkout of
>> v6.3-rc3. The first result is thrown away, and the next 3 are kept. I saw some
>> per-boot variance (probably down to kaslr, etc). So have booted each kernel 7
>> times for a total of 3x7=21 samples per kernel. Then I've taken the mean:
>>
>>
>> jobs=8:
>>
>> | label      |   real |   user |   kernel |
>> |:-----------|-------:|-------:|---------:|
>> | baseline   |   0.0% |   0.0% |     0.0% |
>> | utility    |  -2.7% |  -2.8% |    -3.1% |
>> | alloc      |  -6.0% |  -2.3% |   -24.1% |
>> | reuse      |  -9.5% |  -5.8% |   -28.5% |
>> | copy       | -10.6% |  -6.9% |   -29.4% |
>> | zero-alloc |  -9.2% |  -5.1% |   -29.8% |
>>
>>
>> jobs=160:
>>
>> | label      |   real |   user |   kernel |
>> |:-----------|-------:|-------:|---------:|
>> | baseline   |   0.0% |   0.0% |     0.0% |
>> | utility    |  -1.8% |  -0.0% |    -7.7% |
>> | alloc      |  -6.0% |   1.8% |   -20.9% |
>> | reuse      |  -7.8% |  -1.6% |   -24.1% |
>> | copy       |  -7.8% |  -2.5% |   -26.8% |
>> | zero-alloc |  -7.7% |   1.5% |   -29.4% |
>>
>>
>> So it looks like patch 10 (reuse) is making a difference, but copy and
>> zero-alloc are not adding a huge amount, as you hypothesized. Personally I would
>> prefer not to drop those patches though, as it will all help towards utilization
>> of contiguous PTEs on arm64, which is the second part of the change that I'm now
>> working on.
> 
> Yes, pretty much what I expected :) I can only suggest to
> 
> (1) Make the initial support as simple and minimal as possible. That
>     means, strip anything that is not absolutely required. That is,
>     exclude *at least* copy and zero-alloc. We can always add selected
>     optimizations on top later.
> 
>     You'll do yourself a favor to get as much review coverage, faster
>     review for inclusion, and fewer chances for nasty BUGs.

As I'm building out the testing capability, I'm seeing that with different HW
configs and workloads, things move a bit and zero-alloc can account for up to 1%
in some cases. Copy is still pretty marginal, but I wonder if we might see more
value from it on Android where the Zygote is constantly forked?

Regardless, I appreciate your point about making the initial contribution as
small and simple as possible, so as I get closer to posting a v1, I'll keep it
in mind and make sure I follow your advice.

Thanks,
Ryan

> 
> (2) Keep the COW logic simple. We've had too many issues
>     in that area for my taste already. As 09854ba94c6a ("mm:
>     do_wp_page() simplification") from Linus puts it: "Simplify,
>     simplify, simplify.". If it doesn't add significant benefit, rather
>     keep it simple.
> 
>>
>>
>> For the final config ("zero-alloc") I also collected stats on how many
>> operations each of the 4 paths was performing, using ftrace and histograms.
>> "pnr" is the number of pages allocated/reused/copied, and "fnr" is the number of
>> pages in the source folio:
>>
>>
>> do_anonymous_page:
>>
>> { pnr:          1 } hitcount:    2749722
>> { pnr:          4 } hitcount:     387832
>> { pnr:          8 } hitcount:     409628
>> { pnr:         16 } hitcount:    4296115
>>
>> pages: 76315914
>> faults: 7843297
>> pages per fault: 9.7
>>
>>
>> wp_page_reuse (anon):
>>
>> { pnr:          1, fnr:          1 } hitcount:      47887
>> { pnr:          3, fnr:          4 } hitcount:          2
>> { pnr:          4, fnr:          4 } hitcount:       6131
>> { pnr:          6, fnr:          8 } hitcount:          1
>> { pnr:          7, fnr:          8 } hitcount:         10
>> { pnr:          8, fnr:          8 } hitcount:       3794
>> { pnr:          1, fnr:         16 } hitcount:         36
>> { pnr:          2, fnr:         16 } hitcount:         23
>> { pnr:          3, fnr:         16 } hitcount:          5
>> { pnr:          4, fnr:         16 } hitcount:          9
>> { pnr:          5, fnr:         16 } hitcount:          8
>> { pnr:          6, fnr:         16 } hitcount:          9
>> { pnr:          7, fnr:         16 } hitcount:          3
>> { pnr:          8, fnr:         16 } hitcount:         24
>> { pnr:          9, fnr:         16 } hitcount:          2
>> { pnr:         10, fnr:         16 } hitcount:          1
>> { pnr:         11, fnr:         16 } hitcount:          9
>> { pnr:         12, fnr:         16 } hitcount:          2
>> { pnr:         13, fnr:         16 } hitcount:         27
>> { pnr:         14, fnr:         16 } hitcount:          2
>> { pnr:         15, fnr:         16 } hitcount:         54
>> { pnr:         16, fnr:         16 } hitcount:       6673
>>
>> pages: 211393
>> faults: 64712
>> pages per fault: 3.3
>>
>>
>> wp_page_copy (anon):
>>
>> { pnr:          1, fnr:          1 } hitcount:      81242
>> { pnr:          4, fnr:          4 } hitcount:       5974
>> { pnr:          1, fnr:          8 } hitcount:          1
>> { pnr:          4, fnr:          8 } hitcount:          1
>> { pnr:          8, fnr:          8 } hitcount:      12933
>> { pnr:          1, fnr:         16 } hitcount:         19
>> { pnr:          4, fnr:         16 } hitcount:          3
>> { pnr:          8, fnr:         16 } hitcount:          7
>> { pnr:         16, fnr:         16 } hitcount:       4106
>>
>> pages: 274390
>> faults: 104286
>> pages per fault: 2.6
>>
>>
>> wp_page_copy (zero):
>>
>> { pnr:          1 } hitcount:     178699
>> { pnr:          4 } hitcount:      14498
>> { pnr:          8 } hitcount:      23644
>> { pnr:         16 } hitcount:     257940
>>
>> pages: 4552883
>> faults: 474781
>> pages per fault: 9.6
> 
> I'll have to set aside more time to digest these values :)
> 



^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2023-05-18 11:24 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-14 13:02 [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 01/17] mm: Expose clear_huge_page() unconditionally Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 02/17] mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio() Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 03/17] mm: Introduce try_vma_alloc_movable_folio() Ryan Roberts
2023-04-17  8:49   ` Yin, Fengwei
2023-04-17 10:11     ` Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 04/17] mm: Implement folio_add_new_anon_rmap_range() Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 05/17] mm: Routines to determine max anon folio allocation order Ryan Roberts
2023-04-14 14:09   ` Kirill A. Shutemov
2023-04-14 14:38     ` Ryan Roberts
2023-04-14 15:37       ` Kirill A. Shutemov
2023-04-14 16:06         ` Ryan Roberts
2023-04-14 16:18           ` Matthew Wilcox
2023-04-14 16:31             ` Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 06/17] mm: Allocate large folios for anonymous memory Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 07/17] mm: Allow deferred splitting of arbitrary large anon folios Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 08/17] mm: Implement folio_move_anon_rmap_range() Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 09/17] mm: Update wp_page_reuse() to operate on range of pages Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 10/17] mm: Reuse large folios for anonymous memory Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 11/17] mm: Split __wp_page_copy_user() into 2 variants Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 12/17] mm: ptep_clear_flush_range_notify() macro for batch operation Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 13/17] mm: Implement folio_remove_rmap_range() Ryan Roberts
2023-04-14 13:03 ` [RFC v2 PATCH 14/17] mm: Copy large folios for anonymous memory Ryan Roberts
2023-04-14 13:03 ` [RFC v2 PATCH 15/17] mm: Convert zero page to large folios on write Ryan Roberts
2023-04-14 13:03 ` [RFC v2 PATCH 16/17] mm: mmap: Align unhinted maps to highest anon folio order Ryan Roberts
2023-04-17  8:25   ` Yin, Fengwei
2023-04-17 10:13     ` Ryan Roberts
2023-04-14 13:03 ` [RFC v2 PATCH 17/17] mm: Batch-zap large anonymous folio PTE mappings Ryan Roberts
2023-04-17  8:04 ` [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Yin, Fengwei
2023-04-17 10:19   ` Ryan Roberts
2023-04-17  8:19 ` Yin, Fengwei
2023-04-17 10:28   ` Ryan Roberts
2023-04-17 10:54 ` David Hildenbrand
2023-04-17 11:43   ` Ryan Roberts
2023-04-17 14:05     ` David Hildenbrand
2023-04-17 15:38       ` Ryan Roberts
2023-04-17 15:44         ` David Hildenbrand
2023-04-17 16:15           ` Ryan Roberts
2023-04-26 10:41           ` Ryan Roberts
2023-05-17 13:58             ` David Hildenbrand
2023-05-18 11:23               ` Ryan Roberts
2023-04-19 10:12       ` Ryan Roberts
2023-04-19 10:51         ` David Hildenbrand
2023-04-19 11:13           ` Ryan Roberts

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).