* [PATCH v1 00/10] variable-order, large folios for anonymous memory
@ 2023-06-26 17:14 ` Ryan Roberts
  0 siblings, 0 replies; 148+ messages in thread
From: Ryan Roberts @ 2023-06-26 17:14 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Yu Zhao,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin
  Cc: Ryan Roberts, linux-kernel, linux-mm, linux-alpha,
	linux-arm-kernel, linux-ia64, linux-m68k, linux-s390

Hi All,

Following on from the previous RFCv2 [1], this series implements
variable-order, large folios for anonymous memory. The objective is to improve
performance by allocating larger chunks of memory during anonymous page
faults:

 - Since SW (the kernel) is dealing with larger chunks of memory than base
   pages, there are efficiency savings to be had: fewer page faults, batched
   PTE and RMAP manipulation, fewer items on lists, etc. In short, we reduce
   kernel overhead. This should benefit all architectures.
 - Since we are now mapping physically contiguous chunks of memory, we can
   take advantage of HW TLB compression techniques. A reduction in TLB
   pressure speeds up kernel and user space. arm64 systems have 2 mechanisms
   to coalesce TLB entries: "the contiguous bit" (architectural) and HPA
   (uarch).

This patch set deals with the SW side of things only and, based on feedback
from the RFC, aims to be the most minimal initial change, upon which future
incremental changes can be added. For this reason, the new behaviour is hidden
behind a new Kconfig switch, CONFIG_LARGE_ANON_FOLIO, which is disabled by
default. Although the code has been refactored to parameterize the desired
allocation order, when the feature is disabled (by forcing the order to always
be 0) my performance tests measure no regression. So I'm hoping this will be a
suitable mechanism to allow incremental submissions to the kernel without
affecting the rest of the world.
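
To make the "force the order to 0" mechanism concrete, below is a minimal
sketch of the idea. The macro and helper names are hypothetical; the real
Kconfig hooks and arch opt-in are introduced by patches 8-10 of this series
and may differ in detail.

#ifdef CONFIG_LARGE_ANON_FOLIO
/* Hypothetical arch-provided limit; the real value comes from Kconfig. */
#define EXAMPLE_MAX_ANON_FOLIO_ORDER    4
#else
/* Feature disabled: force order-0, i.e. behave exactly as today. */
#define EXAMPLE_MAX_ANON_FOLIO_ORDER    0
#endif

static inline int example_max_anon_folio_order(void)
{
        /* With the feature off this is always 0: single pages as before. */
        return EXAMPLE_MAX_ANON_FOLIO_ORDER;
}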

The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
[2], which is a hard dependency. I'm not sure of Matthew's exact plans for
getting that series into the kernel, but I'm hoping we can start the review
process on this patch set independently. I have a branch at [3].

I've posted a separate series concerning the HW part (contpte mapping) for arm64
at [4].


Performance
-----------

The results below cover 2 benchmarks: kernel compilation and Speedometer 2.0
(a JavaScript benchmark running in Chromium). Both are run on an Ampere Altra
with 1 NUMA node enabled, Ubuntu 22.04 and an XFS filesystem. Each benchmark
is repeated 15 times over 5 reboots and the results averaged.

All improvements are relative to baseline-4k. 'anonfolio-basic' is this series.
'anonfolio' is the full patch set similar to the RFC with the additional changes
to the extra 3 fault paths. The rest of the configs are described at [4].

Kernel Compilation (smaller is better):

| kernel          |   real-time |   kern-time |   user-time |
|:----------------|------------:|------------:|------------:|
| baseline-4k     |        0.0% |        0.0% |        0.0% |
| anonfolio-basic |       -5.3% |      -42.9% |       -0.6% |
| anonfolio       |       -5.4% |      -46.0% |       -0.3% |
| contpte         |       -6.8% |      -45.7% |       -2.1% |
| exefolio        |       -8.4% |      -46.4% |       -3.7% |
| baseline-16k    |       -8.7% |      -49.2% |       -3.7% |
| baseline-64k    |      -10.5% |      -66.0% |       -3.5% |

Speedometer 2.0 (bigger is better):

| kernel          |   runs_per_min |
|:----------------|---------------:|
| baseline-4k     |           0.0% |
| anonfolio-basic |           0.7% |
| anonfolio       |           1.2% |
| contpte         |           3.1% |
| exefolio        |           4.2% |
| baseline-16k    |           5.3% |


Changes since RFCv2
-------------------

  - Simplified series to bare minimum (on David Hildenbrand's advice)
      - Removed changes to 3 fault paths:
          - write fault on zero page: wp_page_copy()
          - write fault on non-exclusive CoW page: wp_page_copy()
          - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()
      - Only 1 fault path change remains:
          - write fault on unallocated address: do_anonymous_page()
      - Removed support patches that are no longer needed
  - Added Kconfig CONFIG_LARGE_ANON_FOLIO and friends
      - Whole feature defaults to off
      - Arch opts-in to allowing feature and provides max allocation order


Future Work
-----------

Once this series is in, there are some more incremental changes I plan to follow
up with:

  - Add the other 3 fault path changes back in
  - Properly support pte-mapped folios in:
      - numa balancing (do_numa_page())
      - madvise() (fix assumptions about exclusivity for large folios)
      - compaction (although I think this is already a problem for large folios
        in the file cache, so perhaps someone is working on it?)


[1] https://lore.kernel.org/linux-mm/20230414130303.2345383-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
[3] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anonfolio-lkml_v1
[4] https://lore.kernel.org/linux-arm-kernel/20230622144210.2623299-1-ryan.roberts@arm.com/

Thanks,
Ryan


Ryan Roberts (10):
  mm: Expose clear_huge_page() unconditionally
  mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
  mm: Introduce try_vma_alloc_movable_folio()
  mm: Implement folio_add_new_anon_rmap_range()
  mm: Implement folio_remove_rmap_range()
  mm: Allow deferred splitting of arbitrary large anon folios
  mm: Batch-zap large anonymous folio PTE mappings
  mm: Kconfig hooks to determine max anon folio allocation order
  arm64: mm: Declare support for large anonymous folios
  mm: Allocate large folios for anonymous memory

 arch/alpha/include/asm/page.h   |   5 +-
 arch/arm64/Kconfig              |  13 ++
 arch/arm64/include/asm/page.h   |   3 +-
 arch/arm64/mm/fault.c           |   7 +-
 arch/ia64/include/asm/page.h    |   5 +-
 arch/m68k/include/asm/page_no.h |   7 +-
 arch/s390/include/asm/page.h    |   5 +-
 arch/x86/include/asm/page.h     |   5 +-
 include/linux/highmem.h         |  23 ++-
 include/linux/mm.h              |   3 +-
 include/linux/rmap.h            |   4 +
 mm/Kconfig                      |  39 ++++
 mm/memory.c                     | 324 ++++++++++++++++++++++++++++++--
 mm/rmap.c                       | 107 ++++++++++-
 14 files changed, 506 insertions(+), 44 deletions(-)

--
2.25.1


^ permalink raw reply	[flat|nested] 148+ messages in thread

* [PATCH v1 01/10] mm: Expose clear_huge_page() unconditionally
  2023-06-26 17:14 ` Ryan Roberts
@ 2023-06-26 17:14   ` Ryan Roberts
  -1 siblings, 0 replies; 148+ messages in thread
From: Ryan Roberts @ 2023-06-26 17:14 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Yu Zhao,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin
  Cc: Ryan Roberts, linux-kernel, linux-mm, linux-alpha,
	linux-arm-kernel, linux-ia64, linux-m68k, linux-s390

In preparation for extending vma_alloc_zeroed_movable_folio() to
allocate an arbitrary-order folio, expose clear_huge_page()
unconditionally, so that it can be used to zero the allocated folio in
the generic implementation of vma_alloc_zeroed_movable_folio().

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/mm.h | 3 ++-
 mm/memory.c        | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7f1741bd870a..7e3bf45e6491 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3684,10 +3684,11 @@ enum mf_action_page_type {
  */
 extern const struct attribute_group memory_failure_attr_group;
 
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 extern void clear_huge_page(struct page *page,
 			    unsigned long addr_hint,
 			    unsigned int pages_per_huge_page);
+
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 int copy_user_large_folio(struct folio *dst, struct folio *src,
 			  unsigned long addr_hint,
 			  struct vm_area_struct *vma);
diff --git a/mm/memory.c b/mm/memory.c
index fb30f7523550..3d4ea668c4d1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5741,7 +5741,6 @@ void __might_fault(const char *file, int line)
 EXPORT_SYMBOL(__might_fault);
 #endif
 
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 /*
  * Process all subpages of the specified huge page with the specified
  * operation.  The target subpage will be processed last to keep its
@@ -5839,6 +5838,7 @@ void clear_huge_page(struct page *page,
 	process_huge_page(addr_hint, pages_per_huge_page, clear_subpage, page);
 }
 
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
 				     unsigned long addr,
 				     struct vm_area_struct *vma,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v1 02/10] mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
  2023-06-26 17:14 ` Ryan Roberts
@ 2023-06-26 17:14   ` Ryan Roberts
  -1 siblings, 0 replies; 148+ messages in thread
From: Ryan Roberts @ 2023-06-26 17:14 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Yu Zhao,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin
  Cc: Ryan Roberts, linux-kernel, linux-mm, linux-alpha,
	linux-arm-kernel, linux-ia64, linux-m68k, linux-s390

Allow allocation of large folios with vma_alloc_zeroed_movable_folio().
This prepares the ground for large anonymous folios. The generic
implementation of vma_alloc_zeroed_movable_folio() now uses
clear_huge_page() to zero the allocated folio, since it may now be of
non-zero order.

Currently the function is always called with order 0 and no extra gfp
flags, so no functional change intended. But a subsequent commit will
take advantage of the new parameters to allocate large folios. The extra
gfp flags will be used to control the reclaim policy.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/alpha/include/asm/page.h   |  5 +++--
 arch/arm64/include/asm/page.h   |  3 ++-
 arch/arm64/mm/fault.c           |  7 ++++---
 arch/ia64/include/asm/page.h    |  5 +++--
 arch/m68k/include/asm/page_no.h |  7 ++++---
 arch/s390/include/asm/page.h    |  5 +++--
 arch/x86/include/asm/page.h     |  5 +++--
 include/linux/highmem.h         | 23 +++++++++++++----------
 mm/memory.c                     |  5 +++--
 9 files changed, 38 insertions(+), 27 deletions(-)

diff --git a/arch/alpha/include/asm/page.h b/arch/alpha/include/asm/page.h
index 4db1ebc0ed99..6fc7fe91b6cb 100644
--- a/arch/alpha/include/asm/page.h
+++ b/arch/alpha/include/asm/page.h
@@ -17,8 +17,9 @@
 extern void clear_page(void *page);
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 
-#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
-	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
+#define vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order) \
+	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO | (gfp), \
+			order, vma, vaddr, false)
 
 extern void copy_page(void * _to, void * _from);
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 2312e6ee595f..47710852f872 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -30,7 +30,8 @@ void copy_highpage(struct page *to, struct page *from);
 #define __HAVE_ARCH_COPY_HIGHPAGE
 
 struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
-						unsigned long vaddr);
+						unsigned long vaddr,
+						gfp_t gfp, int order);
 #define vma_alloc_zeroed_movable_folio vma_alloc_zeroed_movable_folio
 
 void tag_clear_highpage(struct page *to);
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 6045a5117ac1..0a43c3b3f190 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -961,9 +961,10 @@ NOKPROBE_SYMBOL(do_debug_exception);
  * Used during anonymous page fault handling.
  */
 struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
-						unsigned long vaddr)
+						unsigned long vaddr,
+						gfp_t gfp, int order)
 {
-	gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO;
+	gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO | gfp;
 
 	/*
 	 * If the page is mapped with PROT_MTE, initialise the tags at the
@@ -973,7 +974,7 @@ struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
 	if (vma->vm_flags & VM_MTE)
 		flags |= __GFP_ZEROTAGS;
 
-	return vma_alloc_folio(flags, 0, vma, vaddr, false);
+	return vma_alloc_folio(flags, order, vma, vaddr, false);
 }
 
 void tag_clear_highpage(struct page *page)
diff --git a/arch/ia64/include/asm/page.h b/arch/ia64/include/asm/page.h
index 310b09c3342d..ebdf04274023 100644
--- a/arch/ia64/include/asm/page.h
+++ b/arch/ia64/include/asm/page.h
@@ -82,10 +82,11 @@ do {						\
 } while (0)
 
 
-#define vma_alloc_zeroed_movable_folio(vma, vaddr)			\
+#define vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order)		\
 ({									\
 	struct folio *folio = vma_alloc_folio(				\
-		GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false); \
+		GFP_HIGHUSER_MOVABLE | __GFP_ZERO | (gfp),		\
+		order, vma, vaddr, false);				\
 	if (folio)							\
 		flush_dcache_folio(folio);				\
 	folio;								\
diff --git a/arch/m68k/include/asm/page_no.h b/arch/m68k/include/asm/page_no.h
index 060e4c0e7605..4a2fe57fef5e 100644
--- a/arch/m68k/include/asm/page_no.h
+++ b/arch/m68k/include/asm/page_no.h
@@ -3,7 +3,7 @@
 #define _M68K_PAGE_NO_H
 
 #ifndef __ASSEMBLY__
- 
+
 extern unsigned long memory_start;
 extern unsigned long memory_end;
 
@@ -13,8 +13,9 @@ extern unsigned long memory_end;
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
 
-#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
-	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
+#define vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order) \
+	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO | (gfp), \
+			order, vma, vaddr, false)
 
 #define __pa(vaddr)		((unsigned long)(vaddr))
 #define __va(paddr)		((void *)((unsigned long)(paddr)))
diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h
index 8a2a3b5d1e29..b749564140f1 100644
--- a/arch/s390/include/asm/page.h
+++ b/arch/s390/include/asm/page.h
@@ -73,8 +73,9 @@ static inline void copy_page(void *to, void *from)
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
 
-#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
-	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
+#define vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order) \
+	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO | (gfp), \
+			order, vma, vaddr, false)
 
 /*
  * These are used to make use of C type-checking..
diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index d18e5c332cb9..34deab1a8dae 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -34,8 +34,9 @@ static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
 	copy_page(to, from);
 }
 
-#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
-	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
+#define vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order) \
+	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO | (gfp), \
+			order, vma, vaddr, false)
 
 #ifndef __pa
 #define __pa(x)		__phys_addr((unsigned long)(x))
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 4de1dbcd3ef6..b9a9b0340557 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -209,26 +209,29 @@ static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
 
 #ifndef vma_alloc_zeroed_movable_folio
 /**
- * vma_alloc_zeroed_movable_folio - Allocate a zeroed page for a VMA.
- * @vma: The VMA the page is to be allocated for.
- * @vaddr: The virtual address the page will be inserted into.
- *
- * This function will allocate a page suitable for inserting into this
- * VMA at this virtual address.  It may be allocated from highmem or
+ * vma_alloc_zeroed_movable_folio - Allocate a zeroed folio for a VMA.
+ * @vma: The start VMA the folio is to be allocated for.
+ * @vaddr: The virtual address the folio will be inserted into.
+ * @gfp: Additional gfp flags to mix in or 0.
+ * @order: The order of the folio (2^order pages).
+ *
+ * This function will allocate a folio suitable for inserting into this
+ * VMA starting at this virtual address.  It may be allocated from highmem or
  * the movable zone.  An architecture may provide its own implementation.
  *
- * Return: A folio containing one allocated and zeroed page or NULL if
+ * Return: A folio containing 2^order allocated and zeroed pages or NULL if
  * we are out of memory.
  */
 static inline
 struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
-				   unsigned long vaddr)
+				   unsigned long vaddr, gfp_t gfp, int order)
 {
 	struct folio *folio;
 
-	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vaddr, false);
+	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE | gfp,
+					order, vma, vaddr, false);
 	if (folio)
-		clear_user_highpage(&folio->page, vaddr);
+		clear_huge_page(&folio->page, vaddr, 1U << order);
 
 	return folio;
 }
diff --git a/mm/memory.c b/mm/memory.c
index 3d4ea668c4d1..367bbbb29d91 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3073,7 +3073,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		goto oom;
 
 	if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
-		new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+		new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address,
+									0, 0);
 		if (!new_folio)
 			goto oom;
 	} else {
@@ -4087,7 +4088,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	/* Allocate our own private page. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address, 0, 0);
 	if (!folio)
 		goto oom;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v1 03/10] mm: Introduce try_vma_alloc_movable_folio()
  2023-06-26 17:14 ` Ryan Roberts
@ 2023-06-26 17:14   ` Ryan Roberts
  -1 siblings, 0 replies; 148+ messages in thread
From: Ryan Roberts @ 2023-06-26 17:14 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Yu Zhao,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin
  Cc: Ryan Roberts, linux-kernel, linux-mm, linux-alpha,
	linux-arm-kernel, linux-ia64, linux-m68k, linux-s390

Opportunistically attempt to allocate high-order folios in highmem,
optionally zeroed. Retry with lower orders all the way to order-0, until
success. Note that order-1 allocations are skipped, since a large folio
must be at least order-2 to work with the THP machinery. The user must
check what they got with folio_order().

This will be used to opportunistically allocate large folios for
anonymous memory, with a sensible fallback under memory pressure.

For attempts to allocate non-0 orders, we set __GFP_NORETRY to prevent
high latency due to reclaim, instead preferring to just try for a lower
order. The same approach is used by the readahead code when allocating
large folios.
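
As a rough illustration of the intended calling pattern (the caller below is
hypothetical; the real call site is added by the final patch of this series),
a user passes its preferred order and then checks what it actually got:

/* Hypothetical caller, for illustration only. */
static struct folio *example_alloc_anon_folio(struct vm_area_struct *vma,
                                              unsigned long vaddr, int order)
{
        struct folio *folio;

        folio = try_vma_alloc_movable_folio(vma, vaddr, order, true);
        if (!folio)
                return NULL;    /* even the order-0 fallback failed */

        if (folio_order(folio) > 0) {
                /* Got a large folio: batch-map folio_nr_pages(folio) ptes. */
        } else {
                /* Fell back to a single page: map one pte as today. */
        }

        return folio;
}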

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/memory.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 367bbbb29d91..53896d46e686 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3001,6 +3001,39 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
 	return 0;
 }
 
+static inline struct folio *vma_alloc_movable_folio(struct vm_area_struct *vma,
+				unsigned long vaddr, int order, bool zeroed)
+{
+	gfp_t gfp = order > 0 ? __GFP_NORETRY | __GFP_NOWARN : 0;
+
+	if (zeroed)
+		return vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order);
+	else
+		return vma_alloc_folio(GFP_HIGHUSER_MOVABLE | gfp, order, vma,
+								vaddr, false);
+}
+
+/*
+ * Opportunistically attempt to allocate high-order folios, retrying with lower
+ * orders all the way to order-0, until success. order-1 allocations are skipped
+ * since a folio must be at least order-2 to work with the THP machinery. The
+ * user must check what they got with folio_order(). vaddr can be any virtual
+ * address that will be mapped by the allocated folio.
+ */
+static struct folio *try_vma_alloc_movable_folio(struct vm_area_struct *vma,
+				unsigned long vaddr, int order, bool zeroed)
+{
+	struct folio *folio;
+
+	for (; order > 1; order--) {
+		folio = vma_alloc_movable_folio(vma, vaddr, order, zeroed);
+		if (folio)
+			return folio;
+	}
+
+	return vma_alloc_movable_folio(vma, vaddr, 0, zeroed);
+}
+
 /*
  * Handle write page faults for pages that can be reused in the current vma
  *
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v1 04/10] mm: Implement folio_add_new_anon_rmap_range()
  2023-06-26 17:14 ` Ryan Roberts
@ 2023-06-26 17:14   ` Ryan Roberts
  -1 siblings, 0 replies; 148+ messages in thread
From: Ryan Roberts @ 2023-06-26 17:14 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Yu Zhao,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin
  Cc: Ryan Roberts, linux-kernel, linux-mm, linux-alpha,
	linux-arm-kernel, linux-ia64, linux-m68k, linux-s390

Like folio_add_new_anon_rmap() but batch-rmaps a range of pages
belonging to a folio, for efficiency savings. All pages are accounted as
small pages.
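
As a rough illustration of the intended use (the fragment below is
hypothetical and not part of this patch; the real caller is added later in
the series), a fault handler that has just allocated a new anonymous folio
calls this once for the whole range, with the pte lock held, before
installing the ptes:

/*
 * Illustrative fragment only; assumes 'folio' is a freshly allocated
 * anonymous folio about to be mapped at the folio-aligned address 'addr'.
 */
static void example_map_new_anon_folio(struct vm_area_struct *vma,
                                       struct folio *folio, unsigned long addr)
{
        int nr = folio_nr_pages(folio);

        folio_ref_add(folio, nr - 1);   /* one reference already held */
        folio_add_new_anon_rmap_range(folio, &folio->page, nr, vma, addr);
        folio_add_lru_vma(folio, vma);
        /* ...then install 'nr' ptes for the folio and update the counters. */
}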

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/rmap.h |  2 ++
 mm/rmap.c            | 43 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index a3825ce81102..15433a3d0cbf 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -196,6 +196,8 @@ void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long address);
 void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
 		unsigned long address);
+void folio_add_new_anon_rmap_range(struct folio *folio, struct page *page,
+		int nr, struct vm_area_struct *vma, unsigned long address);
 void page_add_file_rmap(struct page *, struct vm_area_struct *,
 		bool compound);
 void folio_add_file_rmap_range(struct folio *, struct page *, unsigned int nr,
diff --git a/mm/rmap.c b/mm/rmap.c
index 1d8369549424..4050bcea7ae7 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1305,6 +1305,49 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 	__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
 }
 
+/**
+ * folio_add_new_anon_rmap_range - Add mapping to a set of pages within a new
+ * anonymous potentially large folio.
+ * @folio:      The folio containing the pages to be mapped
+ * @page:       First page in the folio to be mapped
+ * @nr:         Number of pages to be mapped
+ * @vma:        the vm area in which the mapping is added
+ * @address:    the user virtual address of the first page to be mapped
+ *
+ * Like folio_add_new_anon_rmap() but batch-maps a range of pages within a folio
+ * using non-THP accounting. Like folio_add_new_anon_rmap(), the inc-and-test is
+ * bypassed and the folio does not have to be locked. All pages in the folio are
+ * individually accounted.
+ *
+ * As the folio is new, it's assumed to be mapped exclusively by a single
+ * process.
+ */
+void folio_add_new_anon_rmap_range(struct folio *folio, struct page *page,
+		int nr, struct vm_area_struct *vma, unsigned long address)
+{
+	int i;
+
+	VM_BUG_ON_VMA(address < vma->vm_start ||
+		      address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
+	__folio_set_swapbacked(folio);
+
+	if (folio_test_large(folio)) {
+		/* increment count (starts at 0) */
+		atomic_set(&folio->_nr_pages_mapped, nr);
+	}
+
+	for (i = 0; i < nr; i++) {
+		/* increment count (starts at -1) */
+		atomic_set(&page->_mapcount, 0);
+		__page_set_anon_rmap(folio, page, vma, address, 1);
+		page++;
+		address += PAGE_SIZE;
+	}
+
+	__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
+
+}
+
 /**
  * folio_add_file_rmap_range - add pte mapping to page range of a folio
  * @folio:	The folio to add the mapping to
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v1 05/10] mm: Implement folio_remove_rmap_range()
  2023-06-26 17:14 ` Ryan Roberts
@ 2023-06-26 17:14   ` Ryan Roberts
  -1 siblings, 0 replies; 148+ messages in thread
From: Ryan Roberts @ 2023-06-26 17:14 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Yu Zhao,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin
  Cc: Ryan Roberts, linux-kernel, linux-mm, linux-alpha,
	linux-arm-kernel, linux-ia64, linux-m68k, linux-s390

Like page_remove_rmap() but batch-removes the rmap for a range of pages
belonging to a folio, for efficiency savings. All pages are accounted as
small pages.
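
As a rough illustration (the fragment below is hypothetical; the batched-zap
caller arrives in a later patch of this series), the intent is to replace a
loop of per-page calls with a single call covering the run of ptes that has
just been cleared, with the pte lock held:

/*
 * Illustrative fragment only; 'page' is the first of 'nr' pages from the
 * same folio whose ptes have just been cleared.
 */
static void example_zap_folio_range(struct vm_area_struct *vma,
                                    struct folio *folio, struct page *page,
                                    int nr)
{
        /*
         * Before: nr individual calls, i.e.
         *      for (i = 0; i < nr; i++)
         *              page_remove_rmap(page + i, vma, false);
         * After: one batched call for the whole range.
         */
        folio_remove_rmap_range(folio, page, nr, vma);
}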

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/rmap.h |  2 ++
 mm/rmap.c            | 62 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 15433a3d0cbf..50f50e4cb0f8 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -204,6 +204,8 @@ void folio_add_file_rmap_range(struct folio *, struct page *, unsigned int nr,
 		struct vm_area_struct *, bool compound);
 void page_remove_rmap(struct page *, struct vm_area_struct *,
 		bool compound);
+void folio_remove_rmap_range(struct folio *folio, struct page *page,
+		int nr, struct vm_area_struct *vma);
 
 void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long address, rmap_t flags);
diff --git a/mm/rmap.c b/mm/rmap.c
index 4050bcea7ae7..ac1d93d43f2b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1434,6 +1434,68 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
 	folio_add_file_rmap_range(folio, page, nr_pages, vma, compound);
 }
 
+/*
+ * folio_remove_rmap_range - take down pte mappings from a range of pages
+ * belonging to a folio. All pages are accounted as small pages.
+ * @folio:	folio that all pages belong to
+ * @page:       first page in range to remove mapping from
+ * @nr:		number of pages in range to remove mapping from
+ * @vma:        the vm area from which the mapping is removed
+ *
+ * The caller needs to hold the pte lock.
+ */
+void folio_remove_rmap_range(struct folio *folio, struct page *page,
+					int nr, struct vm_area_struct *vma)
+{
+	atomic_t *mapped = &folio->_nr_pages_mapped;
+	int nr_unmapped = 0;
+	int nr_mapped;
+	bool last;
+	enum node_stat_item idx;
+
+	VM_BUG_ON_FOLIO(folio_test_hugetlb(folio), folio);
+
+	if (!folio_test_large(folio)) {
+		/* Is this the page's last map to be removed? */
+		last = atomic_add_negative(-1, &page->_mapcount);
+		nr_unmapped = last;
+	} else {
+		for (; nr != 0; nr--, page++) {
+			/* Is this the page's last map to be removed? */
+			last = atomic_add_negative(-1, &page->_mapcount);
+			if (last) {
+				/* Page still mapped if folio mapped entirely */
+				nr_mapped = atomic_dec_return_relaxed(mapped);
+				if (nr_mapped < COMPOUND_MAPPED)
+					nr_unmapped++;
+			}
+		}
+	}
+
+	if (nr_unmapped) {
+		idx = folio_test_anon(folio) ? NR_ANON_MAPPED : NR_FILE_MAPPED;
+		__lruvec_stat_mod_folio(folio, idx, -nr_unmapped);
+
+		/*
+		 * Queue anon THP for deferred split if we have just unmapped at
+		 * least 1 page, while at least 1 page remains mapped.
+		 */
+		if (folio_test_large(folio) && folio_test_anon(folio))
+			if (nr_mapped)
+				deferred_split_folio(folio);
+	}
+
+	/*
+	 * It would be tidy to reset folio_test_anon mapping when fully
+	 * unmapped, but that might overwrite a racing page_add_anon_rmap
+	 * which increments mapcount after us but sets mapping before us:
+	 * so leave the reset to free_pages_prepare, and remember that
+	 * it's only reliable while mapped.
+	 */
+
+	munlock_vma_folio(folio, vma, false);
+}
+
 /**
  * page_remove_rmap - take down pte mapping from a page
  * @page:	page to remove mapping from
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v1 06/10] mm: Allow deferred splitting of arbitrary large anon folios
  2023-06-26 17:14 ` Ryan Roberts
@ 2023-06-26 17:14   ` Ryan Roberts
  -1 siblings, 0 replies; 148+ messages in thread
From: Ryan Roberts @ 2023-06-26 17:14 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Yu Zhao,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin
  Cc: Ryan Roberts, linux-kernel, linux-mm, linux-alpha,
	linux-arm-kernel, linux-ia64, linux-m68k, linux-s390

With the introduction of large folios for anonymous memory, we would
like to be able to split them when they have unmapped subpages, in order
to free those unused pages under memory pressure. So remove the
artificial requirement that the large folio be at least PMD-sized.
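
For example, with this change a 64K (order-4) anonymous folio that
becomes partially unmapped is now queued for deferred splitting, whereas
previously only PMD-sized folios (order-9 with 4K base pages) were
eligible; smaller large folios were simply left partially mapped.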

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/rmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index ac1d93d43f2b..3d11c5fb6090 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1567,7 +1567,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
 		 * page of the folio is unmapped and at least one page
 		 * is still mapped.
 		 */
-		if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
+		if (folio_test_large(folio) && folio_test_anon(folio))
 			if (!compound || nr < nr_pmdmapped)
 				deferred_split_folio(folio);
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v1 07/10] mm: Batch-zap large anonymous folio PTE mappings
  2023-06-26 17:14 ` Ryan Roberts
@ 2023-06-26 17:14   ` Ryan Roberts
  -1 siblings, 0 replies; 148+ messages in thread
From: Ryan Roberts @ 2023-06-26 17:14 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Yu Zhao,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin
  Cc: Ryan Roberts, linux-kernel, linux-mm, linux-alpha,
	linux-arm-kernel, linux-ia64, linux-m68k, linux-s390

This allows batching the rmap removal with folio_remove_rmap_range(),
which means we avoid spuriously adding a partially unmapped folio to the
deferred split queue in the common case, reducing split queue lock
contention.

Previously each page was removed from the rmap individually with
page_remove_rmap(). If the first page belonged to a large folio, this
would cause page_remove_rmap() to conclude that the folio was now
partially mapped and add the folio to the deferred split queue. But
subsequent calls would cause the folio to become fully unmapped, meaning
there is no value to adding it to the split queue.
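
To illustrate the shape of the change (a simplified sketch only, not
code taken from this series; the real zap path below also batches the
PTE clearing and TLB teardown):

	/* Before: the rmap is torn down one page at a time. */
	for (i = 0; i < nr; i++)
		page_remove_rmap(page + i, vma, false);

	/* After: a single ranged call covers the contiguous range. */
	folio_remove_rmap_range(folio, page, nr, vma);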

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/memory.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 119 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 53896d46e686..9165ed1b9fc2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -914,6 +914,57 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 	return 0;
 }
 
+static inline unsigned long page_addr(struct page *page,
+				struct page *anchor, unsigned long anchor_addr)
+{
+	unsigned long offset;
+	unsigned long addr;
+
+	offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
+	addr = anchor_addr + offset;
+
+	if (anchor > page) {
+		if (addr > anchor_addr)
+			return 0;
+	} else {
+		if (addr < anchor_addr)
+			return ULONG_MAX;
+	}
+
+	return addr;
+}
+
+static int calc_anon_folio_map_pgcount(struct folio *folio,
+				       struct page *page, pte_t *pte,
+				       unsigned long addr, unsigned long end)
+{
+	pte_t ptent;
+	int floops;
+	int i;
+	unsigned long pfn;
+
+	end = min(page_addr(&folio->page + folio_nr_pages(folio), page, addr),
+		  end);
+	floops = (end - addr) >> PAGE_SHIFT;
+	pfn = page_to_pfn(page);
+	pfn++;
+	pte++;
+
+	for (i = 1; i < floops; i++) {
+		ptent = ptep_get(pte);
+
+		if (!pte_present(ptent) ||
+		    pte_pfn(ptent) != pfn) {
+			return i;
+		}
+
+		pfn++;
+		pte++;
+	}
+
+	return floops;
+}
+
 /*
  * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
  * is required to copy this pte.
@@ -1379,6 +1430,44 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
 	pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
 }
 
+static unsigned long zap_anon_pte_range(struct mmu_gather *tlb,
+					struct vm_area_struct *vma,
+					struct page *page, pte_t *pte,
+					unsigned long addr, unsigned long end,
+					bool *full_out)
+{
+	struct folio *folio = page_folio(page);
+	struct mm_struct *mm = tlb->mm;
+	pte_t ptent;
+	int pgcount;
+	int i;
+	bool full;
+
+	pgcount = calc_anon_folio_map_pgcount(folio, page, pte, addr, end);
+
+	for (i = 0; i < pgcount;) {
+		ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+		tlb_remove_tlb_entry(tlb, pte, addr);
+		full = __tlb_remove_page(tlb, page, 0);
+
+		if (unlikely(page_mapcount(page) < 1))
+			print_bad_pte(vma, addr, ptent, page);
+
+		i++;
+		page++;
+		pte++;
+		addr += PAGE_SIZE;
+
+		if (unlikely(full))
+			break;
+	}
+
+	folio_remove_rmap_range(folio, page - i, i, vma);
+
+	*full_out = full;
+	return i;
+}
+
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pmd_t *pmd,
 				unsigned long addr, unsigned long end,
@@ -1415,6 +1504,36 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			page = vm_normal_page(vma, addr, ptent);
 			if (unlikely(!should_zap_page(details, page)))
 				continue;
+
+			/*
+			 * Batch zap large anonymous folio mappings. This allows
+			 * batching the rmap removal, which means we avoid
+			 * spuriously adding a partially unmapped folio to the
+			 * deferred split queue in the common case, which
+			 * reduces split queue lock contention. Require the VMA
+			 * to be anonymous to ensure that none of the PTEs in
+			 * the range require zap_install_uffd_wp_if_needed().
+			 */
+			if (page && PageAnon(page) && vma_is_anonymous(vma)) {
+				bool full;
+				int pgcount;
+
+				pgcount = zap_anon_pte_range(tlb, vma,
+						page, pte, addr, end, &full);
+
+				rss[mm_counter(page)] -= pgcount;
+				pgcount--;
+				pte += pgcount;
+				addr += pgcount << PAGE_SHIFT;
+
+				if (unlikely(full)) {
+					force_flush = 1;
+					addr += PAGE_SIZE;
+					break;
+				}
+				continue;
+			}
+
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
 			tlb_remove_tlb_entry(tlb, pte, addr);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v1 08/10] mm: Kconfig hooks to determine max anon folio allocation order
  2023-06-26 17:14 ` Ryan Roberts
@ 2023-06-26 17:14   ` Ryan Roberts
  -1 siblings, 0 replies; 148+ messages in thread
From: Ryan Roberts @ 2023-06-26 17:14 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Yu Zhao,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin
  Cc: Ryan Roberts, linux-kernel, linux-mm, linux-alpha,
	linux-arm-kernel, linux-ia64, linux-m68k, linux-s390

For variable-order anonymous folios, we need to determine the order that
we will allocate. From a SW perspective, the higher the order we
allocate, the less overhead we will have; fewer faults, fewer folios in
lists, etc. But of course there will also be more memory wastage as the
order increases.

From a HW perspective, there are memory block sizes that can be
beneficial to reducing TLB pressure. arm64, for example, has the ability
to map "contpte" sized chunks (64K for a 4K base page, 2M for 16K and
64K base pages) such that one of these chunks only uses a single TLB
entry.

So we let the architecture specify the order of the maximally beneficial
mapping unit when PTE-mapped. Furthermore, because in some cases this
order may be quite big (and therefore potentially wasteful of memory),
allow the arch to specify 2 values: one is the max order for a mapping
that _would not_ use THP if all size and alignment constraints were met,
and the other is the max order for a mapping that _would_ use THP if all
those constraints were met.

Implement this with Kconfig by introducing some new options to allow the
architecture to declare that it supports large anonymous folios along
with these 2 preferred max order values. Then introduce a user-facing
option, LARGE_ANON_FOLIO, which defaults to disabled and can only be
enabled if the architecture has declared its support. When disabled, it
forces the max order values, LARGE_ANON_FOLIO_NOTHP_ORDER_MAX and
LARGE_ANON_FOLIO_THP_ORDER_MAX to 0, meaning only a single page is ever
allocated.
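
As a concrete example of how this is consumed, the arm64 patch later in
this series selects ARCH_SUPPORTS_LARGE_ANON_FOLIO and provides the two
arch values per page size (e.g. 4 and 4 for 4K pages, i.e. 64K folios in
both cases). With LARGE_ANON_FOLIO=n, both LARGE_ANON_FOLIO_*_ORDER_MAX
symbols default to 0, so max_anon_folio_order() below always returns 0.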

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/Kconfig  | 39 +++++++++++++++++++++++++++++++++++++++
 mm/memory.c |  8 ++++++++
 2 files changed, 47 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 7672a22647b4..f4ba48c37b75 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1208,4 +1208,43 @@ config PER_VMA_LOCK
 
 source "mm/damon/Kconfig"
 
+config ARCH_SUPPORTS_LARGE_ANON_FOLIO
+	def_bool n
+	help
+	  An arch should select this symbol if it wants to allow LARGE_ANON_FOLIO
+	  to be enabled. It must also set the following integer values:
+	  - ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
+	  - ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
+
+config ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
+	int
+	help
+	  The maximum size of folio to allocate for an anonymous VMA PTE-mapping
+	  that does not have the MADV_HUGEPAGE hint set.
+
+config ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
+	int
+	help
+	  The maximum size of folio to allocate for an anonymous VMA PTE-mapping
+	  that has the MADV_HUGEPAGE hint set.
+
+config LARGE_ANON_FOLIO
+	bool "Allocate large folios for anonymous memory"
+	depends on ARCH_SUPPORTS_LARGE_ANON_FOLIO
+	default n
+	help
+	  Use large (bigger than order-0) folios to back anonymous memory where
+	  possible. This reduces the number of page faults, as well as other
+	  per-page overheads to improve performance for many workloads.
+
+config LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
+	int
+	default 0 if !LARGE_ANON_FOLIO
+	default ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
+
+config LARGE_ANON_FOLIO_THP_ORDER_MAX
+	int
+	default 0 if !LARGE_ANON_FOLIO
+	default ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
+
 endmenu
diff --git a/mm/memory.c b/mm/memory.c
index 9165ed1b9fc2..a8f7e2b28d7a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3153,6 +3153,14 @@ static struct folio *try_vma_alloc_movable_folio(struct vm_area_struct *vma,
 	return vma_alloc_movable_folio(vma, vaddr, 0, zeroed);
 }
 
+static inline int max_anon_folio_order(struct vm_area_struct *vma)
+{
+	if (hugepage_vma_check(vma, vma->vm_flags, false, true, true))
+		return CONFIG_LARGE_ANON_FOLIO_THP_ORDER_MAX;
+	else
+		return CONFIG_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX;
+}
+
 /*
  * Handle write page faults for pages that can be reused in the current vma
  *
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v1 09/10] arm64: mm: Declare support for large anonymous folios
  2023-06-26 17:14 ` Ryan Roberts
@ 2023-06-26 17:14   ` Ryan Roberts
  -1 siblings, 0 replies; 148+ messages in thread
From: Ryan Roberts @ 2023-06-26 17:14 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Yu Zhao,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin
  Cc: Ryan Roberts, linux-kernel, linux-mm, linux-alpha,
	linux-arm-kernel, linux-ia64, linux-m68k, linux-s390

For the unhinted case, when THP is not permitted for the vma, don't
allow anything bigger than 64K. This means we don't waste too much
memory. Additionally, for 4K pages this is the contpte size, and for
16K, this is (usually) the HPA size when the uarch feature is
implemented. For the hinted case, when THP is permitted for the vma,
allow the contpte size for all page size configurations; 64K for 4K, 2M
for 16K and 2M for 64K.
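
As a worked example of the resulting maximum folio sizes (folio size =
2^order * PAGE_SIZE):

  4K  base pages: unhinted 2^4 * 4K  = 64K;  hinted 2^4 * 4K  = 64K
  16K base pages: unhinted 2^2 * 16K = 64K;  hinted 2^7 * 16K = 2M
  64K base pages: unhinted 2^0 * 64K = 64K;  hinted 2^5 * 64K = 2M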

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/Kconfig | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 343e1e1cae10..0e91b5bc8cd9 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -243,6 +243,7 @@ config ARM64
 	select TRACE_IRQFLAGS_SUPPORT
 	select TRACE_IRQFLAGS_NMI_SUPPORT
 	select HAVE_SOFTIRQ_ON_OWN_STACK
+	select ARCH_SUPPORTS_LARGE_ANON_FOLIO
 	help
 	  ARM 64-bit (AArch64) Linux support.
 
@@ -281,6 +282,18 @@ config ARM64_CONT_PMD_SHIFT
 	default 5 if ARM64_16K_PAGES
 	default 4
 
+config ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
+	int
+	default 0 if ARM64_64K_PAGES	# 64K (1 page)
+	default 2 if ARM64_16K_PAGES	# 64K (4 pages; benefits from HPA where HW supports it)
+	default 4 if ARM64_4K_PAGES	# 64K (16 pages; eligible for contpte-mapping)
+
+config ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
+	int
+	default 5 if ARM64_64K_PAGES	# 2M  (32 pages; eligible for contpte-mapping)
+	default 7 if ARM64_16K_PAGES	# 2M  (128 pages; eligible for contpte-mapping)
+	default 4 if ARM64_4K_PAGES	# 64K (16 pages; eligible for contpte-mapping)
+
 config ARCH_MMAP_RND_BITS_MIN
 	default 14 if ARM64_64K_PAGES
 	default 16 if ARM64_16K_PAGES
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v1 10/10] mm: Allocate large folios for anonymous memory
  2023-06-26 17:14 ` Ryan Roberts
@ 2023-06-26 17:14   ` Ryan Roberts
  -1 siblings, 0 replies; 148+ messages in thread
From: Ryan Roberts @ 2023-06-26 17:14 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Yu Zhao,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin
  Cc: Ryan Roberts, linux-kernel, linux-mm, linux-alpha,
	linux-arm-kernel, linux-ia64, linux-m68k, linux-s390

With all of the enabler patches in place, modify the anonymous memory
write allocation path so that it opportunistically attempts to allocate
a large folio up to `max_anon_folio_order()` size (this value is
ultimately configured by the architecture). This reduces the number of
page faults, reduces the size of (e.g. LRU) lists, and generally
improves performance by batching what were per-page operations into
per-(large)-folio operations.

If CONFIG_LARGE_ANON_FOLIO is not enabled (the default) then
`max_anon_folio_order()` always returns 0, meaning we get the existing
allocation behaviour.
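
As a concrete example (assuming an arch maximum order of 4 with 4K base
pages, as arm64 sets earlier in this series): the first write to an
unpopulated, naturally aligned 64K region is now serviced by a single
fault that allocates one order-4 folio and populates 16 PTEs with
set_ptes(), rather than by 16 separate order-0 faults.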

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/memory.c | 159 +++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 144 insertions(+), 15 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index a8f7e2b28d7a..d23c44cc5092 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3161,6 +3161,90 @@ static inline int max_anon_folio_order(struct vm_area_struct *vma)
 		return CONFIG_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX;
 }
 
+/*
+ * Returns index of first pte that is not none, or nr if all are none.
+ */
+static inline int check_ptes_none(pte_t *pte, int nr)
+{
+	int i;
+
+	for (i = 0; i < nr; i++) {
+		if (!pte_none(ptep_get(pte++)))
+			return i;
+	}
+
+	return nr;
+}
+
+static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
+{
+	/*
+	 * The aim here is to determine what size of folio we should allocate
+	 * for this fault. Factors include:
+	 * - Order must not be higher than `order` upon entry
+	 * - Folio must be naturally aligned within VA space
+	 * - Folio must not breach boundaries of vma
+	 * - Folio must be fully contained inside one pmd entry
+	 * - Folio must not overlap any non-none ptes
+	 *
+	 * Additionally, we do not allow order-1 since this breaks assumptions
+	 * elsewhere in the mm; THP pages must be at least order-2 (since they
+	 * store state up to the 3rd struct page subpage), and these pages must
+	 * be THP in order to correctly use pre-existing THP infrastructure such
+	 * as folio_split().
+	 *
+	 * As a consequence of relying on the THP infrastructure, if the system
+	 * does not support THP, we always fall back to order-0.
+	 *
+	 * Note that the caller may or may not choose to lock the pte. If
+	 * unlocked, the calculation should be considered an estimate that will
+	 * need to be validated under the lock.
+	 */
+
+	struct vm_area_struct *vma = vmf->vma;
+	int nr;
+	unsigned long addr;
+	pte_t *pte;
+	pte_t *first_set = NULL;
+	int ret;
+
+	if (has_transparent_hugepage()) {
+		order = min(order, PMD_SHIFT - PAGE_SHIFT);
+
+		for (; order > 1; order--) {
+			nr = 1 << order;
+			addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
+			pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
+
+			/* Check vma bounds. */
+			if (addr < vma->vm_start ||
+			    addr + (nr << PAGE_SHIFT) > vma->vm_end)
+				continue;
+
+			/* Ptes covered by order already known to be none. */
+			if (pte + nr <= first_set)
+				break;
+
+			/* Already found set pte in range covered by order. */
+			if (pte <= first_set)
+				continue;
+
+			/* Need to check if all the ptes are none. */
+			ret = check_ptes_none(pte, nr);
+			if (ret == nr)
+				break;
+
+			first_set = pte + ret;
+		}
+
+		if (order == 1)
+			order = 0;
+	} else
+		order = 0;
+
+	return order;
+}
+
 /*
  * Handle write page faults for pages that can be reused in the current vma
  *
@@ -4201,6 +4285,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	struct folio *folio;
 	vm_fault_t ret = 0;
 	pte_t entry;
+	unsigned long addr;
+	int order = uffd_wp ? 0 : max_anon_folio_order(vma);
+	int pgcount = BIT(order);
 
 	/* File mapping without ->vm_ops ? */
 	if (vma->vm_flags & VM_SHARED)
@@ -4242,24 +4329,44 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
 			return handle_userfault(vmf, VM_UFFD_MISSING);
 		}
-		goto setpte;
+		if (uffd_wp)
+			entry = pte_mkuffd_wp(entry);
+		set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+
+		/* No need to invalidate - it was non-present before */
+		update_mmu_cache(vma, vmf->address, vmf->pte);
+		goto unlock;
 	}
 
-	/* Allocate our own private page. */
+retry:
+	/*
+	 * Estimate the folio order to allocate. We are not under the ptl here
+	 * so this estimate needs to be re-checked later once we have the lock.
+	 */
+	vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+	order = calc_anon_folio_order_alloc(vmf, order);
+	pte_unmap(vmf->pte);
+
+	/* Allocate our own private folio. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address, 0, 0);
+	folio = try_vma_alloc_movable_folio(vma, vmf->address, order, true);
 	if (!folio)
 		goto oom;
 
+	/* We may have been granted less than we asked for. */
+	order = folio_order(folio);
+	pgcount = BIT(order);
+	addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
+
 	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
 		goto oom_free_page;
 	folio_throttle_swaprate(folio, GFP_KERNEL);
 
 	/*
 	 * The memory barrier inside __folio_mark_uptodate makes sure that
-	 * preceding stores to the page contents become visible before
-	 * the set_pte_at() write.
+	 * preceding stores to the folio contents become visible before
+	 * the set_ptes() write.
 	 */
 	__folio_mark_uptodate(folio);
 
@@ -4268,11 +4375,31 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	if (vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry));
 
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
-	if (vmf_pte_changed(vmf)) {
-		update_mmu_tlb(vma, vmf->address, vmf->pte);
-		goto release;
+	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
+
+	/*
+	 * Ensure our estimate above is still correct; we could have raced with
+	 * another thread to service a fault in the region.
+	 */
+	if (order == 0) {
+		if (vmf_pte_changed(vmf)) {
+			update_mmu_tlb(vma, vmf->address, vmf->pte);
+			goto release;
+		}
+	} else if (check_ptes_none(vmf->pte, pgcount) != pgcount) {
+		pte_t *pte = vmf->pte + ((vmf->address - addr) >> PAGE_SHIFT);
+
+		/* If faulting pte was allocated by another, exit early. */
+		if (!pte_none(ptep_get(pte))) {
+			update_mmu_tlb(vma, vmf->address, pte);
+			goto release;
+		}
+
+		/* Else try again, with a lower order. */
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		folio_put(folio);
+		order--;
+		goto retry;
 	}
 
 	ret = check_stable_address_space(vma->vm_mm);
@@ -4286,16 +4413,18 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		return handle_userfault(vmf, VM_UFFD_MISSING);
 	}
 
-	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
-	folio_add_new_anon_rmap(folio, vma, vmf->address);
+	folio_ref_add(folio, pgcount - 1);
+
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
+	folio_add_new_anon_rmap_range(folio, &folio->page, pgcount, vma, addr);
 	folio_add_lru_vma(folio, vma);
-setpte:
+
 	if (uffd_wp)
 		entry = pte_mkuffd_wp(entry);
-	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+	set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
 
 	/* No need to invalidate - it was non-present before */
-	update_mmu_cache(vma, vmf->address, vmf->pte);
+	update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
 unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	return ret;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 01/10] mm: Expose clear_huge_page() unconditionally
  2023-06-26 17:14   ` Ryan Roberts
  (?)
@ 2023-06-27  1:55     ` Yu Zhao
  -1 siblings, 0 replies; 148+ messages in thread
From: Yu Zhao @ 2023-06-27  1:55 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> In preparation for extending vma_alloc_zeroed_movable_folio() to
> allocate an arbitrary order folio, expose clear_huge_page()
> unconditionally, so that it can be used to zero the allocated folio in
> the generic implementation of vma_alloc_zeroed_movable_folio().
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  include/linux/mm.h | 3 ++-
>  mm/memory.c        | 2 +-
>  2 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7f1741bd870a..7e3bf45e6491 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3684,10 +3684,11 @@ enum mf_action_page_type {
>   */
>  extern const struct attribute_group memory_failure_attr_group;
>
> -#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
>  extern void clear_huge_page(struct page *page,
>                             unsigned long addr_hint,
>                             unsigned int pages_per_huge_page);
> +
> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)

We might not want to depend on THP eventually. Right now, we still
have to, unless splitting is optional, which seems to contradict
06/10. (deferred_split_folio()  is a nop without THP.)

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 02/10] mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
  2023-06-26 17:14   ` Ryan Roberts
  (?)
@ 2023-06-27  2:27     ` Yu Zhao
  -1 siblings, 0 replies; 148+ messages in thread
From: Yu Zhao @ 2023-06-27  2:27 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Allow allocation of large folios with vma_alloc_zeroed_movable_folio().
> This prepares the ground for large anonymous folios. The generic
> implementation of vma_alloc_zeroed_movable_folio() now uses
> clear_huge_page() to zero the allocated folio since it may now be a
> non-0 order.
>
> Currently the function is always called with order 0 and no extra gfp
> flags, so no functional change intended. But a subsequent commit will
> take advantage of the new parameters to allocate large folios. The extra
> gfp flags will be used to control the reclaim policy.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  arch/alpha/include/asm/page.h   |  5 +++--
>  arch/arm64/include/asm/page.h   |  3 ++-
>  arch/arm64/mm/fault.c           |  7 ++++---
>  arch/ia64/include/asm/page.h    |  5 +++--
>  arch/m68k/include/asm/page_no.h |  7 ++++---
>  arch/s390/include/asm/page.h    |  5 +++--
>  arch/x86/include/asm/page.h     |  5 +++--
>  include/linux/highmem.h         | 23 +++++++++++++----------
>  mm/memory.c                     |  5 +++--
>  9 files changed, 38 insertions(+), 27 deletions(-)
>
> diff --git a/arch/alpha/include/asm/page.h b/arch/alpha/include/asm/page.h
> index 4db1ebc0ed99..6fc7fe91b6cb 100644
> --- a/arch/alpha/include/asm/page.h
> +++ b/arch/alpha/include/asm/page.h
> @@ -17,8 +17,9 @@
>  extern void clear_page(void *page);
>  #define clear_user_page(page, vaddr, pg)       clear_page(page)
>
> -#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
> -       vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
> +#define vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order) \
> +       vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO | (gfp), \
> +                       order, vma, vaddr, false)

I don't think we need to worry about gfp if we want to make a minimum
series. There would be many discussion points around it, e.g., I
already disagree with what you chose: GFP_TRANSHUGE_LIGHT would be
more suitable than __GFP_NORETRY, and there are even better options
than GFP_TRANSHUGE_LIGHT.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 02/10] mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
@ 2023-06-27  2:27     ` Yu Zhao
  0 siblings, 0 replies; 148+ messages in thread
From: Yu Zhao @ 2023-06-27  2:27 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k

On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Allow allocation of large folios with vma_alloc_zeroed_movable_folio().
> This prepares the ground for large anonymous folios. The generic
> implementation of vma_alloc_zeroed_movable_folio() now uses
> clear_huge_page() to zero the allocated folio since it may now be a
> non-0 order.
>
> Currently the function is always called with order 0 and no extra gfp
> flags, so no functional change intended. But a subsequent commit will
> take advantage of the new parameters to allocate large folios. The extra
> gfp flags will be used to control the reclaim policy.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  arch/alpha/include/asm/page.h   |  5 +++--
>  arch/arm64/include/asm/page.h   |  3 ++-
>  arch/arm64/mm/fault.c           |  7 ++++---
>  arch/ia64/include/asm/page.h    |  5 +++--
>  arch/m68k/include/asm/page_no.h |  7 ++++---
>  arch/s390/include/asm/page.h    |  5 +++--
>  arch/x86/include/asm/page.h     |  5 +++--
>  include/linux/highmem.h         | 23 +++++++++++++----------
>  mm/memory.c                     |  5 +++--
>  9 files changed, 38 insertions(+), 27 deletions(-)
>
> diff --git a/arch/alpha/include/asm/page.h b/arch/alpha/include/asm/page.h
> index 4db1ebc0ed99..6fc7fe91b6cb 100644
> --- a/arch/alpha/include/asm/page.h
> +++ b/arch/alpha/include/asm/page.h
> @@ -17,8 +17,9 @@
>  extern void clear_page(void *page);
>  #define clear_user_page(page, vaddr, pg)       clear_page(page)
>
> -#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
> -       vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
> +#define vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order) \
> +       vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO | (gfp), \
> +                       order, vma, vaddr, false)

I don't think we need to worry about gfp if we want to make a minimum
series. There would be many discussion points around it, e.g., I
already disagree with what you chose: GFP_TRANSHUGE_LIGHT would be
more suitable than __GFP_NORETRY, and there are even better options
than GFP_TRANSHUGE_LIGHT.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 03/10] mm: Introduce try_vma_alloc_movable_folio()
  2023-06-26 17:14   ` Ryan Roberts
  (?)
@ 2023-06-27  2:34     ` Yu Zhao
  -1 siblings, 0 replies; 148+ messages in thread
From: Yu Zhao @ 2023-06-27  2:34 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Opportunistically attempt to allocate high-order folios in highmem,
> optionally zeroed. Retry with lower orders all the way to order-0, until
> success. Note that order-1 allocations are skipped, since a
> large folio must be at least order-2 to work with the THP machinery. The
> user must check what they got with folio_order().
>
> This will be used to opportunistically allocate large folios for
> anonymous memory with a sensible fallback under memory pressure.
>
> For attempts to allocate non-0 orders, we set __GFP_NORETRY to prevent
> high latency due to reclaim, instead preferring to just try for a lower
> order. The same approach is used by the readahead code when allocating
> large folios.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  mm/memory.c | 33 +++++++++++++++++++++++++++++++++
>  1 file changed, 33 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 367bbbb29d91..53896d46e686 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3001,6 +3001,39 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
>         return 0;
>  }
>
> +static inline struct folio *vma_alloc_movable_folio(struct vm_area_struct *vma,
> +                               unsigned long vaddr, int order, bool zeroed)
> +{
> +       gfp_t gfp = order > 0 ? __GFP_NORETRY | __GFP_NOWARN : 0;
> +
> +       if (zeroed)
> +               return vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order);
> +       else
> +               return vma_alloc_folio(GFP_HIGHUSER_MOVABLE | gfp, order, vma,
> +                                                               vaddr, false);
> +}
> +
> +/*
> + * Opportunistically attempt to allocate high-order folios, retrying with lower
> + * orders all the way to order-0, until success. order-1 allocations are skipped
> + * since a folio must be at least order-2 to work with the THP machinery. The
> + * user must check what they got with folio_order(). vaddr can be any virtual
> + * address that will be mapped by the allocated folio.
> + */
> +static struct folio *try_vma_alloc_movable_folio(struct vm_area_struct *vma,
> +                               unsigned long vaddr, int order, bool zeroed)
> +{
> +       struct folio *folio;
> +
> +       for (; order > 1; order--) {
> +               folio = vma_alloc_movable_folio(vma, vaddr, order, zeroed);
> +               if (folio)
> +                       return folio;
> +       }
> +
> +       return vma_alloc_movable_folio(vma, vaddr, 0, zeroed);
> +}

I'd drop this patch. Instead, in do_anonymous_page():

  if (IS_ENABLED(CONFIG_ARCH_WANTS_PTE_ORDER))
    folio = vma_alloc_zeroed_movable_folio(vma, addr,
                                           CONFIG_ARCH_WANTS_PTE_ORDER);

  if (!folio)
    folio = vma_alloc_zeroed_movable_folio(vma, addr, 0);
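
Spelling that out a little, a minimal sketch of the suggested shape inside
do_anonymous_page() might look as follows. This is illustrative only:
ARCH_WANTS_PTE_ORDER is not a symbol in this series, and the three-argument
vma_alloc_zeroed_movable_folio() is the hypothetical order-only variant
implied above, not the interface added by patch 02/10. An #ifdef is used
rather than IS_ENABLED() on the assumption that the symbol would carry an
integer order value.

	struct folio *folio = NULL;

#ifdef CONFIG_ARCH_WANTS_PTE_ORDER
	/* Try the arch-preferred order first; fall back silently on failure. */
	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address,
					       CONFIG_ARCH_WANTS_PTE_ORDER);
#endif
	/* Fall back to a single zeroed page, as the fault path does today. */
	if (!folio)
		folio = vma_alloc_zeroed_movable_folio(vma, vmf->address, 0);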

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 08/10] mm: Kconfig hooks to determine max anon folio allocation order
  2023-06-26 17:14   ` Ryan Roberts
  (?)
@ 2023-06-27  2:47     ` Yu Zhao
  -1 siblings, 0 replies; 148+ messages in thread
From: Yu Zhao @ 2023-06-27  2:47 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On Mon, Jun 26, 2023 at 11:15 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> For variable-order anonymous folios, we need to determine the order that
> we will allocate. From a SW perspective, the higher the order we
> allocate, the less overhead we will have; fewer faults, fewer folios in
> lists, etc. But of course there will also be more memory wastage as the
> order increases.
>
> From a HW perspective, there are memory block sizes that can be
> beneficial to reducing TLB pressure. arm64, for example, has the ability
> to map "contpte" sized chunks (64K for a 4K base page, 2M for 16K and
> 64K base pages) such that one of these chunks only uses a single TLB
> entry.
>
> So we let the architecture specify the order of the maximally beneficial
> mapping unit when PTE-mapped. Furthermore, because in some cases, this
> order may be quite big (and therefore potentially wasteful of memory),
> allow the arch to specify 2 values: one is the max order for a mapping
> that _would not_ use THP if all size and alignment constraints were met,
> and the other is the max order for a mapping that _would_ use THP if all
> those constraints were met.
>
> Implement this with Kconfig by introducing some new options to allow the
> architecture to declare that it supports large anonymous folios along
> with these 2 preferred max order values. Then introduce a user-facing
> option, LARGE_ANON_FOLIO, which defaults to disabled and can only be
> enabled if the architecture has declared its support. When disabled, it
> forces the max order values, LARGE_ANON_FOLIO_NOTHP_ORDER_MAX and
> LARGE_ANON_FOLIO_THP_ORDER_MAX to 0, meaning only a single page is ever
> allocated.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  mm/Kconfig  | 39 +++++++++++++++++++++++++++++++++++++++
>  mm/memory.c |  8 ++++++++
>  2 files changed, 47 insertions(+)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 7672a22647b4..f4ba48c37b75 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1208,4 +1208,43 @@ config PER_VMA_LOCK
>
>  source "mm/damon/Kconfig"
>
> +config ARCH_SUPPORTS_LARGE_ANON_FOLIO
> +       def_bool n
> +       help
> +         An arch should select this symbol if it wants to allow LARGE_ANON_FOLIO
> +         to be enabled. It must also set the following integer values:
> +         - ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
> +         - ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
> +
> +config ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
> +       int
> +       help
> +         The maximum size of folio to allocate for an anonymous VMA PTE-mapping
> +         that does not have the MADV_HUGEPAGE hint set.
> +
> +config ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
> +       int
> +       help
> +         The maximum size of folio to allocate for an anonymous VMA PTE-mapping
> +         that has the MADV_HUGEPAGE hint set.
> +
> +config LARGE_ANON_FOLIO
> +       bool "Allocate large folios for anonymous memory"
> +       depends on ARCH_SUPPORTS_LARGE_ANON_FOLIO
> +       default n
> +       help
> +         Use large (bigger than order-0) folios to back anonymous memory where
> +         possible. This reduces the number of page faults, as well as other
> +         per-page overheads to improve performance for many workloads.
> +
> +config LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
> +       int
> +       default 0 if !LARGE_ANON_FOLIO
> +       default ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
> +
> +config LARGE_ANON_FOLIO_THP_ORDER_MAX
> +       int
> +       default 0 if !LARGE_ANON_FOLIO
> +       default ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
> +
>  endmenu

I don't think an MVP should add this many Kconfigs. One Kconfig sounds
reasonable to me for now.
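
As a concrete illustration of the single-option shape, the per-order
Kconfig lookups could collapse into one helper along these lines. The
symbol name is hypothetical and this is not code from the series; it
simply assumes the arch provides one integer order and everything else
defaults to order 0.

/* The vma argument is kept only to mirror the helper used later in the series. */
static inline int max_anon_folio_order(struct vm_area_struct *vma)
{
#ifdef CONFIG_ARCH_WANTS_PTE_ORDER
	return CONFIG_ARCH_WANTS_PTE_ORDER;
#else
	return 0;
#endif
}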

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 09/10] arm64: mm: Declare support for large anonymous folios
  2023-06-26 17:14   ` Ryan Roberts
  (?)
@ 2023-06-27  2:53     ` Yu Zhao
  -1 siblings, 0 replies; 148+ messages in thread
From: Yu Zhao @ 2023-06-27  2:53 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On Mon, Jun 26, 2023 at 11:15 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> For the unhinted case, when THP is not permitted for the vma, don't
> allow anything bigger than 64K. This means we don't waste too much
> memory. Additionally, for 4K pages this is the contpte size, and for
> 16K, this is (usually) the HPA size when the uarch feature is
> implemented. For the hinted case, when THP is permitted for the vma,
> allow the contpte size for all page size configurations; 64K for 4K, 2M
> for 16K and 2M for 64K.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  arch/arm64/Kconfig | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 343e1e1cae10..0e91b5bc8cd9 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -243,6 +243,7 @@ config ARM64
>         select TRACE_IRQFLAGS_SUPPORT
>         select TRACE_IRQFLAGS_NMI_SUPPORT
>         select HAVE_SOFTIRQ_ON_OWN_STACK
> +       select ARCH_SUPPORTS_LARGE_ANON_FOLIO
>         help
>           ARM 64-bit (AArch64) Linux support.
>
> @@ -281,6 +282,18 @@ config ARM64_CONT_PMD_SHIFT
>         default 5 if ARM64_16K_PAGES
>         default 4
>
> +config ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
> +       int
> +       default 0 if ARM64_64K_PAGES    # 64K (1 page)
> +       default 2 if ARM64_16K_PAGES    # 64K (4 pages; benefits from HPA where HW supports it)
> +       default 4 if ARM64_4K_PAGES     # 64K (16 pages; eligible for contpte-mapping)
> +
> +config ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
> +       int
> +       default 5 if ARM64_64K_PAGES    # 2M  (32 pages; eligible for contpte-mapping)
> +       default 7 if ARM64_16K_PAGES    # 2M  (128 pages; eligible for contpte-mapping)
> +       default 4 if ARM64_4K_PAGES     # 64K (16 pages; eligible for contpte-mapping)
> +
>  config ARCH_MMAP_RND_BITS_MIN
>         default 14 if ARM64_64K_PAGES
>         default 16 if ARM64_16K_PAGES

Can we please just add one Kconfig for the large anon folio feature,
i.e., ARCH_WANTS_PTE_ORDER, for now?

Feel free to add as many as you wish for arm specific features like
HPA and contpte.
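
For anyone cross-checking the comments in the hunk above, each order is
just log2(target size / base page size). A standalone sanity check (the
SZ_* constants mirror include/linux/sizes.h) would be:

#define SZ_4K	0x00001000UL
#define SZ_16K	0x00004000UL
#define SZ_64K	0x00010000UL
#define SZ_2M	0x00200000UL

_Static_assert(SZ_64K / SZ_4K  == 1UL << 4, "64K on 4K pages: order 4, 16 pages");
_Static_assert(SZ_64K / SZ_16K == 1UL << 2, "64K on 16K pages: order 2, 4 pages");
_Static_assert(SZ_2M  / SZ_16K == 1UL << 7, "2M on 16K pages: order 7, 128 pages");
_Static_assert(SZ_2M  / SZ_64K == 1UL << 5, "2M on 64K pages: order 5, 32 pages");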

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 06/10] mm: Allow deferred splitting of arbitrary large anon folios
  2023-06-26 17:14   ` Ryan Roberts
  (?)
@ 2023-06-27  2:54     ` Yu Zhao
  -1 siblings, 0 replies; 148+ messages in thread
From: Yu Zhao @ 2023-06-27  2:54 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On Mon, Jun 26, 2023 at 11:15 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> With the introduction of large folios for anonymous memory, we would
> like to be able to split them when they have unmapped subpages, in order
> to free those unused pages under memory pressure. So remove the
> artificial requirement that the large folio needed to be at least
> PMD-sized.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>

Reviewed-by: Yu Zhao <yuzhao@google.com>

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 10/10] mm: Allocate large folios for anonymous memory
  2023-06-26 17:14   ` Ryan Roberts
  (?)
@ 2023-06-27  3:01     ` Yu Zhao
  -1 siblings, 0 replies; 148+ messages in thread
From: Yu Zhao @ 2023-06-27  3:01 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On Mon, Jun 26, 2023 at 11:15 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> With all of the enabler patches in place, modify the anonymous memory
> write allocation path so that it opportunistically attempts to allocate
> a large folio up to `max_anon_folio_order()` size (this value is
> ultimately configured by the architecture). This reduces the number of
> page faults, reduces the size of (e.g. LRU) lists, and generally
> improves performance by batching what were per-page operations into
> per-(large)-folio operations.
>
> If CONFIG_LARGE_ANON_FOLIO is not enabled (the default) then
> `max_anon_folio_order()` always returns 0, meaning we get the existing
> allocation behaviour.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  mm/memory.c | 159 +++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 144 insertions(+), 15 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index a8f7e2b28d7a..d23c44cc5092 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3161,6 +3161,90 @@ static inline int max_anon_folio_order(struct vm_area_struct *vma)
>                 return CONFIG_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX;
>  }
>
> +/*
> + * Returns index of first pte that is not none, or nr if all are none.
> + */
> +static inline int check_ptes_none(pte_t *pte, int nr)
> +{
> +       int i;
> +
> +       for (i = 0; i < nr; i++) {
> +               if (!pte_none(ptep_get(pte++)))
> +                       return i;
> +       }
> +
> +       return nr;
> +}
> +
> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)

As suggested previously in 03/10, we can leave this for later.
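
To make the link to 03/10 concrete: check_ptes_none() is the building
block for choosing an order at fault time. The function below is a
hypothetical illustration of that idea, not the calc_anon_folio_order_alloc()
body (which is trimmed from the quote). It assumes vmf->pte is mapped and
points at the faulting PTE, and that the starting order is small enough
that every candidate range fits within a single PTE table.

static int anon_folio_order_sketch(struct vm_fault *vmf, int order)
{
	unsigned long addr;
	pte_t *pte;
	int nr;

	for (; order > 0; order--) {
		nr = 1 << order;
		addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);

		/* The naturally-aligned range must lie inside the VMA. */
		if (addr < vmf->vma->vm_start ||
		    addr + nr * PAGE_SIZE > vmf->vma->vm_end)
			continue;

		/* Use this order only if every PTE in the range is none. */
		pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
		if (check_ptes_none(pte, nr) == nr)
			break;
	}

	return order;
}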

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 07/10] mm: Batch-zap large anonymous folio PTE mappings
  2023-06-26 17:14   ` Ryan Roberts
  (?)
@ 2023-06-27  3:04     ` Yu Zhao
  -1 siblings, 0 replies; 148+ messages in thread
From: Yu Zhao @ 2023-06-27  3:04 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On Mon, Jun 26, 2023 at 11:15 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> This allows batching the rmap removal with folio_remove_rmap_range(),
> which means we avoid spuriously adding a partially unmapped folio to the
> deferred split queue in the common case, which reduces split queue lock
> contention.
>
> Previously each page was removed from the rmap individually with
> page_remove_rmap(). If the first page belonged to a large folio, this
> would cause page_remove_rmap() to conclude that the folio was now
> partially mapped and add the folio to the deferred split queue. But
> subsequent calls would cause the folio to become fully unmapped, meaning
> there is no value to adding it to the split queue.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  mm/memory.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 119 insertions(+)

We don't really need this patch for the series to work. So again, I'd
split it out.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 05/10] mm: Implement folio_remove_rmap_range()
  2023-06-26 17:14   ` Ryan Roberts
  (?)
@ 2023-06-27  3:06     ` Yu Zhao
  -1 siblings, 0 replies; 148+ messages in thread
From: Yu Zhao @ 2023-06-27  3:06 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Like page_remove_rmap() but batch-removes the rmap for a range of pages
> belonging to a folio, for efficiency savings. All pages are accounted as
> small pages.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  include/linux/rmap.h |  2 ++
>  mm/rmap.c            | 62 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 64 insertions(+)

Sorry for nagging: this can be included in a followup series.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 00/10] variable-order, large folios for anonymous memory
  2023-06-26 17:14 ` Ryan Roberts
  (?)
@ 2023-06-27  3:30   ` Yu Zhao
  -1 siblings, 0 replies; 148+ messages in thread
From: Yu Zhao @ 2023-06-27  3:30 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi All,
>
> Following on from the previous RFCv2 [1], this series implements variable order,
> large folios for anonymous memory. The objective of this is to improve
> performance by allocating larger chunks of memory during anonymous page faults:
>
>  - Since SW (the kernel) is dealing with larger chunks of memory than base
>    pages, there are efficiency savings to be had; fewer page faults, batched PTE
>    and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
>    overhead. This should benefit all architectures.
>  - Since we are now mapping physically contiguous chunks of memory, we can take
>    advantage of HW TLB compression techniques. A reduction in TLB pressure
>    speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>    TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>
> This patch set deals with the SW side of things only and based on feedback from
> the RFC, aims to be the most minimal initial change, upon which future
> incremental changes can be added. For this reason, the new behaviour is hidden
> behind a new Kconfig switch, CONFIG_LARGE_ANON_FOLIO, which is disabled by
> default. Although the code has been refactored to parameterize the desired order
> of the allocation, when the feature is disabled (by forcing the order to be
> always 0) my performance tests measure no regression. So I'm hoping this will be
> a suitable mechanism to allow incremental submissions to the kernel without
> affecting the rest of the world.
>
> The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
> [2], which is a hard dependency. I'm not sure of Matthew's exact plans for
> getting that series into the kernel, but I'm hoping we can start the review
> process on this patch set independently. I have a branch at [3].
>
> I've posted a separate series concerning the HW part (contpte mapping) for arm64
> at [4].
>
>
> Performance
> -----------
>
> Below results show 2 benchmarks; kernel compilation and speedometer 2.0 (a
> javascript benchmark running in Chromium). Both cases are running on Ampere
> Altra with 1 NUMA node enabled, Ubuntu 22.04 and XFS filesystem. Each benchmark
> is repeated 15 times over 5 reboots and averaged.
>
> All improvements are relative to baseline-4k. 'anonfolio-basic' is this series.
> 'anonfolio' is the full patch set similar to the RFC with the additional changes
> to the extra 3 fault paths. The rest of the configs are described at [4].
>
> Kernel Compilation (smaller is better):
>
> | kernel          |   real-time |   kern-time |   user-time |
> |:----------------|------------:|------------:|------------:|
> | baseline-4k     |        0.0% |        0.0% |        0.0% |
> | anonfolio-basic |       -5.3% |      -42.9% |       -0.6% |
> | anonfolio       |       -5.4% |      -46.0% |       -0.3% |
> | contpte         |       -6.8% |      -45.7% |       -2.1% |
> | exefolio        |       -8.4% |      -46.4% |       -3.7% |
> | baseline-16k    |       -8.7% |      -49.2% |       -3.7% |
> | baseline-64k    |      -10.5% |      -66.0% |       -3.5% |
>
> Speedometer 2.0 (bigger is better):
>
> | kernel          |   runs_per_min |
> |:----------------|---------------:|
> | baseline-4k     |           0.0% |
> | anonfolio-basic |           0.7% |
> | anonfolio       |           1.2% |
> | contpte         |           3.1% |
> | exefolio        |           4.2% |
> | baseline-16k    |           5.3% |

Thanks for pushing this forward!

> Changes since RFCv2
> -------------------
>
>   - Simplified series to bare minimum (on David Hildenbrand's advice)

My impression is that this series still includes many pieces that can
be split out and discussed separately with followup series.

(I skipped 04/10 and will look at it tomorrow.)

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 03/10] mm: Introduce try_vma_alloc_movable_folio()
  2023-06-27  2:34     ` Yu Zhao
  (?)
@ 2023-06-27  5:29       ` Yu Zhao
  -1 siblings, 0 replies; 148+ messages in thread
From: Yu Zhao @ 2023-06-27  5:29 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On Mon, Jun 26, 2023 at 8:34 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >
> > Opportunistically attempt to allocate high-order folios in highmem,
> > optionally zeroed. Retry with lower orders all the way to order-0, until
> > success. Although, of note, order-1 allocations are skipped since a
> > large folio must be at least order-2 to work with the THP machinery. The
> > user must check what they got with folio_order().
> >
> > This will be used to opportunistically allocate large folios for
> > anonymous memory with a sensible fallback under memory pressure.
> >
> > For attempts to allocate non-0 orders, we set __GFP_NORETRY to prevent
> > high latency due to reclaim, instead preferring to just try for a lower
> > order. The same approach is used by the readahead code when allocating
> > large folios.
> >
> > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> > ---
> >  mm/memory.c | 33 +++++++++++++++++++++++++++++++++
> >  1 file changed, 33 insertions(+)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 367bbbb29d91..53896d46e686 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3001,6 +3001,39 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
> >         return 0;
> >  }
> >
> > +static inline struct folio *vma_alloc_movable_folio(struct vm_area_struct *vma,
> > +                               unsigned long vaddr, int order, bool zeroed)
> > +{
> > +       gfp_t gfp = order > 0 ? __GFP_NORETRY | __GFP_NOWARN : 0;
> > +
> > +       if (zeroed)
> > +               return vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order);
> > +       else
> > +               return vma_alloc_folio(GFP_HIGHUSER_MOVABLE | gfp, order, vma,
> > +                                                               vaddr, false);
> > +}
> > +
> > +/*
> > + * Opportunistically attempt to allocate high-order folios, retrying with lower
> > + * orders all the way to order-0, until success. order-1 allocations are skipped
> > + * since a folio must be at least order-2 to work with the THP machinery. The
> > + * user must check what they got with folio_order(). vaddr can be any virtual
> > + * address that will be mapped by the allocated folio.
> > + */
> > +static struct folio *try_vma_alloc_movable_folio(struct vm_area_struct *vma,
> > +                               unsigned long vaddr, int order, bool zeroed)
> > +{
> > +       struct folio *folio;
> > +
> > +       for (; order > 1; order--) {
> > +               folio = vma_alloc_movable_folio(vma, vaddr, order, zeroed);
> > +               if (folio)
> > +                       return folio;
> > +       }
> > +
> > +       return vma_alloc_movable_folio(vma, vaddr, 0, zeroed);
> > +}
>
> I'd drop this patch. Instead, in do_anonymous_page():
>
>   if (IS_ENABLED(CONFIG_ARCH_WANTS_PTE_ORDER))
>     folio = vma_alloc_zeroed_movable_folio(vma, addr,
> CONFIG_ARCH_WANTS_PTE_ORDER))
>
>   if (!folio)
>     folio = vma_alloc_zeroed_movable_folio(vma, addr, 0);

I meant a runtime function arch_wants_pte_order() (Its default
implementation would return 0.)
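
For illustration only, that might look something like the sketch below. The
hook name is taken from the comment above, and the order-taking
vma_alloc_zeroed_movable_folio() signature is assumed from patch 02 of this
series, so treat it as a shape rather than a final API:

 /* Illustrative default; an architecture would override this. */
 #ifndef arch_wants_pte_order
 static inline int arch_wants_pte_order(struct vm_area_struct *vma)
 {
 	return 0;	/* order-0 only, i.e. no large anon folios */
 }
 #endif

 /* ...and in do_anonymous_page(), falling back to order-0 on failure: */
 	int order = arch_wants_pte_order(vma);
 	struct folio *folio = NULL;

 	if (order)
 		folio = vma_alloc_zeroed_movable_folio(vma, vmf->address,
 				__GFP_NORETRY | __GFP_NOWARN, order);
 	if (!folio)
 		folio = vma_alloc_zeroed_movable_folio(vma, vmf->address, 0, 0);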

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 04/10] mm: Implement folio_add_new_anon_rmap_range()
  2023-06-26 17:14   ` Ryan Roberts
  (?)
@ 2023-06-27  7:08     ` Yu Zhao
  -1 siblings, 0 replies; 148+ messages in thread
From: Yu Zhao @ 2023-06-27  7:08 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Like folio_add_new_anon_rmap() but batch-rmaps a range of pages
> belonging to a folio, for efficiency savings. All pages are accounted as
> small pages.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  include/linux/rmap.h |  2 ++
>  mm/rmap.c            | 43 +++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 45 insertions(+)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index a3825ce81102..15433a3d0cbf 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -196,6 +196,8 @@ void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
>                 unsigned long address);
>  void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>                 unsigned long address);
> +void folio_add_new_anon_rmap_range(struct folio *folio, struct page *page,
> +               int nr, struct vm_area_struct *vma, unsigned long address);

We should update folio_add_new_anon_rmap() to support large() &&
!folio_test_pmd_mappable() folios instead.

I double checked all places currently using folio_add_new_anon_rmap(),
and as expected, none actually allocates large() &&
!folio_test_pmd_mappable() and maps it one by one, which makes the
cases simpler, i.e.,
  if (!large())
    // the existing basepage case
  else if (!folio_test_pmd_mappable())
    // our new case
  else
    // the existing THP case
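
Purely as a sketch of that shape (accounting and details elided; nothing
here is taken from a posted patch):

 	void folio_add_new_anon_rmap(struct folio *folio,
 				     struct vm_area_struct *vma,
 				     unsigned long address)
 	{
 		if (!folio_test_large(folio)) {
 			/* existing base-page case */
 		} else if (!folio_test_pmd_mappable(folio)) {
 			/* new case: large but not PMD-mappable */
 		} else {
 			/* existing THP case */
 		}
 	}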

>  void page_add_file_rmap(struct page *, struct vm_area_struct *,
>                 bool compound);
>  void folio_add_file_rmap_range(struct folio *, struct page *, unsigned int nr,
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 1d8369549424..4050bcea7ae7 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1305,6 +1305,49 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>         __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>  }
>
> +/**
> + * folio_add_new_anon_rmap_range - Add mapping to a set of pages within a new
> + * anonymous potentially large folio.
> + * @folio:      The folio containing the pages to be mapped
> + * @page:       First page in the folio to be mapped
> + * @nr:         Number of pages to be mapped
> + * @vma:        the vm area in which the mapping is added
> + * @address:    the user virtual address of the first page to be mapped
> + *
> + * Like folio_add_new_anon_rmap() but batch-maps a range of pages within a folio
> + * using non-THP accounting. Like folio_add_new_anon_rmap(), the inc-and-test is
> + * bypassed and the folio does not have to be locked. All pages in the folio are
> + * individually accounted.
> + *
> + * As the folio is new, it's assumed to be mapped exclusively by a single
> + * process.
> + */
> +void folio_add_new_anon_rmap_range(struct folio *folio, struct page *page,
> +               int nr, struct vm_area_struct *vma, unsigned long address)
> +{
> +       int i;
> +
> +       VM_BUG_ON_VMA(address < vma->vm_start ||
> +                     address + (nr << PAGE_SHIFT) > vma->vm_end, vma);

BTW, VM_BUG_ON* shouldn't be used in new code:
Documentation/process/coding-style.rst
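
For example (illustrative only), the range check could become a warning:

 	/* warn once instead of crashing the kernel on a bad range */
 	VM_WARN_ON_ONCE(address < vma->vm_start ||
 			address + (nr << PAGE_SHIFT) > vma->vm_end);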

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 01/10] mm: Expose clear_huge_page() unconditionally
  2023-06-27  1:55     ` Yu Zhao
  (?)
@ 2023-06-27  7:21       ` Ryan Roberts
  -1 siblings, 0 replies; 148+ messages in thread
From: Ryan Roberts @ 2023-06-27  7:21 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On 27/06/2023 02:55, Yu Zhao wrote:
> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> In preparation for extending vma_alloc_zeroed_movable_folio() to
>> allocate an arbitrary order folio, expose clear_huge_page()
>> unconditionally, so that it can be used to zero the allocated folio in
>> the generic implementation of vma_alloc_zeroed_movable_folio().
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  include/linux/mm.h | 3 ++-
>>  mm/memory.c        | 2 +-
>>  2 files changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 7f1741bd870a..7e3bf45e6491 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3684,10 +3684,11 @@ enum mf_action_page_type {
>>   */
>>  extern const struct attribute_group memory_failure_attr_group;
>>
>> -#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
>>  extern void clear_huge_page(struct page *page,
>>                             unsigned long addr_hint,
>>                             unsigned int pages_per_huge_page);
>> +
>> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
> 
> We might not want to depend on THP eventually. Right now, we still
> have to, unless splitting is optional, which seems to contradict
> 06/10. (deferred_split_folio()  is a nop without THP.)

Yes, I agree - for large anon folios to work, we depend on THP. But I don't
think that helps us here.

In the next patch, I give vma_alloc_zeroed_movable_folio() an extra `order`
parameter. So the generic/default version of the function now needs a way to
clear a compound page.

I guess I could do something like:

 static inline
 struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
				   unsigned long vaddr, gfp_t gfp, int order)
 {
 	struct folio *folio;

	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE | gfp,
					order, vma, vaddr, false);
 	if (folio) {
#ifdef CONFIG_LARGE_FOLIO
		clear_huge_page(&folio->page, vaddr, 1U << order);
#else
		BUG_ON(order != 0);
		clear_user_highpage(&folio->page, vaddr);
#endif
	}

 	return folio;
 }

But that's pretty messy, and there's nothing to stop other users coming along,
passing order != 0 and being surprised by the BUG_ON.
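
For reference, dropping the #ifdef and always going through clear_huge_page()
(which also handles the order-0 case) gives roughly the shape patches 01-02
take (exact body assumed, not copied from the patch):

 	static inline
 	struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
 				   unsigned long vaddr, gfp_t gfp, int order)
 	{
 		struct folio *folio;

 		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE | gfp,
 					order, vma, vaddr, false);
 		if (folio)
 			clear_huge_page(&folio->page, vaddr, 1U << order);

 		return folio;
 	}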

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 02/10] mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
  2023-06-27  2:27     ` Yu Zhao
  (?)
@ 2023-06-27  7:27       ` Ryan Roberts
  -1 siblings, 0 replies; 148+ messages in thread
From: Ryan Roberts @ 2023-06-27  7:27 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On 27/06/2023 03:27, Yu Zhao wrote:
> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Allow allocation of large folios with vma_alloc_zeroed_movable_folio().
>> This prepares the ground for large anonymous folios. The generic
>> implementation of vma_alloc_zeroed_movable_folio() now uses
>> clear_huge_page() to zero the allocated folio since it may now be a
>> non-0 order.
>>
>> Currently the function is always called with order 0 and no extra gfp
>> flags, so no functional change intended. But a subsequent commit will
>> take advantage of the new parameters to allocate large folios. The extra
>> gfp flags will be used to control the reclaim policy.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  arch/alpha/include/asm/page.h   |  5 +++--
>>  arch/arm64/include/asm/page.h   |  3 ++-
>>  arch/arm64/mm/fault.c           |  7 ++++---
>>  arch/ia64/include/asm/page.h    |  5 +++--
>>  arch/m68k/include/asm/page_no.h |  7 ++++---
>>  arch/s390/include/asm/page.h    |  5 +++--
>>  arch/x86/include/asm/page.h     |  5 +++--
>>  include/linux/highmem.h         | 23 +++++++++++++----------
>>  mm/memory.c                     |  5 +++--
>>  9 files changed, 38 insertions(+), 27 deletions(-)
>>
>> diff --git a/arch/alpha/include/asm/page.h b/arch/alpha/include/asm/page.h
>> index 4db1ebc0ed99..6fc7fe91b6cb 100644
>> --- a/arch/alpha/include/asm/page.h
>> +++ b/arch/alpha/include/asm/page.h
>> @@ -17,8 +17,9 @@
>>  extern void clear_page(void *page);
>>  #define clear_user_page(page, vaddr, pg)       clear_page(page)
>>
>> -#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
>> -       vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
>> +#define vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order) \
>> +       vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO | (gfp), \
>> +                       order, vma, vaddr, false)
> 
> I don't think we need to worry about gfp if we want to make a minimum
> series. There would be many discussion points around it, e.g., I
> already disagree with what you chose: GFP_TRANSHUGE_LIGHT would be
> more suitable than __GFP_NORETRY, and there are even better options
> than GFP_TRANSHUGE_LIGHT.

OK, but disagreeing about what the GFP flags should be is different from
disagreeing about whether we need a mechanism for specifying them. Given I need
to do the changes to add `order` I thought it was sensible to add the gfp flags
at the same time.

I'll follow your advice and remove the gfp flag addition for now.
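
e.g. the alpha macro quoted above would presumably collapse back to taking
just an order (illustrative, not a posted change):

 #define vma_alloc_zeroed_movable_folio(vma, vaddr, order) \
 	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, \
 			order, vma, vaddr, false)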

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 00/10] variable-order, large folios for anonymous memory
  2023-06-27  3:30   ` Yu Zhao
  (?)
@ 2023-06-27  7:49     ` Yu Zhao
  -1 siblings, 0 replies; 148+ messages in thread
From: Yu Zhao @ 2023-06-27  7:49 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On Mon, Jun 26, 2023 at 9:30 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >
> > Hi All,
> >
> > Following on from the previous RFCv2 [1], this series implements variable order,
> > large folios for anonymous memory. The objective of this is to improve
> > performance by allocating larger chunks of memory during anonymous page faults:
> >
> >  - Since SW (the kernel) is dealing with larger chunks of memory than base
> >    pages, there are efficiency savings to be had; fewer page faults, batched PTE
> >    and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
> >    overhead. This should benefit all architectures.
> >  - Since we are now mapping physically contiguous chunks of memory, we can take
> >    advantage of HW TLB compression techniques. A reduction in TLB pressure
> >    speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
> >    TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
> >
> > This patch set deals with the SW side of things only and based on feedback from
> > the RFC, aims to be the most minimal initial change, upon which future
> > incremental changes can be added. For this reason, the new behaviour is hidden
> > behind a new Kconfig switch, CONFIG_LARGE_ANON_FOLIO, which is disabled by
> > default. Although the code has been refactored to parameterize the desired order
> > of the allocation, when the feature is disabled (by forcing the order to be
> > always 0) my performance tests measure no regression. So I'm hoping this will be
> > a suitable mechanism to allow incremental submissions to the kernel without
> > affecting the rest of the world.
> >
> > The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
> > [2], which is a hard dependency. I'm not sure of Matthew's exact plans for
> > getting that series into the kernel, but I'm hoping we can start the review
> > process on this patch set independently. I have a branch at [3].
> >
> > I've posted a separate series concerning the HW part (contpte mapping) for arm64
> > at [4].
> >
> >
> > Performance
> > -----------
> >
> > Below results show 2 benchmarks; kernel compilation and speedometer 2.0 (a
> > javascript benchmark running in Chromium). Both cases are running on Ampere
> > Altra with 1 NUMA node enabled, Ubuntu 22.04 and XFS filesystem. Each benchmark
> > is repeated 15 times over 5 reboots and averaged.
> >
> > All improvements are relative to baseline-4k. 'anonfolio-basic' is this series.
> > 'anonfolio' is the full patch set similar to the RFC with the additional changes
> > to the extra 3 fault paths. The rest of the configs are described at [4].
> >
> > Kernel Compilation (smaller is better):
> >
> > | kernel          |   real-time |   kern-time |   user-time |
> > |:----------------|------------:|------------:|------------:|
> > | baseline-4k     |        0.0% |        0.0% |        0.0% |
> > | anonfolio-basic |       -5.3% |      -42.9% |       -0.6% |
> > | anonfolio       |       -5.4% |      -46.0% |       -0.3% |
> > | contpte         |       -6.8% |      -45.7% |       -2.1% |
> > | exefolio        |       -8.4% |      -46.4% |       -3.7% |
> > | baseline-16k    |       -8.7% |      -49.2% |       -3.7% |
> > | baseline-64k    |      -10.5% |      -66.0% |       -3.5% |
> >
> > Speedometer 2.0 (bigger is better):
> >
> > | kernel          |   runs_per_min |
> > |:----------------|---------------:|
> > | baseline-4k     |           0.0% |
> > | anonfolio-basic |           0.7% |
> > | anonfolio       |           1.2% |
> > | contpte         |           3.1% |
> > | exefolio        |           4.2% |
> > | baseline-16k    |           5.3% |
>
> Thanks for pushing this forward!
>
> > Changes since RFCv2
> > -------------------
> >
> >   - Simplified series to bare minimum (on David Hildenbrand's advice)
>
> My impression is that this series still includes many pieces that can
> be split out and discussed separately with followup series.
>
> (I skipped 04/10 and will look at it tomorrow.)

I went through the series twice. Here is what I think a bare minimum
series (easier to review/debug/land) would look like:
1. a new arch specific function providing a preferred order within (0,
PMD_ORDER) (see the sketch after this list).
2. an extended anon folio alloc API taking that order (02/10, partially).
3. an updated folio_add_new_anon_rmap() covering the large() &&
!pmd_mappable() case (similar to 04/10).
4. s/folio_test_pmd_mappable/folio_test_large/ in page_remove_rmap()
(06/10, reviewed-by provided).
5. finally, use the extended anon folio alloc API with the arch
preferred order in do_anonymous_page() (10/10, partially).

The rest can be split out into separate series and move forward in
parallel with probably a long list of things we need/want to do.
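
A sketch of item 1, assuming it ends up as a runtime hook (the name
arch_wants_pte_order() is the one used later in the thread); the generic
default would just return order 0 and an arch could override it:

#ifndef arch_wants_pte_order
static inline int arch_wants_pte_order(void)
{
	/* Order 0 keeps today's behaviour: one base page per fault. */
	return 0;
}
#endif

Whether it should also take the vma (so the THP hint can factor in) is one of
the follow-up questions in the per-patch replies.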

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 03/10] mm: Introduce try_vma_alloc_movable_folio()
  2023-06-27  5:29       ` Yu Zhao
@ 2023-06-27  7:56         ` Ryan Roberts
  -1 siblings, 0 replies; 148+ messages in thread
From: Ryan Roberts @ 2023-06-27  7:56 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On 27/06/2023 06:29, Yu Zhao wrote:
> On Mon, Jun 26, 2023 at 8:34 PM Yu Zhao <yuzhao@google.com> wrote:
>>
>> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> Opportunistically attempt to allocate high-order folios in highmem,
>>> optionally zeroed. Retry with lower orders all the way to order-0, until
>>> success. Although, of note, order-1 allocations are skipped since a
>>> large folio must be at least order-2 to work with the THP machinery. The
>>> user must check what they got with folio_order().
>>>
>>> This will be used to opportunistically allocate large folios for
>>> anonymous memory with a sensible fallback under memory pressure.
>>>
>>> For attempts to allocate non-0 orders, we set __GFP_NORETRY to prevent
>>> high latency due to reclaim, instead preferring to just try for a lower
>>> order. The same approach is used by the readahead code when allocating
>>> large folios.
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>  mm/memory.c | 33 +++++++++++++++++++++++++++++++++
>>>  1 file changed, 33 insertions(+)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 367bbbb29d91..53896d46e686 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3001,6 +3001,39 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
>>>         return 0;
>>>  }
>>>
>>> +static inline struct folio *vma_alloc_movable_folio(struct vm_area_struct *vma,
>>> +                               unsigned long vaddr, int order, bool zeroed)
>>> +{
>>> +       gfp_t gfp = order > 0 ? __GFP_NORETRY | __GFP_NOWARN : 0;
>>> +
>>> +       if (zeroed)
>>> +               return vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order);
>>> +       else
>>> +               return vma_alloc_folio(GFP_HIGHUSER_MOVABLE | gfp, order, vma,
>>> +                                                               vaddr, false);
>>> +}
>>> +
>>> +/*
>>> + * Opportunistically attempt to allocate high-order folios, retrying with lower
>>> + * orders all the way to order-0, until success. order-1 allocations are skipped
>>> + * since a folio must be at least order-2 to work with the THP machinery. The
>>> + * user must check what they got with folio_order(). vaddr can be any virtual
>>> + * address that will be mapped by the allocated folio.
>>> + */
>>> +static struct folio *try_vma_alloc_movable_folio(struct vm_area_struct *vma,
>>> +                               unsigned long vaddr, int order, bool zeroed)
>>> +{
>>> +       struct folio *folio;
>>> +
>>> +       for (; order > 1; order--) {
>>> +               folio = vma_alloc_movable_folio(vma, vaddr, order, zeroed);
>>> +               if (folio)
>>> +                       return folio;
>>> +       }
>>> +
>>> +       return vma_alloc_movable_folio(vma, vaddr, 0, zeroed);
>>> +}
>>
>> I'd drop this patch. Instead, in do_anonymous_page():
>>
>>   if (IS_ENABLED(CONFIG_ARCH_WANTS_PTE_ORDER))
>>     folio = vma_alloc_zeroed_movable_folio(vma, addr,
>> CONFIG_ARCH_WANTS_PTE_ORDER))
>>
>>   if (!folio)
>>     folio = vma_alloc_zeroed_movable_folio(vma, addr, 0);
> 
> I meant a runtime function arch_wants_pte_order() (Its default
> implementation would return 0.)

There are a bunch of things which you are implying here which I'll try to make
explicit:

I think you are implying that we shouldn't retry allocation with intermediate
orders; but only try the order requested by the arch (arch_wants_pte_order())
and 0. Correct? For arm64 at least, I would like the VMA's THP hint to be a
factor in determining the preferred order (see patches 8 and 9). So I would add
a vma parameter to arch_wants_pte_order() to allow for this.
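
That is, something like this (a sketch only; the generic version would ignore
the vma and return 0, and arm64 would override it):

	/* Preferred folio order for anonymous faults in this VMA; 0 = base pages. */
	int arch_wants_pte_order(struct vm_area_struct *vma);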

For the case where the THP hint is present, the arch will request 2M (if
the page size is 16K or 64K). If that fails to allocate, there is still value in
allocating a 64K folio (which is order 2 in the 16K case). Without the retry
with intermediate orders logic, we would not get this.

We can't just blindly allocate a folio of arch_wants_pte_order() size because it
might overlap with existing populated PTEs, or cross the bounds of the VMA (or a
number of other things - see calc_anon_folio_order_alloc() in patch 10). Are you
implying that if there is any kind of issue like this, then we should go
directly to order 0? I can kind of see the argument from a minimizing
fragmentation perspective, but for best possible performance I think we are
better off "packing the bin" with intermediate orders.

You're also implying that a runtime arch_wants_pte_order() function is better
than the Kconfig stuff I did in patch 8. On reflection, I agree with you here. I
think you mentioned that AMD supports coalescing 8 pages on some CPUs - so you
would probably want runtime logic to determine if you are on an appropriate AMD
CPU as part of the decision in that function?

The real reason for the existence of try_vma_alloc_movable_folio() is that I'm
reusing it on the other fault paths (which are no longer part of this series).
But I guess that's not a good reason to keep this until we get to those patches.

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 04/10] mm: Implement folio_add_new_anon_rmap_range()
  2023-06-27  7:08     ` Yu Zhao
@ 2023-06-27  8:09       ` Ryan Roberts
  -1 siblings, 0 replies; 148+ messages in thread
From: Ryan Roberts @ 2023-06-27  8:09 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On 27/06/2023 08:08, Yu Zhao wrote:
> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Like folio_add_new_anon_rmap() but batch-rmaps a range of pages
>> belonging to a folio, for efficiency savings. All pages are accounted as
>> small pages.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  include/linux/rmap.h |  2 ++
>>  mm/rmap.c            | 43 +++++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 45 insertions(+)
>>
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index a3825ce81102..15433a3d0cbf 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -196,6 +196,8 @@ void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
>>                 unsigned long address);
>>  void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>>                 unsigned long address);
>> +void folio_add_new_anon_rmap_range(struct folio *folio, struct page *page,
>> +               int nr, struct vm_area_struct *vma, unsigned long address);
> 
> We should update folio_add_new_anon_rmap() to support large() &&
> !folio_test_pmd_mappable() folios instead.
> 
> I double checked all places currently using folio_add_new_anon_rmap(),
> and as expected, none actually allocates large() &&
> !folio_test_pmd_mappable() and maps it one by one, which makes the
> cases simpler, i.e.,
>   if (!large())
>     // the existing basepage case
>   else if (!folio_test_pmd_mappable())
>     // our new case
>   else
>     // the existing THP case

I don't have a strong opinion either way. Happy to go with this suggestion. But
the reason I did it as a new function was because I was following the pattern in
[1] which adds a new folio_add_file_rmap_range() function.

[1] https://lore.kernel.org/linux-mm/20230315051444.3229621-35-willy@infradead.org/
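
For concreteness, a sketch of how those three cases could slot into
folio_add_new_anon_rmap() itself; the mapcount/statistics handling for the
middle branch is modelled on how 04/10 accounts each page individually and
would need checking against the actual patch and the existing THP path:

void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
		unsigned long address)
{
	int nr = folio_nr_pages(folio);

	__folio_set_swapbacked(folio);

	if (!folio_test_large(folio)) {
		/* existing base-page case: mapcount starts at -1 */
		atomic_set(&folio->_mapcount, 0);
	} else if (!folio_test_pmd_mappable(folio)) {
		/* new case: each subpage mapped and accounted individually */
		int i;

		for (i = 0; i < nr; i++)
			atomic_set(&folio_page(folio, i)->_mapcount, 0);
		atomic_set(&folio->_nr_pages_mapped, nr);
	} else {
		/* existing THP case */
		atomic_set(&folio->_entire_mapcount, 0);
		atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);
		__lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
	}

	__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
	__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
}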


> 
>>  void page_add_file_rmap(struct page *, struct vm_area_struct *,
>>                 bool compound);
>>  void folio_add_file_rmap_range(struct folio *, struct page *, unsigned int nr,
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 1d8369549424..4050bcea7ae7 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1305,6 +1305,49 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>>         __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>>  }
>>
>> +/**
>> + * folio_add_new_anon_rmap_range - Add mapping to a set of pages within a new
>> + * anonymous potentially large folio.
>> + * @folio:      The folio containing the pages to be mapped
>> + * @page:       First page in the folio to be mapped
>> + * @nr:         Number of pages to be mapped
>> + * @vma:        the vm area in which the mapping is added
>> + * @address:    the user virtual address of the first page to be mapped
>> + *
>> + * Like folio_add_new_anon_rmap() but batch-maps a range of pages within a folio
>> + * using non-THP accounting. Like folio_add_new_anon_rmap(), the inc-and-test is
>> + * bypassed and the folio does not have to be locked. All pages in the folio are
>> + * individually accounted.
>> + *
>> + * As the folio is new, it's assumed to be mapped exclusively by a single
>> + * process.
>> + */
>> +void folio_add_new_anon_rmap_range(struct folio *folio, struct page *page,
>> +               int nr, struct vm_area_struct *vma, unsigned long address)
>> +{
>> +       int i;
>> +
>> +       VM_BUG_ON_VMA(address < vma->vm_start ||
>> +                     address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
> 
> BTW, VM_BUG_ON* shouldn't be used in new code:
> Documentation/process/coding-style.rst

Thanks, sorry about that. Was copy-pasting from folio_add_new_anon_rmap().


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 01/10] mm: Expose clear_huge_page() unconditionally
  2023-06-27  7:21       ` Ryan Roberts
@ 2023-06-27  8:29         ` Yu Zhao
  -1 siblings, 0 replies; 148+ messages in thread
From: Yu Zhao @ 2023-06-27  8:29 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On Tue, Jun 27, 2023 at 1:21 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 27/06/2023 02:55, Yu Zhao wrote:
> > On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> In preparation for extending vma_alloc_zeroed_movable_folio() to
> >> allocate an arbitrary order folio, expose clear_huge_page()
> >> unconditionally, so that it can be used to zero the allocated folio in
> >> the generic implementation of vma_alloc_zeroed_movable_folio().
> >>
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >> ---
> >>  include/linux/mm.h | 3 ++-
> >>  mm/memory.c        | 2 +-
> >>  2 files changed, 3 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >> index 7f1741bd870a..7e3bf45e6491 100644
> >> --- a/include/linux/mm.h
> >> +++ b/include/linux/mm.h
> >> @@ -3684,10 +3684,11 @@ enum mf_action_page_type {
> >>   */
> >>  extern const struct attribute_group memory_failure_attr_group;
> >>
> >> -#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
> >>  extern void clear_huge_page(struct page *page,
> >>                             unsigned long addr_hint,
> >>                             unsigned int pages_per_huge_page);
> >> +
> >> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
> >
> > We might not want to depend on THP eventually. Right now, we still
> > have to, unless splitting is optional, which seems to contradict
> > 06/10. (deferred_split_folio()  is a nop without THP.)
>
> Yes, I agree - for large anon folios to work, we depend on THP. But I don't
> think that helps us here.
>
> In the next patch, I give vma_alloc_zeroed_movable_folio() an extra `order`
> parameter. So the generic/default version of the function now needs a way to
> clear a compound page.
>
> I guess I could do something like:
>
>  static inline
>  struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
>                                    unsigned long vaddr, gfp_t gfp, int order)
>  {
>         struct folio *folio;
>
>         folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE | gfp,
>                                         order, vma, vaddr, false);
>         if (folio) {
> #ifdef CONFIG_LARGE_FOLIO
>                 clear_huge_page(&folio->page, vaddr, 1U << order);
> #else
>                 BUG_ON(order != 0);
>                 clear_user_highpage(&folio->page, vaddr);
> #endif
>         }
>
>         return folio;
>  }
>
> But that's pretty messy, and there's nothing stopping other users coming along
> that pass order != 0 and get surprised by the BUG_ON.

#ifdef CONFIG_LARGE_ANON_FOLIO // depends on CONFIG_TRANSPARENT_HUGEPAGE
struct folio *alloc_anon_folio(struct vm_area_struct *vma,
                               unsigned long vaddr, int order)
{
  // how do_huge_pmd_anonymous_page() allocs and clears
  vma_alloc_folio(..., *true*);
}
#else
#define alloc_anon_folio(vma, addr, order) \
  vma_alloc_zeroed_movable_folio(vma, addr)
#endif
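
Fleshing that out slightly, a minimal sketch of what the
CONFIG_LARGE_ANON_FOLIO branch could expand to, assuming the folio is cleared
the way do_huge_pmd_anonymous_page() does it; GFP_TRANSHUGE_LIGHT is only a
placeholder for whichever gfp policy gets settled on:

static struct folio *alloc_anon_folio(struct vm_area_struct *vma,
				      unsigned long vaddr, int order)
{
	struct folio *folio;

	folio = vma_alloc_folio(GFP_TRANSHUGE_LIGHT, order, vma, vaddr, true);
	if (folio)
		clear_huge_page(&folio->page, vaddr, 1U << order);

	return folio;
}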

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 01/10] mm: Expose clear_huge_page() unconditionally
  2023-06-27  8:29         ` Yu Zhao
  (?)
@ 2023-06-27  9:41           ` Ryan Roberts
  -1 siblings, 0 replies; 148+ messages in thread
From: Ryan Roberts @ 2023-06-27  9:41 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On 27/06/2023 09:29, Yu Zhao wrote:
> On Tue, Jun 27, 2023 at 1:21 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 27/06/2023 02:55, Yu Zhao wrote:
>>> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> In preparation for extending vma_alloc_zeroed_movable_folio() to
>>>> allocate a arbitrary order folio, expose clear_huge_page()
>>>> unconditionally, so that it can be used to zero the allocated folio in
>>>> the generic implementation of vma_alloc_zeroed_movable_folio().
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>>  include/linux/mm.h | 3 ++-
>>>>  mm/memory.c        | 2 +-
>>>>  2 files changed, 3 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index 7f1741bd870a..7e3bf45e6491 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -3684,10 +3684,11 @@ enum mf_action_page_type {
>>>>   */
>>>>  extern const struct attribute_group memory_failure_attr_group;
>>>>
>>>> -#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
>>>>  extern void clear_huge_page(struct page *page,
>>>>                             unsigned long addr_hint,
>>>>                             unsigned int pages_per_huge_page);
>>>> +
>>>> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
>>>
>>> We might not want to depend on THP eventually. Right now, we still
>>> have to, unless splitting is optional, which seems to contradict
>>> 06/10. (deferred_split_folio()  is a nop without THP.)
>>
>> Yes, I agree - for large anon folios to work, we depend on THP. But I don't
>> think that helps us here.
>>
>> In the next patch, I give vma_alloc_zeroed_movable_folio() an extra `order`
>> parameter. So the generic/default version of the function now needs a way to
>> clear a compound page.
>>
>> I guess I could do something like:
>>
>>  static inline
>>  struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
>>                                    unsigned long vaddr, gfp_t gfp, int order)
>>  {
>>         struct folio *folio;
>>
>>         folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE | gfp,
>>                                         order, vma, vaddr, false);
>>         if (folio) {
>> #ifdef CONFIG_LARGE_FOLIO
>>                 clear_huge_page(&folio->page, vaddr, 1U << order);
>> #else
>>                 BUG_ON(order != 0);
>>                 clear_user_highpage(&folio->page, vaddr);
>> #endif
>>         }
>>
>>         return folio;
>>  }
>>
>> But that's pretty messy and there's no reason why other users might come along
>> that pass order != 0 and will be surprised by the BUG_ON.
> 
> #ifdef CONFIG_LARGE_ANON_FOLIO // depends on CONFIG_TRANSPARENT_HUGE_PAGE
> struct folio *alloc_anon_folio(struct vm_area_struct *vma, unsigned
> long vaddr, int order)
> {
>   // how do_huge_pmd_anonymous_page() allocs and clears
>   vma_alloc_folio(..., *true*);

This controls the mem allocation policy (see mempolicy.c::vma_alloc_folio()), not
clearing. Clearing is done in __do_huge_pmd_anonymous_page():

  clear_huge_page(page, vmf->address, HPAGE_PMD_NR);

> }
> #else
> #define alloc_anon_folio(vma, addr, order)
> vma_alloc_zeroed_movable_folio(vma, addr)
> #endif

Sorry, I don't get this at all... If you are suggesting we bypass
vma_alloc_zeroed_movable_folio() entirely for the LARGE_ANON_FOLIO case, I don't
think that works, because the arch code adds its own gfp flags there. For
example, arm64 adds __GFP_ZEROTAGS for VM_MTE VMAs.

Perhaps we can do away with an arch-owned vma_alloc_zeroed_movable_folio() and
replace it with a new arch_get_zeroed_movable_gfp_flags(), then have
alloc_anon_folio() add in those flags?
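
Something like the below is what I have in mind - just a sketch;
arch_get_zeroed_movable_gfp_flags() is a made-up hook that doesn't exist today,
and it leans on clear_huge_page() being exposed unconditionally as per this patch:

/* Generic fallback: the arch only supplies any extra gfp flags it needs
 * (arm64 would return __GFP_ZEROTAGS for VM_MTE VMAs, for example).
 */
#ifndef arch_get_zeroed_movable_gfp_flags
static inline gfp_t arch_get_zeroed_movable_gfp_flags(struct vm_area_struct *vma)
{
	return 0;
}
#endif

static inline
struct folio *alloc_anon_folio(struct vm_area_struct *vma,
			       unsigned long vaddr, int order)
{
	gfp_t gfp = GFP_HIGHUSER_MOVABLE |
		    arch_get_zeroed_movable_gfp_flags(vma);
	struct folio *folio;

	folio = vma_alloc_folio(gfp, order, vma, vaddr, false);
	if (folio)
		clear_huge_page(&folio->page, vaddr, 1U << order);

	return folio;
}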

But I still think the cleanest, simplest change is just to unconditionally
expose clear_huge_page() as I've done it.


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 07/10] mm: Batch-zap large anonymous folio PTE mappings
  2023-06-27  3:04     ` Yu Zhao
  (?)
@ 2023-06-27  9:46       ` Ryan Roberts
  -1 siblings, 0 replies; 148+ messages in thread
From: Ryan Roberts @ 2023-06-27  9:46 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On 27/06/2023 04:04, Yu Zhao wrote:
> On Mon, Jun 26, 2023 at 11:15 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> This allows batching the rmap removal with folio_remove_rmap_range(),
>> which means we avoid spuriously adding a partially unmapped folio to the
>> deferrred split queue in the common case, which reduces split queue lock
>> contention.
>>
>> Previously each page was removed from the rmap individually with
>> page_remove_rmap(). If the first page belonged to a large folio, this
>> would cause page_remove_rmap() to conclude that the folio was now
>> partially mapped and add the folio to the deferred split queue. But
>> subsequent calls would cause the folio to become fully unmapped, meaning
>> there is no value to adding it to the split queue.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  mm/memory.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 119 insertions(+)
> 
> We don't really need this patch for the series to work. So again, I'd
> split it out.

The reason I included it in the MVP was that, without it, I was seeing high
contention on the split queue lock, which was significantly eating into the
performance gains. But since then Yin Fengwei's patch to make this path more
efficient has been accepted, so perhaps that solves the problem, in which case
we can drop this as you suggest. If I still see a reasonable perf improvement
without it, I'll drop it for v2.
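
(For context, the batching described in the commit message above is roughly the
shape below. This is only a simplified sketch: zap_anon_folio_range() is an
illustrative name, it assumes the folio_remove_rmap_range(folio, page, nr, vma)
helper added earlier in the series, and it ignores TLB flushing and rss/dirty
accounting.)

static void zap_anon_folio_range(struct mmu_gather *tlb,
				 struct vm_area_struct *vma,
				 struct folio *folio, struct page *page,
				 pte_t *pte, unsigned long addr, int nr)
{
	int i;

	/* Clear the whole run of PTEs that map this large folio... */
	for (i = 0; i < nr; i++, pte++, addr += PAGE_SIZE)
		ptep_get_and_clear_full(vma->vm_mm, addr, pte, tlb->fullmm);

	/*
	 * ...then remove all nr pages from the rmap in one call, so a folio
	 * that ends up fully unmapped is never transiently seen as partially
	 * mapped and spuriously added to the deferred split queue.
	 */
	folio_remove_rmap_range(folio, page, nr, vma);
}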

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 08/10] mm: Kconfig hooks to determine max anon folio allocation order
  2023-06-27  2:47     ` Yu Zhao
  (?)
@ 2023-06-27  9:54       ` Ryan Roberts
  -1 siblings, 0 replies; 148+ messages in thread
From: Ryan Roberts @ 2023-06-27  9:54 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On 27/06/2023 03:47, Yu Zhao wrote:
> On Mon, Jun 26, 2023 at 11:15 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> For variable-order anonymous folios, we need to determine the order that
>> we will allocate. From a SW perspective, the higher the order we
>> allocate, the less overhead we will have; fewer faults, fewer folios in
>> lists, etc. But of course there will also be more memory wastage as the
>> order increases.
>>
>> From a HW perspective, there are memory block sizes that can be
>> beneficial to reducing TLB pressure. arm64, for example, has the ability
>> to map "contpte" sized chunks (64K for a 4K base page, 2M for 16K and
>> 64K base pages) such that one of these chunks only uses a single TLB
>> entry.
>>
>> So we let the architecture specify the order of the maximally beneficial
>> mapping unit when PTE-mapped. Furthermore, because in some cases, this
>> order may be quite big (and therefore potentially wasteful of memory),
>> allow the arch to specify 2 values; One is the max order for a mapping
>> that _would not_ use THP if all size and alignment constraints were met,
>> and the other is the max order for a mapping that _would_ use THP if all
>> those constraints were met.
>>
>> Implement this with Kconfig by introducing some new options to allow the
>> architecture to declare that it supports large anonymous folios along
>> with these 2 preferred max order values. Then introduce a user-facing
>> option, LARGE_ANON_FOLIO, which defaults to disabled and can only be
>> enabled if the architecture has declared its support. When disabled, it
>> forces the max order values, LARGE_ANON_FOLIO_NOTHP_ORDER_MAX and
>> LARGE_ANON_FOLIO_THP_ORDER_MAX to 0, meaning only a single page is ever
>> allocated.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  mm/Kconfig  | 39 +++++++++++++++++++++++++++++++++++++++
>>  mm/memory.c |  8 ++++++++
>>  2 files changed, 47 insertions(+)
>>
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 7672a22647b4..f4ba48c37b75 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -1208,4 +1208,43 @@ config PER_VMA_LOCK
>>
>>  source "mm/damon/Kconfig"
>>
>> +config ARCH_SUPPORTS_LARGE_ANON_FOLIO
>> +       def_bool n
>> +       help
>> +         An arch should select this symbol if wants to allow LARGE_ANON_FOLIO
>> +         to be enabled. It must also set the following integer values:
>> +         - ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
>> +         - ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
>> +
>> +config ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
>> +       int
>> +       help
>> +         The maximum size of folio to allocate for an anonymous VMA PTE-mapping
>> +         that does not have the MADV_HUGEPAGE hint set.
>> +
>> +config ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
>> +       int
>> +       help
>> +         The maximum size of folio to allocate for an anonymous VMA PTE-mapping
>> +         that has the MADV_HUGEPAGE hint set.
>> +
>> +config LARGE_ANON_FOLIO
>> +       bool "Allocate large folios for anonymous memory"
>> +       depends on ARCH_SUPPORTS_LARGE_ANON_FOLIO
>> +       default n
>> +       help
>> +         Use large (bigger than order-0) folios to back anonymous memory where
>> +         possible. This reduces the number of page faults, as well as other
>> +         per-page overheads to improve performance for many workloads.
>> +
>> +config LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
>> +       int
>> +       default 0 if !LARGE_ANON_FOLIO
>> +       default ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
>> +
>> +config LARGE_ANON_FOLIO_THP_ORDER_MAX
>> +       int
>> +       default 0 if !LARGE_ANON_FOLIO
>> +       default ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
>> +
>>  endmenu
> 
> I don't think an MVP should add this many Kconfigs. One Kconfig sounds
> reasonable to me for now.

If we move to arch_wants_pte_order() as you suggested (in your response to patch
3), then I agree we can remove most of these. I still think we might want two,
though: for an arch that does not implement arch_wants_pte_order(), we wouldn't
want LARGE_ANON_FOLIO to show up in menuconfig, so we would still need
ARCH_SUPPORTS_LARGE_ANON_FOLIO:


config ARCH_SUPPORTS_LARGE_ANON_FOLIO
       def_bool n
       help
         An arch should select this symbol if it wants to allow LARGE_ANON_FOLIO
         to be enabled. In this case, it must also define arch_wants_pte_order().

config LARGE_ANON_FOLIO
       bool "Allocate large folios for anonymous memory"
       depends on ARCH_SUPPORTS_LARGE_ANON_FOLIO
       default n
       help
         Use large (bigger than order-0) folios to back anonymous memory where
         possible. This reduces the number of page faults, as well as other
         per-page overheads to improve performance for many workloads.
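
For illustration, the arch side could then be as simple as selecting
ARCH_SUPPORTS_LARGE_ANON_FOLIO in arch/arm64/Kconfig and providing something
like the below (just a sketch; the exact arch_wants_pte_order() signature is
still to be agreed - I've guessed at one taking the vma):

/* arch/arm64/include/asm/pgtable.h (sketch) */
#define arch_wants_pte_order arch_wants_pte_order
static inline int arch_wants_pte_order(struct vm_area_struct *vma)
{
	/* contpte span: 64K with 4K base pages, 2M with 16K/64K base pages */
	return CONT_PTE_SHIFT - PAGE_SHIFT;
}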

What do you think?


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 10/10] mm: Allocate large folios for anonymous memory
  2023-06-27  3:01     ` Yu Zhao
  (?)
@ 2023-06-27  9:57       ` Ryan Roberts
  -1 siblings, 0 replies; 148+ messages in thread
From: Ryan Roberts @ 2023-06-27  9:57 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On 27/06/2023 04:01, Yu Zhao wrote:
> On Mon, Jun 26, 2023 at 11:15 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> With all of the enabler patches in place, modify the anonymous memory
>> write allocation path so that it opportunistically attempts to allocate
>> a large folio up to `max_anon_folio_order()` size (This value is
>> ultimately configured by the architecture). This reduces the number of
>> page faults, reduces the size of (e.g. LRU) lists, and generally
>> improves performance by batching what were per-page operations into
>> per-(large)-folio operations.
>>
>> If CONFIG_LARGE_ANON_FOLIO is not enabled (the default) then
>> `max_anon_folio_order()` always returns 0, meaning we get the existing
>> allocation behaviour.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  mm/memory.c | 159 +++++++++++++++++++++++++++++++++++++++++++++++-----
>>  1 file changed, 144 insertions(+), 15 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index a8f7e2b28d7a..d23c44cc5092 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3161,6 +3161,90 @@ static inline int max_anon_folio_order(struct vm_area_struct *vma)
>>                 return CONFIG_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX;
>>  }
>>
>> +/*
>> + * Returns index of first pte that is not none, or nr if all are none.
>> + */
>> +static inline int check_ptes_none(pte_t *pte, int nr)
>> +{
>> +       int i;
>> +
>> +       for (i = 0; i < nr; i++) {
>> +               if (!pte_none(ptep_get(pte++)))
>> +                       return i;
>> +       }
>> +
>> +       return nr;
>> +}
>> +
>> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
> 
> As suggested previously in 03/10, we can leave this for later.

I disagree. This is the logic that prevents us from accidentally replacing
already-set PTEs, or wandering outside the VMA bounds, etc. How would you catch
all those corner cases without this?
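
To make that concrete, the check boils down to something like the below (a
simplified sketch of the idea rather than the exact code in the patch):

static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
{
	struct vm_area_struct *vma = vmf->vma;
	unsigned long addr;
	pte_t *pte;
	int nr;

	/*
	 * Fall back to smaller orders until the naturally-aligned range
	 * covering the fault fits inside the VMA and every covered PTE is
	 * still none, so we never overwrite an existing mapping.
	 */
	for (; order > 0; order--) {
		nr = 1 << order;
		addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);

		if (addr < vma->vm_start ||
		    addr + nr * PAGE_SIZE > vma->vm_end)
			continue;

		pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
		if (check_ptes_none(pte, nr) == nr)
			break;
	}

	return order;
}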

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 00/10] variable-order, large folios for anonymous memory
  2023-06-27  7:49     ` Yu Zhao
  (?)
@ 2023-06-27  9:59       ` Ryan Roberts
  -1 siblings, 0 replies; 148+ messages in thread
From: Ryan Roberts @ 2023-06-27  9:59 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On 27/06/2023 08:49, Yu Zhao wrote:
> On Mon, Jun 26, 2023 at 9:30 PM Yu Zhao <yuzhao@google.com> wrote:
>>
>> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> Hi All,
>>>
>>> Following on from the previous RFCv2 [1], this series implements variable order,
>>> large folios for anonymous memory. The objective of this is to improve
>>> performance by allocating larger chunks of memory during anonymous page faults:
>>>
>>>  - Since SW (the kernel) is dealing with larger chunks of memory than base
>>>    pages, there are efficiency savings to be had; fewer page faults, batched PTE
>>>    and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
>>>    overhead. This should benefit all architectures.
>>>  - Since we are now mapping physically contiguous chunks of memory, we can take
>>>    advantage of HW TLB compression techniques. A reduction in TLB pressure
>>>    speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>>>    TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>>>
>>> This patch set deals with the SW side of things only and based on feedback from
>>> the RFC, aims to be the most minimal initial change, upon which future
>>> incremental changes can be added. For this reason, the new behaviour is hidden
>>> behind a new Kconfig switch, CONFIG_LARGE_ANON_FOLIO, which is disabled by
>>> default. Although the code has been refactored to parameterize the desired order
>>> of the allocation, when the feature is disabled (by forcing the order to be
>>> always 0) my performance tests measure no regression. So I'm hoping this will be
>>> a suitable mechanism to allow incremental submissions to the kernel without
>>> affecting the rest of the world.
>>>
>>> The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
>>> [2], which is a hard dependency. I'm not sure of Matthew's exact plans for
>>> getting that series into the kernel, but I'm hoping we can start the review
>>> process on this patch set independently. I have a branch at [3].
>>>
>>> I've posted a separate series concerning the HW part (contpte mapping) for arm64
>>> at [4].
>>>
>>>
>>> Performance
>>> -----------
>>>
>>> Below results show 2 benchmarks; kernel compilation and speedometer 2.0 (a
>>> javascript benchmark running in Chromium). Both cases are running on Ampere
>>> Altra with 1 NUMA node enabled, Ubuntu 22.04 and XFS filesystem. Each benchmark
>>> is repeated 15 times over 5 reboots and averaged.
>>>
>>> All improvements are relative to baseline-4k. 'anonfolio-basic' is this series.
>>> 'anonfolio' is the full patch set similar to the RFC with the additional changes
>>> to the extra 3 fault paths. The rest of the configs are described at [4].
>>>
>>> Kernel Compilation (smaller is better):
>>>
>>> | kernel          |   real-time |   kern-time |   user-time |
>>> |:----------------|------------:|------------:|------------:|
>>> | baseline-4k     |        0.0% |        0.0% |        0.0% |
>>> | anonfolio-basic |       -5.3% |      -42.9% |       -0.6% |
>>> | anonfolio       |       -5.4% |      -46.0% |       -0.3% |
>>> | contpte         |       -6.8% |      -45.7% |       -2.1% |
>>> | exefolio        |       -8.4% |      -46.4% |       -3.7% |
>>> | baseline-16k    |       -8.7% |      -49.2% |       -3.7% |
>>> | baseline-64k    |      -10.5% |      -66.0% |       -3.5% |
>>>
>>> Speedometer 2.0 (bigger is better):
>>>
>>> | kernel          |   runs_per_min |
>>> |:----------------|---------------:|
>>> | baseline-4k     |           0.0% |
>>> | anonfolio-basic |           0.7% |
>>> | anonfolio       |           1.2% |
>>> | contpte         |           3.1% |
>>> | exefolio        |           4.2% |
>>> | baseline-16k    |           5.3% |
>>
>> Thanks for pushing this forward!
>>
>>> Changes since RFCv2
>>> -------------------
>>>
>>>   - Simplified series to bare minimum (on David Hildenbrand's advice)
>>
>> My impression is that this series still includes many pieces that can
>> be split out and discussed separately with followup series.
>>
>> (I skipped 04/10 and will look at it tomorrow.)
> 
> I went through the series twice. Here what I think a bare minimum
> series (easier to review/debug/land) would look like:
> 1. a new arch specific function providing a prefered order within (0,
> PMD_ORDER).
> 2. an extended anon folio alloc API taking that order (02/10, partially).
> 3. an updated folio_add_new_anon_rmap() covering the large() &&
> !pmd_mappable() case (similar to 04/10).
> 4. s/folio_test_pmd_mappable/folio_test_large/ in page_remove_rmap()
> (06/10, reviewed-by provided).
> 5. finally, use the extended anon folio alloc API with the arch
> preferred order in do_anonymous_page() (10/10, partially).
> 
> The rest can be split out into separate series and move forward in
> parallel with probably a long list of things we need/want to do.

Thanks for the fast review - I really appreciate it!

I've responded to many of your comments. I'd appreciate it if we can close out
those points; then I will work up a v2.

Thanks,
Ryan



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 01/10] mm: Expose clear_huge_page() unconditionally
  2023-06-27  9:41           ` Ryan Roberts
  (?)
@ 2023-06-27 18:26             ` Yu Zhao
  -1 siblings, 0 replies; 148+ messages in thread
From: Yu Zhao @ 2023-06-27 18:26 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On Tue, Jun 27, 2023 at 3:41 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 27/06/2023 09:29, Yu Zhao wrote:
> > On Tue, Jun 27, 2023 at 1:21 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 27/06/2023 02:55, Yu Zhao wrote:
> >>> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> In preparation for extending vma_alloc_zeroed_movable_folio() to
> >>>> allocate an arbitrary order folio, expose clear_huge_page()
> >>>> unconditionally, so that it can be used to zero the allocated folio in
> >>>> the generic implementation of vma_alloc_zeroed_movable_folio().
> >>>>
> >>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >>>> ---
> >>>>  include/linux/mm.h | 3 ++-
> >>>>  mm/memory.c        | 2 +-
> >>>>  2 files changed, 3 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >>>> index 7f1741bd870a..7e3bf45e6491 100644
> >>>> --- a/include/linux/mm.h
> >>>> +++ b/include/linux/mm.h
> >>>> @@ -3684,10 +3684,11 @@ enum mf_action_page_type {
> >>>>   */
> >>>>  extern const struct attribute_group memory_failure_attr_group;
> >>>>
> >>>> -#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
> >>>>  extern void clear_huge_page(struct page *page,
> >>>>                             unsigned long addr_hint,
> >>>>                             unsigned int pages_per_huge_page);
> >>>> +
> >>>> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
> >>>
> >>> We might not want to depend on THP eventually. Right now, we still
> >>> have to, unless splitting is optional, which seems to contradict
> >>> 06/10. (deferred_split_folio()  is a nop without THP.)
> >>
> >> Yes, I agree - for large anon folios to work, we depend on THP. But I don't
> >> think that helps us here.
> >>
> >> In the next patch, I give vma_alloc_zeroed_movable_folio() an extra `order`
> >> parameter. So the generic/default version of the function now needs a way to
> >> clear a compound page.
> >>
> >> I guess I could do something like:
> >>
> >>  static inline
> >>  struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
> >>                                    unsigned long vaddr, gfp_t gfp, int order)
> >>  {
> >>         struct folio *folio;
> >>
> >>         folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE | gfp,
> >>                                         order, vma, vaddr, false);
> >>         if (folio) {
> >> #ifdef CONFIG_LARGE_FOLIO
> >>                 clear_huge_page(&folio->page, vaddr, 1U << order);
> >> #else
> >>                 BUG_ON(order != 0);
> >>                 clear_user_highpage(&folio->page, vaddr);
> >> #endif
> >>         }
> >>
> >>         return folio;
> >>  }
> >>
> >> But that's pretty messy, and there's nothing to stop other users coming along
> >> who pass order != 0 and get surprised by the BUG_ON.
> >
> > #ifdef CONFIG_LARGE_ANON_FOLIO // depends on CONFIG_TRANSPARENT_HUGE_PAGE
> > struct folio *alloc_anon_folio(struct vm_area_struct *vma, unsigned
> > long vaddr, int order)
> > {
> >   // how do_huge_pmd_anonymous_page() allocs and clears
> >   vma_alloc_folio(..., *true*);
>
> This controls the mem allocation policy (see mempolicy.c::vma_alloc_folio()) not
> clearing. Clearing is done in __do_huge_pmd_anonymous_page():
>
>   clear_huge_page(page, vmf->address, HPAGE_PMD_NR);

Sorry for rushing this previously. This is what I meant. The #ifdef
makes it safe to use clear_huge_page() without 01/10. I highlighted
the last parameter to vma_alloc_folio() only because it's different
from what you chose (not implying it clears the folio).

> > }
> > #else
> > #define alloc_anon_folio(vma, addr, order)
> > vma_alloc_zeroed_movable_folio(vma, addr)
> > #endif
>
> Sorry I don't get this at all... If you are suggesting to bypass
> vma_alloc_zeroed_movable_folio() entirely for the LARGE_ANON_FOLIO case

Correct.

> I don't
> think that works because the arch code adds its own gfp flags there. For
> example, arm64 adds __GFP_ZEROTAGS for VM_MTE VMAs.

I think it's the opposite: it should be safer to reuse the THP code because
1. It's an existing case that has been working for PMD_ORDER folios
mapped by PTEs, and it's an arch-independent API which would be easier
to review.
2. Using vma_alloc_zeroed_movable_folio() for large folios is a *new*
case. It's an arch-*dependent* API, and I have no idea what VM_MTE does
(or should do) with large folios, nor do I plan to answer that for now.

> Perhaps we can do away with an arch-owned vma_alloc_zeroed_movable_folio() and
> replace it with a new arch_get_zeroed_movable_gfp_flags() then
> alloc_anon_folio() add in those flags?
>
> But I still think the cleanest, simplest change is just to unconditionally
> expose clear_huge_page() as I've done it.

The fundamental choice there, as I see it, is whether the first step
of large anon folios should lean toward the THP code base or the base
page code base (I'm a big fan of the answer "Neither -- we should
create something entirely new instead"). My POV is that the THP code
base would allow us to move faster, since it's proven to work for a
very similar case (PMD_ORDER folios mapped by PTEs).
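
To make that concrete, here is a minimal sketch of the #ifdef approach I have
in mind (the gfp choice and the exact shape are assumptions on my side, not
tested code):

  #ifdef CONFIG_LARGE_ANON_FOLIO /* depends on CONFIG_TRANSPARENT_HUGEPAGE */
  static struct folio *alloc_anon_folio(struct vm_area_struct *vma,
                                        unsigned long vaddr, int order)
  {
          /* Allocate the way do_huge_pmd_anonymous_page() does ... */
          struct folio *folio = vma_alloc_folio(GFP_TRANSHUGE_LIGHT, order,
                                                vma, vaddr, true);

          /* ... and clear the way __do_huge_pmd_anonymous_page() does. */
          if (folio)
                  clear_huge_page(&folio->page, vaddr, 1U << order);

          return folio;
  }
  #else
  static struct folio *alloc_anon_folio(struct vm_area_struct *vma,
                                        unsigned long vaddr, int order)
  {
          /* order is always 0 here; keep the existing arch-aware path. */
          return vma_alloc_zeroed_movable_folio(vma, vaddr);
  }
  #endif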

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 10/10] mm: Allocate large folios for anonymous memory
  2023-06-27  9:57       ` Ryan Roberts
  (?)
@ 2023-06-27 18:33         ` Yu Zhao
  -1 siblings, 0 replies; 148+ messages in thread
From: Yu Zhao @ 2023-06-27 18:33 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On Tue, Jun 27, 2023 at 3:57 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 27/06/2023 04:01, Yu Zhao wrote:
> > On Mon, Jun 26, 2023 at 11:15 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> With all of the enabler patches in place, modify the anonymous memory
> >> write allocation path so that it opportunistically attempts to allocate
> >> a large folio up to `max_anon_folio_order()` size (This value is
> >> ultimately configured by the architecture). This reduces the number of
> >> page faults, reduces the size of (e.g. LRU) lists, and generally
> >> improves performance by batching what were per-page operations into
> >> per-(large)-folio operations.
> >>
> >> If CONFIG_LARGE_ANON_FOLIO is not enabled (the default) then
> >> `max_anon_folio_order()` always returns 0, meaning we get the existing
> >> allocation behaviour.
> >>
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >> ---
> >>  mm/memory.c | 159 +++++++++++++++++++++++++++++++++++++++++++++++-----
> >>  1 file changed, 144 insertions(+), 15 deletions(-)
> >>
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index a8f7e2b28d7a..d23c44cc5092 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -3161,6 +3161,90 @@ static inline int max_anon_folio_order(struct vm_area_struct *vma)
> >>                 return CONFIG_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX;
> >>  }
> >>
> >> +/*
> >> + * Returns index of first pte that is not none, or nr if all are none.
> >> + */
> >> +static inline int check_ptes_none(pte_t *pte, int nr)
> >> +{
> >> +       int i;
> >> +
> >> +       for (i = 0; i < nr; i++) {
> >> +               if (!pte_none(ptep_get(pte++)))
> >> +                       return i;
> >> +       }
> >> +
> >> +       return nr;
> >> +}
> >> +
> >> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
> >
> > As suggested previously in 03/10, we can leave this for later.
>
> I disagree. This is the logic that prevents us from accidentally replacing
> already set PTEs, or wandering out of the VMA bounds etc. How would you catch
> all those corner cases without this?

Again, sorry for not being clear previously: we definitely need to
handle alignments & overlaps. But the fallback, i.e., "for (; order >
1; order--) {" in calc_anon_folio_order_alloc() is not necessary.

For now, we just need something like

  bool is_order_suitable() {
    // check whether it fits properly
  }

Later on, we could add

  alloc_anon_folio_best_effort()
  {
    for a list of fallback orders
      is_order_suitable()
  }
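
For concreteness, a minimal sketch of such a check, reusing the
check_ptes_none() helper quoted above (the name and the details are only
illustrative assumptions, and it presumes vmf->pte points at the PTE for
vmf->address with the PTL held):

  static bool is_order_suitable(struct vm_fault *vmf, int order)
  {
          int nr = 1 << order;
          unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
          pte_t *first_pte;

          /* The naturally aligned block must lie entirely within the VMA. */
          if (addr < vmf->vma->vm_start ||
              addr + (nr << PAGE_SHIFT) > vmf->vma->vm_end)
                  return false;

          /* And none of the covered PTEs may already be populated. */
          first_pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
          return check_ptes_none(first_pte, nr) == nr;
  }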

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 04/10] mm: Implement folio_add_new_anon_rmap_range()
  2023-06-27  7:08     ` Yu Zhao
  (?)
@ 2023-06-28  2:17       ` Yin Fengwei
  -1 siblings, 0 replies; 148+ messages in thread
From: Yin Fengwei @ 2023-06-28  2:17 UTC (permalink / raw)
  To: Yu Zhao, Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, David Hildenbrand, Catalin Marinas,
	Will Deacon, Geert Uytterhoeven, Christian Borntraeger,
	Sven Schnelle, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-kernel, linux-mm, linux-alpha,
	linux-arm-kernel, linux-ia64, linux-m68k, linux-s390



On 6/27/23 15:08, Yu Zhao wrote:
> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Like folio_add_new_anon_rmap() but batch-rmaps a range of pages
>> belonging to a folio, for efficiency savings. All pages are accounted as
>> small pages.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  include/linux/rmap.h |  2 ++
>>  mm/rmap.c            | 43 +++++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 45 insertions(+)
>>
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index a3825ce81102..15433a3d0cbf 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -196,6 +196,8 @@ void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
>>                 unsigned long address);
>>  void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>>                 unsigned long address);
>> +void folio_add_new_anon_rmap_range(struct folio *folio, struct page *page,
>> +               int nr, struct vm_area_struct *vma, unsigned long address);
> 
> We should update folio_add_new_anon_rmap() to support large() &&
> !folio_test_pmd_mappable() folios instead.
> 
> I double checked all places currently using folio_add_new_anon_rmap(),
> and as expected, none actually allocates large() &&
> !folio_test_pmd_mappable() and maps it one by one, which makes the
> cases simpler, i.e.,
>   if (!large())
>     // the existing basepage case
>   else if (!folio_test_pmd_mappable())
>     // our new case
>   else
>     // the existing THP case
I suppose we can merge the new case and existing THP case.
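
For reference, a rough sketch of the three cases from the outline above (the
accounting details here are assumptions, not reviewed code); the open question
is how much of the last two branches can actually be shared:

  void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
                  unsigned long address)
  {
          int i, nr = folio_nr_pages(folio);

          __folio_set_swapbacked(folio);

          if (!folio_test_large(folio)) {
                  /* existing base-page case: mapcount starts at -1 */
                  atomic_set(&folio->_mapcount, 0);
          } else if (!folio_test_pmd_mappable(folio)) {
                  /* new case: large folio mapped by PTEs, per-page accounting */
                  for (i = 0; i < nr; i++)
                          atomic_set(&folio_page(folio, i)->_mapcount, 0);
                  atomic_set(&folio->_nr_pages_mapped, nr);
          } else {
                  /* existing THP case: entire mapcount + compound accounting */
                  atomic_set(&folio->_entire_mapcount, 0);
                  atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);
                  __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
          }

          __lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
          __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
  }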


Regards
Yin, Fengwei

> 
>>  void page_add_file_rmap(struct page *, struct vm_area_struct *,
>>                 bool compound);
>>  void folio_add_file_rmap_range(struct folio *, struct page *, unsigned int nr,
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 1d8369549424..4050bcea7ae7 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1305,6 +1305,49 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>>         __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>>  }
>>
>> +/**
>> + * folio_add_new_anon_rmap_range - Add mapping to a set of pages within a new
>> + * anonymous potentially large folio.
>> + * @folio:      The folio containing the pages to be mapped
>> + * @page:       First page in the folio to be mapped
>> + * @nr:         Number of pages to be mapped
>> + * @vma:        the vm area in which the mapping is added
>> + * @address:    the user virtual address of the first page to be mapped
>> + *
>> + * Like folio_add_new_anon_rmap() but batch-maps a range of pages within a folio
>> + * using non-THP accounting. Like folio_add_new_anon_rmap(), the inc-and-test is
>> + * bypassed and the folio does not have to be locked. All pages in the folio are
>> + * individually accounted.
>> + *
>> + * As the folio is new, it's assumed to be mapped exclusively by a single
>> + * process.
>> + */
>> +void folio_add_new_anon_rmap_range(struct folio *folio, struct page *page,
>> +               int nr, struct vm_area_struct *vma, unsigned long address)
>> +{
>> +       int i;
>> +
>> +       VM_BUG_ON_VMA(address < vma->vm_start ||
>> +                     address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
> 
> BTW, VM_BUG_ON* shouldn't be used in new code:
> Documentation/process/coding-style.rst

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 04/10] mm: Implement folio_add_new_anon_rmap_range()
  2023-06-27  8:09       ` Ryan Roberts
  (?)
@ 2023-06-28  2:20         ` Yin Fengwei
  -1 siblings, 0 replies; 148+ messages in thread
From: Yin Fengwei @ 2023-06-28  2:20 UTC (permalink / raw)
  To: Ryan Roberts, Yu Zhao
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, David Hildenbrand, Catalin Marinas,
	Will Deacon, Geert Uytterhoeven, Christian Borntraeger,
	Sven Schnelle, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-kernel, linux-mm, linux-alpha,
	linux-arm-kernel, linux-ia64, linux-m68k, linux-s390



On 6/27/23 16:09, Ryan Roberts wrote:
> On 27/06/2023 08:08, Yu Zhao wrote:
>> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> Like folio_add_new_anon_rmap() but batch-rmaps a range of pages
>>> belonging to a folio, for efficiency savings. All pages are accounted as
>>> small pages.
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>  include/linux/rmap.h |  2 ++
>>>  mm/rmap.c            | 43 +++++++++++++++++++++++++++++++++++++++++++
>>>  2 files changed, 45 insertions(+)
>>>
>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>> index a3825ce81102..15433a3d0cbf 100644
>>> --- a/include/linux/rmap.h
>>> +++ b/include/linux/rmap.h
>>> @@ -196,6 +196,8 @@ void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
>>>                 unsigned long address);
>>>  void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>>>                 unsigned long address);
>>> +void folio_add_new_anon_rmap_range(struct folio *folio, struct page *page,
>>> +               int nr, struct vm_area_struct *vma, unsigned long address);
>>
>> We should update folio_add_new_anon_rmap() to support large() &&
>> !folio_test_pmd_mappable() folios instead.
>>
>> I double checked all places currently using folio_add_new_anon_rmap(),
>> and as expected, none actually allocates large() &&
>> !folio_test_pmd_mappable() and maps it one by one, which makes the
>> cases simpler, i.e.,
>>   if (!large())
>>     // the existing basepage case
>>   else if (!folio_test_pmd_mappable())
>>     // our new case
>>   else
>>     // the existing THP case
> 
> I don't have a strong opinion either way. Happy to go with this suggestion. But
> the reason I did it as a new function was because I was following the pattern in
> [1] which adds a new folio_add_file_rmap_range() function.
> 
> [1] https://lore.kernel.org/linux-mm/20230315051444.3229621-35-willy@infradead.org/
Oh. There is a difference here:
For the page cache, a large folio could have been created by a previous file
access, and a later access by another process may only need to map part of
that large folio. In that case, we do need the _range variant for filemap.

But for anonymous memory, I suppose we always map the whole folio in. So I
agree with Yu: we don't need a _range variant of folio_add_new_anon_rmap().
Thanks.
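
As a tiny illustration of that point (assuming the set_ptes() helper from the
prerequisite series; the names here are just for illustration), the anon fault
path would always map the folio it just allocated in full:

  static void map_new_anon_folio(struct vm_fault *vmf, struct folio *folio,
                                 unsigned long addr, pte_t entry)
  {
          folio_add_new_anon_rmap(folio, vmf->vma, addr);
          folio_add_lru_vma(folio, vmf->vma);
          /* set_ptes() writes consecutive PTEs, advancing the PFN each time. */
          set_ptes(vmf->vma->vm_mm, addr, vmf->pte, entry,
                   folio_nr_pages(folio));
  }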


Regards
Yin, Fengwei

> 
> 
>>
>>>  void page_add_file_rmap(struct page *, struct vm_area_struct *,
>>>                 bool compound);
>>>  void folio_add_file_rmap_range(struct folio *, struct page *, unsigned int nr,
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 1d8369549424..4050bcea7ae7 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1305,6 +1305,49 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>>>         __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>>>  }
>>>
>>> +/**
>>> + * folio_add_new_anon_rmap_range - Add mapping to a set of pages within a new
>>> + * anonymous potentially large folio.
>>> + * @folio:      The folio containing the pages to be mapped
>>> + * @page:       First page in the folio to be mapped
>>> + * @nr:         Number of pages to be mapped
>>> + * @vma:        the vm area in which the mapping is added
>>> + * @address:    the user virtual address of the first page to be mapped
>>> + *
>>> + * Like folio_add_new_anon_rmap() but batch-maps a range of pages within a folio
>>> + * using non-THP accounting. Like folio_add_new_anon_rmap(), the inc-and-test is
>>> + * bypassed and the folio does not have to be locked. All pages in the folio are
>>> + * individually accounted.
>>> + *
>>> + * As the folio is new, it's assumed to be mapped exclusively by a single
>>> + * process.
>>> + */
>>> +void folio_add_new_anon_rmap_range(struct folio *folio, struct page *page,
>>> +               int nr, struct vm_area_struct *vma, unsigned long address)
>>> +{
>>> +       int i;
>>> +
>>> +       VM_BUG_ON_VMA(address < vma->vm_start ||
>>> +                     address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
>>
>> BTW, VM_BUG_ON* shouldn't be used in new code:
>> Documentation/process/coding-style.rst
> 
> Thanks, sorry about that. Was copy-pasting from folio_add_new_anon_rmap().
> 

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 03/10] mm: Introduce try_vma_alloc_movable_folio()
  2023-06-27  7:56         ` Ryan Roberts
  (?)
@ 2023-06-28  2:32           ` Yin Fengwei
  -1 siblings, 0 replies; 148+ messages in thread
From: Yin Fengwei @ 2023-06-28  2:32 UTC (permalink / raw)
  To: Ryan Roberts, Yu Zhao
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, David Hildenbrand, Catalin Marinas,
	Will Deacon, Geert Uytterhoeven, Christian Borntraeger,
	Sven Schnelle, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-kernel, linux-mm, linux-alpha,
	linux-arm-kernel, linux-ia64, linux-m68k, linux-s390



On 6/27/23 15:56, Ryan Roberts wrote:
> On 27/06/2023 06:29, Yu Zhao wrote:
>> On Mon, Jun 26, 2023 at 8:34 PM Yu Zhao <yuzhao@google.com> wrote:
>>>
>>> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> Opportunistically attempt to allocate high-order folios in highmem,
>>>> optionally zeroed. Retry with lower orders all the way to order-0, until
>>>> success. Although, of note, order-1 allocations are skipped since a
>>>> large folio must be at least order-2 to work with the THP machinery. The
>>>> user must check what they got with folio_order().
>>>>
>>>> This will be used to opportunistically allocate large folios for
>>>> anonymous memory with a sensible fallback under memory pressure.
>>>>
>>>> For attempts to allocate non-0 orders, we set __GFP_NORETRY to prevent
>>>> high latency due to reclaim, instead preferring to just try for a lower
>>>> order. The same approach is used by the readahead code when allocating
>>>> large folios.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>>  mm/memory.c | 33 +++++++++++++++++++++++++++++++++
>>>>  1 file changed, 33 insertions(+)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 367bbbb29d91..53896d46e686 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -3001,6 +3001,39 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
>>>>         return 0;
>>>>  }
>>>>
>>>> +static inline struct folio *vma_alloc_movable_folio(struct vm_area_struct *vma,
>>>> +                               unsigned long vaddr, int order, bool zeroed)
>>>> +{
>>>> +       gfp_t gfp = order > 0 ? __GFP_NORETRY | __GFP_NOWARN : 0;
>>>> +
>>>> +       if (zeroed)
>>>> +               return vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order);
>>>> +       else
>>>> +               return vma_alloc_folio(GFP_HIGHUSER_MOVABLE | gfp, order, vma,
>>>> +                                                               vaddr, false);
>>>> +}
>>>> +
>>>> +/*
>>>> + * Opportunistically attempt to allocate high-order folios, retrying with lower
>>>> + * orders all the way to order-0, until success. order-1 allocations are skipped
>>>> + * since a folio must be at least order-2 to work with the THP machinery. The
>>>> + * user must check what they got with folio_order(). vaddr can be any virtual
>>>> + * address that will be mapped by the allocated folio.
>>>> + */
>>>> +static struct folio *try_vma_alloc_movable_folio(struct vm_area_struct *vma,
>>>> +                               unsigned long vaddr, int order, bool zeroed)
>>>> +{
>>>> +       struct folio *folio;
>>>> +
>>>> +       for (; order > 1; order--) {
>>>> +               folio = vma_alloc_movable_folio(vma, vaddr, order, zeroed);
>>>> +               if (folio)
>>>> +                       return folio;
>>>> +       }
>>>> +
>>>> +       return vma_alloc_movable_folio(vma, vaddr, 0, zeroed);
>>>> +}
>>>
>>> I'd drop this patch. Instead, in do_anonymous_page():
>>>
>>>   if (IS_ENABLED(CONFIG_ARCH_WANTS_PTE_ORDER))
>>>     folio = vma_alloc_zeroed_movable_folio(vma, addr,
>>> CONFIG_ARCH_WANTS_PTE_ORDER))
>>>
>>>   if (!folio)
>>>     folio = vma_alloc_zeroed_movable_folio(vma, addr, 0);
>>
>> I meant a runtime function arch_wants_pte_order() (Its default
>> implementation would return 0.)
> 
> There are a bunch of things which you are implying here which I'll try to make
> explicit:
> 
> I think you are implying that we shouldn't retry allocation with intermediate
> orders; but only try the order requested by the arch (arch_wants_pte_order())
> and 0. Correct? For arm64 at least, I would like the VMA's THP hint to be a
> factor in determining the preferred order (see patches 8 and 9). So I would add
> a vma parameter to arch_wants_pte_order() to allow for this.
> 
> For the case where the THP hint is present, then the arch will request 2M (if
> the page size is 16K or 64K). If that fails to allocate, there is still value in
> allocating a 64K folio (which is order 2 in the 16K case). Without the retry
> with intermediate orders logic, we would not get this.
> 
> We can't just blindly allocate a folio of arch_wants_pte_order() size because it
> might overlap with existing populated PTEs, or cross the bounds of the VMA (or a
> number of other things - see calc_anon_folio_order_alloc() in patch 10). Are you
> implying that if there is any kind of issue like this, then we should go
> directly to order 0? I can kind of see the argument from a minimizing
> fragmentation perspective, but for best possible performance I think we are
> better off "packing the bin" with intermediate orders.

One drawback of the retry is that it could introduce large tail latency (from
memory zeroing, memory reclaim, or dealing with existing populated PTEs). That
may not be appreciated by some applications. Thanks.


Regards
Yin, Fengwei

> 
> You're also implying that a runtime arch_wants_pte_order() function is better
> than the Kconfig stuff I did in patch 8. On reflection, I agree with you here. I
> think you mentioned that AMD supports coalescing 8 pages on some CPUs - so you
> would probably want runtime logic to determine if you are on an appropriate AMD
> CPU as part of the decision in that function?
> 
> The real reason for the existence of try_vma_alloc_movable_folio() is that I'm
> reusing it on the other fault paths (which are no longer part of this series).
> But I guess that's not a good reason to keep this until we get to those patches.
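
To make the runtime-hook idea concrete, here is a rough sketch: the generic
default returns 0 as Yu suggests, the vma parameter is the addition Ryan
proposes above, and the call site uses the extended
vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order) signature as posted in
this series. Names and details are illustrative, not final:

        /* Generic fallback; an arch override returns its preferred folio order. */
        #ifndef arch_wants_pte_order
        static inline int arch_wants_pte_order(struct vm_area_struct *vma)
        {
                return 0;
        }
        #endif

        /* In do_anonymous_page(), roughly: */
        struct folio *folio;
        int order = arch_wants_pte_order(vma);
        gfp_t gfp = order ? __GFP_NORETRY | __GFP_NOWARN : 0;

        folio = vma_alloc_zeroed_movable_folio(vma, vmf->address, gfp, order);
        if (!folio && order)
                folio = vma_alloc_zeroed_movable_folio(vma, vmf->address, 0, 0);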

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 06/10] mm: Allow deferred splitting of arbitrary large anon folios
  2023-06-26 17:14   ` Ryan Roberts
  (?)
@ 2023-06-28  2:43     ` Yin Fengwei
  -1 siblings, 0 replies; 148+ messages in thread
From: Yin Fengwei @ 2023-06-28  2:43 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, David Hildenbrand, Yu Zhao, Catalin Marinas,
	Will Deacon, Geert Uytterhoeven, Christian Borntraeger,
	Sven Schnelle, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin
  Cc: linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390



On 6/27/23 01:14, Ryan Roberts wrote:
> With the introduction of large folios for anonymous memory, we would
> like to be able to split them when they have unmapped subpages, in order
> to free those unused pages under memory pressure. So remove the
> artificial requirement that the large folio needed to be at least
> PMD-sized.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>


Regards
Yin, Fengwei

> ---
>  mm/rmap.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index ac1d93d43f2b..3d11c5fb6090 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1567,7 +1567,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
>  		 * page of the folio is unmapped and at least one page
>  		 * is still mapped.
>  		 */
> -		if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
> +		if (folio_test_large(folio) && folio_test_anon(folio))
>  			if (!compound || nr < nr_pmdmapped)
>  				deferred_split_folio(folio);
>  	}

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 01/10] mm: Expose clear_huge_page() unconditionally
  2023-06-27 18:26             ` Yu Zhao
  (?)
@ 2023-06-28 10:56               ` Ryan Roberts
  -1 siblings, 0 replies; 148+ messages in thread
From: Ryan Roberts @ 2023-06-28 10:56 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On 27/06/2023 19:26, Yu Zhao wrote:
> On Tue, Jun 27, 2023 at 3:41 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 27/06/2023 09:29, Yu Zhao wrote:
>>> On Tue, Jun 27, 2023 at 1:21 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 27/06/2023 02:55, Yu Zhao wrote:
>>>>> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> In preparation for extending vma_alloc_zeroed_movable_folio() to
>>>>>> allocate an arbitrary order folio, expose clear_huge_page()
>>>>>> unconditionally, so that it can be used to zero the allocated folio in
>>>>>> the generic implementation of vma_alloc_zeroed_movable_folio().
>>>>>>
>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>> ---
>>>>>>  include/linux/mm.h | 3 ++-
>>>>>>  mm/memory.c        | 2 +-
>>>>>>  2 files changed, 3 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>>>> index 7f1741bd870a..7e3bf45e6491 100644
>>>>>> --- a/include/linux/mm.h
>>>>>> +++ b/include/linux/mm.h
>>>>>> @@ -3684,10 +3684,11 @@ enum mf_action_page_type {
>>>>>>   */
>>>>>>  extern const struct attribute_group memory_failure_attr_group;
>>>>>>
>>>>>> -#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
>>>>>>  extern void clear_huge_page(struct page *page,
>>>>>>                             unsigned long addr_hint,
>>>>>>                             unsigned int pages_per_huge_page);
>>>>>> +
>>>>>> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
>>>>>
>>>>> We might not want to depend on THP eventually. Right now, we still
>>>>> have to, unless splitting is optional, which seems to contradict
>>>>> 06/10. (deferred_split_folio()  is a nop without THP.)
>>>>
>>>> Yes, I agree - for large anon folios to work, we depend on THP. But I don't
>>>> think that helps us here.
>>>>
>>>> In the next patch, I give vma_alloc_zeroed_movable_folio() an extra `order`
>>>> parameter. So the generic/default version of the function now needs a way to
>>>> clear a compound page.
>>>>
>>>> I guess I could do something like:
>>>>
>>>>  static inline
>>>>  struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
>>>>                                    unsigned long vaddr, gfp_t gfp, int order)
>>>>  {
>>>>         struct folio *folio;
>>>>
>>>>         folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE | gfp,
>>>>                                         order, vma, vaddr, false);
>>>>         if (folio) {
>>>> #ifdef CONFIG_LARGE_FOLIO
>>>>                 clear_huge_page(&folio->page, vaddr, 1U << order);
>>>> #else
>>>>                 BUG_ON(order != 0);
>>>>                 clear_user_highpage(&folio->page, vaddr);
>>>> #endif
>>>>         }
>>>>
>>>>         return folio;
>>>>  }
>>>>
>>>> But that's pretty messy, and there's nothing to stop other users coming
>>>> along later, passing order != 0 and being surprised by the BUG_ON.
>>>
>>> #ifdef CONFIG_LARGE_ANON_FOLIO // depends on CONFIG_TRANSPARENT_HUGE_PAGE
>>> struct folio *alloc_anon_folio(struct vm_area_struct *vma, unsigned
>>> long vaddr, int order)
>>> {
>>>   // how do_huge_pmd_anonymous_page() allocs and clears
>>>   vma_alloc_folio(..., *true*);
>>
>> This controls the mem allocation policy (see mempolicy.c::vma_alloc_folio()),
>> not clearing. Clearing is done in __do_huge_pmd_anonymous_page():
>>
>>   clear_huge_page(page, vmf->address, HPAGE_PMD_NR);
> 
> Sorry for rushing this previously. This is what I meant. The #ifdef
> makes it safe to use clear_huge_page() without 01/10. I highlighted
> the last parameter to vma_alloc_folio() only because it's different
> from what you chose (not implying it clears the folio).
>>> }
>>> #else
>>> #define alloc_anon_folio(vma, addr, order)
>>> vma_alloc_zeroed_movable_folio(vma, addr)
>>> #endif
>>
>> Sorry I don't get this at all... If you are suggesting to bypass
>> vma_alloc_zeroed_movable_folio() entirely for the LARGE_ANON_FOLIO case
> 
> Correct.
> 
>> I don't
>> think that works because the arch code adds its own gfp flags there. For
>> example, arm64 adds __GFP_ZEROTAGS for VM_MTE VMAs.
> 
> I think it's the opposite: it should be safer to reuse the THP code because
> 1. It's an existing case that has been working for PMD_ORDER folios
> mapped by PTEs, and it's an arch-independent API which would be easier
> to review.
> 2. Using vma_alloc_zeroed_movable_folio() for large folios is a *new*
> case. It's an arch-*dependent* API, and I have no idea what VM_MTE
> does (or should do) to large folios, nor do I plan to answer that for
> now.

I've done some archaeology on this now, and convinced myself that your suggestion
is a good one - sorry for doubting it!

If you are interested, here are the details: Only arm64 and ia64 do something
non-standard in vma_alloc_zeroed_movable_folio(). ia64 flushes the dcache for
the folio - but since it does not support THP, this is not a problem for the THP
path. arm64 adds the __GFP_ZEROTAGS flag, which means the MTE tags will be
zeroed at the same time as the page is zeroed. This is a perf optimization - if
it's not performed then it will be done at set_pte_at(), which is how this works
for the THP path.

So on that basis, I agree we can use your proposed alloc_anon_folio() approach.
arm64 will lose the MTE optimization but that can be added back later if needed.
So no need to unconditionally expose clear_huge_page() and no need to modify all
the arch vma_alloc_zeroed_movable_folio() implementations.
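
For concreteness, a minimal sketch of that direction, following the shape Yu
suggested above (the gfp details, the naming and the Kconfig dependency are all
placeholders to be settled in v2):

        #ifdef CONFIG_LARGE_ANON_FOLIO /* depends on CONFIG_TRANSPARENT_HUGEPAGE */
        static struct folio *alloc_anon_folio(struct vm_area_struct *vma,
                                              unsigned long vaddr, int order)
        {
                /* Allocate and clear the same way the THP path does. */
                struct folio *folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE |
                                                      __GFP_NORETRY | __GFP_NOWARN,
                                                      order, vma, vaddr, true);

                if (folio)
                        clear_huge_page(&folio->page, vaddr, 1U << order);
                return folio;
        }
        #else
        /* Order is always 0 here, so the existing arch helper is sufficient. */
        #define alloc_anon_folio(vma, vaddr, order) \
                vma_alloc_zeroed_movable_folio(vma, vaddr)
        #endif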

Thanks,
Ryan


> 
>> Perhaps we can do away with an arch-owned vma_alloc_zeroed_movable_folio() and
>> replace it with a new arch_get_zeroed_movable_gfp_flags() then
>> alloc_anon_folio() add in those flags?
>>
>> But I still think the cleanest, simplest change is just to unconditionally
>> expose clear_huge_page() as I've done it.
> 
> The fundamental choice there as I see it is whether the first step
> of large anon folios should lean toward the THP code base or the base
> page code base (I'm a big fan of the answer "Neither -- we should
> create something entirely new instead"). My POV is that the THP code
> base would allow us to move faster, since it's proven to work for a
> very similar case (PMD_ORDER folios mapped by PTEs).


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 03/10] mm: Introduce try_vma_alloc_movable_folio()
  2023-06-28  2:32           ` Yin Fengwei
@ 2023-06-28 11:06             ` Ryan Roberts
  -1 siblings, 0 replies; 148+ messages in thread
From: Ryan Roberts @ 2023-06-28 11:06 UTC (permalink / raw)
  To: Yin Fengwei, Yu Zhao
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, David Hildenbrand, Catalin Marinas,
	Will Deacon, Geert Uytterhoeven, Christian Borntraeger,
	Sven Schnelle, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-kernel, linux-mm, linux-alpha,
	linux-arm-kernel, linux-ia64, linux-m68k, linux-s390

On 28/06/2023 03:32, Yin Fengwei wrote:
> 
> 
> On 6/27/23 15:56, Ryan Roberts wrote:
>> On 27/06/2023 06:29, Yu Zhao wrote:
>>> On Mon, Jun 26, 2023 at 8:34 PM Yu Zhao <yuzhao@google.com> wrote:
>>>>
>>>> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>
>>>>> Opportunistically attempt to allocate high-order folios in highmem,
>>>>> optionally zeroed. Retry with lower orders all the way to order-0, until
>>>>> success. Although, of note, order-1 allocations are skipped since a
>>>>> large folio must be at least order-2 to work with the THP machinery. The
>>>>> user must check what they got with folio_order().
>>>>>
>>>>> This will be used to opportunistically allocate large folios for
>>>>> anonymous memory with a sensible fallback under memory pressure.
>>>>>
>>>>> For attempts to allocate non-0 orders, we set __GFP_NORETRY to prevent
>>>>> high latency due to reclaim, instead preferring to just try for a lower
>>>>> order. The same approach is used by the readahead code when allocating
>>>>> large folios.
>>>>>
>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>> ---
>>>>>  mm/memory.c | 33 +++++++++++++++++++++++++++++++++
>>>>>  1 file changed, 33 insertions(+)
>>>>>
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index 367bbbb29d91..53896d46e686 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -3001,6 +3001,39 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
>>>>>         return 0;
>>>>>  }
>>>>>
>>>>> +static inline struct folio *vma_alloc_movable_folio(struct vm_area_struct *vma,
>>>>> +                               unsigned long vaddr, int order, bool zeroed)
>>>>> +{
>>>>> +       gfp_t gfp = order > 0 ? __GFP_NORETRY | __GFP_NOWARN : 0;
>>>>> +
>>>>> +       if (zeroed)
>>>>> +               return vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order);
>>>>> +       else
>>>>> +               return vma_alloc_folio(GFP_HIGHUSER_MOVABLE | gfp, order, vma,
>>>>> +                                                               vaddr, false);
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * Opportunistically attempt to allocate high-order folios, retrying with lower
>>>>> + * orders all the way to order-0, until success. order-1 allocations are skipped
>>>>> + * since a folio must be at least order-2 to work with the THP machinery. The
>>>>> + * user must check what they got with folio_order(). vaddr can be any virtual
>>>>> + * address that will be mapped by the allocated folio.
>>>>> + */
>>>>> +static struct folio *try_vma_alloc_movable_folio(struct vm_area_struct *vma,
>>>>> +                               unsigned long vaddr, int order, bool zeroed)
>>>>> +{
>>>>> +       struct folio *folio;
>>>>> +
>>>>> +       for (; order > 1; order--) {
>>>>> +               folio = vma_alloc_movable_folio(vma, vaddr, order, zeroed);
>>>>> +               if (folio)
>>>>> +                       return folio;
>>>>> +       }
>>>>> +
>>>>> +       return vma_alloc_movable_folio(vma, vaddr, 0, zeroed);
>>>>> +}
>>>>
>>>> I'd drop this patch. Instead, in do_anonymous_page():
>>>>
>>>>   if (IS_ENABLED(CONFIG_ARCH_WANTS_PTE_ORDER))
>>>>     folio = vma_alloc_zeroed_movable_folio(vma, addr,
>>>> CONFIG_ARCH_WANTS_PTE_ORDER))
>>>>
>>>>   if (!folio)
>>>>     folio = vma_alloc_zeroed_movable_folio(vma, addr, 0);
>>>
>>> I meant a runtime function arch_wants_pte_order() (Its default
>>> implementation would return 0.)
>>
>> There are a bunch of things which you are implying here which I'll try to make
>> explicit:
>>
>> I think you are implying that we shouldn't retry allocation with intermediate
>> orders; but only try the order requested by the arch (arch_wants_pte_order())
>> and 0. Correct? For arm64 at least, I would like the VMA's THP hint to be a
>> factor in determining the preferred order (see patches 8 and 9). So I would add
>> a vma parameter to arch_wants_pte_order() to allow for this.
>>
>> For the case where the THP hint is present, then the arch will request 2M (if
>> the page size is 16K or 64K). If that fails to allocate, there is still value in
>> allocating a 64K folio (which is order 2 in the 16K case). Without the retry
>> with intermediate orders logic, we would not get this.
>>
>> We can't just blindly allocate a folio of arch_wants_pte_order() size because it
>> might overlap with existing populated PTEs, or cross the bounds of the VMA (or a
>> number of other things - see calc_anon_folio_order_alloc() in patch 10). Are you
>> implying that if there is any kind of issue like this, then we should go
>> directly to order 0? I can kind of see the argument from a minimizing
>> fragmentation perspective, but for best possible performance I think we are
>> better off "packing the bin" with intermediate orders.
> 
> One drawback of the retry is that it could introduce large tail latency (from
> memory zeroing, memory reclaim, or dealing with existing populated PTEs). That
> may not be appreciated by some applications. Thanks.

Good point. Based on all the discussion, I think the conclusion is:

 - ask the arch for its preferred folio order via a runtime function
 - check the folio will fit (racy) - if it does not fit, fall back to order-0
 - allocate the folio
 - take the ptl
 - check the folio still fits (not racy) - if it does not fit, fall back to order-0

So in the worst case the latency will be allocating and zeroing a large folio,
then allocating and zeroing an order-0 folio, which is obviously better than
iterating through every order from preferred to 0.

I'll work this flow into a v2.
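
Roughly, that flow looks like the below; helper names such as
arch_wants_pte_order(), alloc_anon_folio() and pte_range_none() follow the
suggestions in this thread, but the exact shape is for v2:

        int order = arch_wants_pte_order(vma);
        unsigned long addr = vmf->address;
        struct folio *folio;

        /* Racy fit check: fall straight back to order-0, no intermediate orders. */
        if (order > 1) {
                addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
                if (addr < vma->vm_start ||
                    addr + (PAGE_SIZE << order) > vma->vm_end) {
                        order = 0;
                        addr = vmf->address;
                }
        } else {
                order = 0;
        }

        folio = alloc_anon_folio(vma, addr, order);
        if (!folio && order) {
                order = 0;
                addr = vmf->address;
                folio = alloc_anon_folio(vma, addr, order);
        }
        if (!folio)
                return VM_FAULT_OOM;

        /*
         * Take the ptl and re-check (not racy) that the PTE range is still
         * empty; if it is not, back out and retry the fault at order-0.
         */
        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
        if (order && !pte_range_none(vmf->pte, 1 << order)) {
                pte_unmap_unlock(vmf->pte, vmf->ptl);
                folio_put(folio);
                /* ...fall back to the order-0 path... */
        }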

> 
> 
> Regards
> Yin, Fengwei
> 
>>
>> You're also implying that a runtime arch_wants_pte_order() function is better
>> than the Kconfig stuff I did in patch 8. On reflection, I agree with you here. I
>> think you mentioned that AMD supports coalescing 8 pages on some CPUs - so you
>> would probably want runtime logic to determine if you are on an appropriate AMD
>> CPU as part of the decision in that function?
>>
>> The real reason for the existence of try_vma_alloc_movable_folio() is that I'm
>> reusing it on the other fault paths (which are no longer part of this series).
>> But I guess that's not a good reason to keep this until we get to those patches.


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 03/10] mm: Introduce try_vma_alloc_movable_folio()
@ 2023-06-28 11:06             ` Ryan Roberts
  0 siblings, 0 replies; 148+ messages in thread
From: Ryan Roberts @ 2023-06-28 11:06 UTC (permalink / raw)
  To: Yin Fengwei, Yu Zhao
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, David Hildenbrand, Catalin Marinas,
	Will Deacon, Geert Uytterhoeven, Christian Borntraeger,
	Sven Schnelle, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-kernel, linux-mm, linux-alpha,
	linux-arm-kernel, linux-ia64, linux-m68k, linux-s390

On 28/06/2023 03:32, Yin Fengwei wrote:
> 
> 
> On 6/27/23 15:56, Ryan Roberts wrote:
>> On 27/06/2023 06:29, Yu Zhao wrote:
>>> On Mon, Jun 26, 2023 at 8:34 PM Yu Zhao <yuzhao@google.com> wrote:
>>>>
>>>> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>
>>>>> Opportunistically attempt to allocate high-order folios in highmem,
>>>>> optionally zeroed. Retry with lower orders all the way to order-0, until
>>>>> success. Although, of note, order-1 allocations are skipped since a
>>>>> large folio must be at least order-2 to work with the THP machinery. The
>>>>> user must check what they got with folio_order().
>>>>>
>>>>> This will be used to opportunistically allocate large folios for
>>>>> anonymous memory with a sensible fallback under memory pressure.
>>>>>
>>>>> For attempts to allocate non-0 orders, we set __GFP_NORETRY to prevent
>>>>> high latency due to reclaim, instead preferring to just try for a lower
>>>>> order. The same approach is used by the readahead code when allocating
>>>>> large folios.
>>>>>
>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>> ---
>>>>>  mm/memory.c | 33 +++++++++++++++++++++++++++++++++
>>>>>  1 file changed, 33 insertions(+)
>>>>>
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index 367bbbb29d91..53896d46e686 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -3001,6 +3001,39 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
>>>>>         return 0;
>>>>>  }
>>>>>
>>>>> +static inline struct folio *vma_alloc_movable_folio(struct vm_area_struct *vma,
>>>>> +                               unsigned long vaddr, int order, bool zeroed)
>>>>> +{
>>>>> +       gfp_t gfp = order > 0 ? __GFP_NORETRY | __GFP_NOWARN : 0;
>>>>> +
>>>>> +       if (zeroed)
>>>>> +               return vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order);
>>>>> +       else
>>>>> +               return vma_alloc_folio(GFP_HIGHUSER_MOVABLE | gfp, order, vma,
>>>>> +                                                               vaddr, false);
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * Opportunistically attempt to allocate high-order folios, retrying with lower
>>>>> + * orders all the way to order-0, until success. order-1 allocations are skipped
>>>>> + * since a folio must be at least order-2 to work with the THP machinery. The
>>>>> + * user must check what they got with folio_order(). vaddr can be any virtual
>>>>> + * address that will be mapped by the allocated folio.
>>>>> + */
>>>>> +static struct folio *try_vma_alloc_movable_folio(struct vm_area_struct *vma,
>>>>> +                               unsigned long vaddr, int order, bool zeroed)
>>>>> +{
>>>>> +       struct folio *folio;
>>>>> +
>>>>> +       for (; order > 1; order--) {
>>>>> +               folio = vma_alloc_movable_folio(vma, vaddr, order, zeroed);
>>>>> +               if (folio)
>>>>> +                       return folio;
>>>>> +       }
>>>>> +
>>>>> +       return vma_alloc_movable_folio(vma, vaddr, 0, zeroed);
>>>>> +}
>>>>
>>>> I'd drop this patch. Instead, in do_anonymous_page():
>>>>
>>>>   if (IS_ENABLED(CONFIG_ARCH_WANTS_PTE_ORDER))
>>>>     folio = vma_alloc_zeroed_movable_folio(vma, addr,
>>>> CONFIG_ARCH_WANTS_PTE_ORDER))
>>>>
>>>>   if (!folio)
>>>>     folio = vma_alloc_zeroed_movable_folio(vma, addr, 0);
>>>
>>> I meant a runtime function arch_wants_pte_order() (Its default
>>> implementation would return 0.)
>>
>> There are a bunch of things which you are implying here which I'll try to make
>> explicit:
>>
>> I think you are implying that we shouldn't retry allocation with intermediate
>> orders; but only try the order requested by the arch (arch_wants_pte_order())
>> and 0. Correct? For arm64 at least, I would like the VMA's THP hint to be a
>> factor in determining the preferred order (see patches 8 and 9). So I would add
>> a vma parameter to arch_wants_pte_order() to allow for this.
>>
>> For the case where the THP hint is present, then the arch will request 2M (if
>> the page size is 16K or 64K). If that fails to allocate, there is still value in
>> allocating a 64K folio (which is order 2 in the 16K case). Without the retry
>> with intermediate orders logic, we would not get this.
>>
>> We can't just blindly allocate a folio of arch_wants_pte_order() size because it
>> might overlap with existing populated PTEs, or cross the bounds of the VMA (or a
>> number of other things - see calc_anon_folio_order_alloc() in patch 10). Are you
>> implying that if there is any kind of issue like this, then we should go
>> directly to order 0? I can kind of see the argument from a minimizing
>> fragmentation perspective, but for best possible performance I think we are
>> better off "packing the bin" with intermediate orders.
> 
> One drawback of the retry is that it could introduce large tail latency (by
> memory zeroing, memory reclaiming or existing populated PTEs). That may not
> be appreciated by some applications. Thanks.

Good point. based on all the discussion, I think the conclusion is:

 - ask the arch to for preferred folio order with runtime function
 - check the folio will fit (racy) - if does not fit fall back to order-0
 - allocate the folio
 - take the ptl
 - check the folio still fits (not racy) - if does not fit fall back to order-0

So in the worst case the latency will be allocating and zeroing a large folio,
then allocating and zeroing an order-0 folio. Which is obviously better than
iterating through every order from preferred to 0.

I'll work this flow into a v2.

> 
> 
> Regards
> Yin, Fengwei
> 
>>
>> You're also implying that a runtime arch_wants_pte_order() function is better
>> than the Kconfig stuff I did in patch 8. On reflection, I agree with you here. I
>> think you mentioned that AMD supports coalescing 8 pages on some CPUs - so you
>> would probably want runtime logic to determine if you are on an appropriate AMD
>> CPU as part of the decision in that function?
>>
>> The real reason for the existance of try_vma_alloc_movable_folio() is that I'm
>> reusing it on the other fault paths (which are no longer part of this series).
>> But I guess that's not a good reason to keep this until we get to those patches.


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 04/10] mm: Implement folio_add_new_anon_rmap_range()
  2023-06-28  2:20         ` Yin Fengwei
  (?)
@ 2023-06-28 11:09           ` Ryan Roberts
  -1 siblings, 0 replies; 148+ messages in thread
From: Ryan Roberts @ 2023-06-28 11:09 UTC (permalink / raw)
  To: Yin Fengwei, Yu Zhao
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, David Hildenbrand, Catalin Marinas,
	Will Deacon, Geert Uytterhoeven, Christian Borntraeger,
	Sven Schnelle, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-kernel, linux-mm, linux-alpha,
	linux-arm-kernel, linux-ia64, linux-m68k, linux-s390

On 28/06/2023 03:20, Yin Fengwei wrote:
> 
> 
> On 6/27/23 16:09, Ryan Roberts wrote:
>> On 27/06/2023 08:08, Yu Zhao wrote:
>>> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> Like folio_add_new_anon_rmap() but batch-rmaps a range of pages
>>>> belonging to a folio, for efficiency savings. All pages are accounted as
>>>> small pages.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>>  include/linux/rmap.h |  2 ++
>>>>  mm/rmap.c            | 43 +++++++++++++++++++++++++++++++++++++++++++
>>>>  2 files changed, 45 insertions(+)
>>>>
>>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>>> index a3825ce81102..15433a3d0cbf 100644
>>>> --- a/include/linux/rmap.h
>>>> +++ b/include/linux/rmap.h
>>>> @@ -196,6 +196,8 @@ void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
>>>>                 unsigned long address);
>>>>  void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>>>>                 unsigned long address);
>>>> +void folio_add_new_anon_rmap_range(struct folio *folio, struct page *page,
>>>> +               int nr, struct vm_area_struct *vma, unsigned long address);
>>>
>>> We should update folio_add_new_anon_rmap() to support large() &&
>>> !folio_test_pmd_mappable() folios instead.
>>>
>>> I double checked all places currently using folio_add_new_anon_rmap(),
>>> and as expected, none actually allocates large() &&
>>> !folio_test_pmd_mappable() and maps it one by one, which makes the
>>> cases simpler, i.e.,
>>>   if (!large())
>>>     // the existing basepage case
>>>   else if (!folio_test_pmd_mappable())
>>>     // our new case
>>>   else
>>>     // the existing THP case
>>
>> I don't have a strong opinion either way. Happy to go with this suggestion. But
>> the reason I did it as a new function was that I was following the pattern in
>> [1], which adds a new folio_add_file_rmap_range() function.
>>
>> [1] https://lore.kernel.org/linux-mm/20230315051444.3229621-35-willy@infradead.org/
> Oh. There is a difference here:
> For page cache, a large folio could be created by a previous file access. But a later
> file access by another process may only need to map part of the large folio. In this
> case, we need _range for filemap.
> 
> But for anonymous memory, I suppose we always map the whole folio in. So I agree with
> Yu. We don't need _range for folio_add_new_anon_rmap(). Thanks.

Yes that makes sense - thanks. I'll merge the new case into
folio_add_new_anon_rmap() for v2.
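
Roughly, the merged function would just grow the middle case Yu sketched;
something like the below (the accounting details are approximate and not the
actual v2 code):

void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
		unsigned long address)
{
	int nr = folio_nr_pages(folio);

	__folio_set_swapbacked(folio);

	if (!folio_test_large(folio)) {
		/* Existing base-page case: single mapcount. */
		atomic_set(&folio->_mapcount, 0);
	} else if (!folio_test_pmd_mappable(folio)) {
		int i;

		/* New case: large but PTE-mapped; account each subpage
		 * individually, as folio_add_new_anon_rmap_range() did. */
		for (i = 0; i < nr; i++)
			atomic_set(&folio_page(folio, i)->_mapcount, 0);
		atomic_set(&folio->_nr_pages_mapped, nr);
	} else {
		/* Existing PMD-mapped THP case. */
		atomic_set(&folio->_entire_mapcount, 0);
		atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);
		__lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
	}

	__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
	__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
}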

> 
> 
> Regards
> Yin, Fengwei
> 
>>
>>
>>>
>>>>  void page_add_file_rmap(struct page *, struct vm_area_struct *,
>>>>                 bool compound);
>>>>  void folio_add_file_rmap_range(struct folio *, struct page *, unsigned int nr,
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 1d8369549424..4050bcea7ae7 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -1305,6 +1305,49 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>>>>         __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>>>>  }
>>>>
>>>> +/**
>>>> + * folio_add_new_anon_rmap_range - Add mapping to a set of pages within a new
>>>> + * anonymous potentially large folio.
>>>> + * @folio:      The folio containing the pages to be mapped
>>>> + * @page:       First page in the folio to be mapped
>>>> + * @nr:         Number of pages to be mapped
>>>> + * @vma:        the vm area in which the mapping is added
>>>> + * @address:    the user virtual address of the first page to be mapped
>>>> + *
>>>> + * Like folio_add_new_anon_rmap() but batch-maps a range of pages within a folio
>>>> + * using non-THP accounting. Like folio_add_new_anon_rmap(), the inc-and-test is
>>>> + * bypassed and the folio does not have to be locked. All pages in the folio are
>>>> + * individually accounted.
>>>> + *
>>>> + * As the folio is new, it's assumed to be mapped exclusively by a single
>>>> + * process.
>>>> + */
>>>> +void folio_add_new_anon_rmap_range(struct folio *folio, struct page *page,
>>>> +               int nr, struct vm_area_struct *vma, unsigned long address)
>>>> +{
>>>> +       int i;
>>>> +
>>>> +       VM_BUG_ON_VMA(address < vma->vm_start ||
>>>> +                     address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
>>>
>>> BTW, VM_BUG_ON* shouldn't be used in new code:
>>> Documentation/process/coding-style.rst
>>
>> Thanks, sorry about that. Was copy-pasting from folio_add_new_anon_rmap().
>>


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 00/10] variable-order, large folios for anonymous memory
  2023-06-27  9:59       ` Ryan Roberts
@ 2023-06-28 18:22         ` Yu Zhao
  -1 siblings, 0 replies; 148+ messages in thread
From: Yu Zhao @ 2023-06-28 18:22 UTC (permalink / raw)
  To: Ryan Roberts, Yin Fengwei
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, David Hildenbrand, Catalin Marinas,
	Will Deacon, Geert Uytterhoeven, Christian Borntraeger,
	Sven Schnelle, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-kernel, linux-mm, linux-alpha,
	linux-arm-kernel, linux-ia64, linux-m68k, linux-s390

On Tue, Jun 27, 2023 at 3:59 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 27/06/2023 08:49, Yu Zhao wrote:
> > On Mon, Jun 26, 2023 at 9:30 PM Yu Zhao <yuzhao@google.com> wrote:
> >>
> >> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>
> >>> Hi All,
> >>>
> >>> Following on from the previous RFCv2 [1], this series implements variable order,
> >>> large folios for anonymous memory. The objective of this is to improve
> >>> performance by allocating larger chunks of memory during anonymous page faults:
> >>>
> >>>  - Since SW (the kernel) is dealing with larger chunks of memory than base
> >>>    pages, there are efficiency savings to be had; fewer page faults, batched PTE
> >>>    and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
> >>>    overhead. This should benefit all architectures.
> >>>  - Since we are now mapping physically contiguous chunks of memory, we can take
> >>>    advantage of HW TLB compression techniques. A reduction in TLB pressure
> >>>    speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
> >>>    TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
> >>>
> >>> This patch set deals with the SW side of things only and based on feedback from
> >>> the RFC, aims to be the most minimal initial change, upon which future
> >>> incremental changes can be added. For this reason, the new behaviour is hidden
> >>> behind a new Kconfig switch, CONFIG_LARGE_ANON_FOLIO, which is disabled by
> >>> default. Although the code has been refactored to parameterize the desired order
> >>> of the allocation, when the feature is disabled (by forcing the order to be
> >>> always 0) my performance tests measure no regression. So I'm hoping this will be
> >>> a suitable mechanism to allow incremental submissions to the kernel without
> >>> affecting the rest of the world.
> >>>
> >>> The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
> >>> [2], which is a hard dependency. I'm not sure of Matthew's exact plans for
> >>> getting that series into the kernel, but I'm hoping we can start the review
> >>> process on this patch set independently. I have a branch at [3].
> >>>
> >>> I've posted a separate series concerning the HW part (contpte mapping) for arm64
> >>> at [4].
> >>>
> >>>
> >>> Performance
> >>> -----------
> >>>
> >>> Below results show 2 benchmarks; kernel compilation and speedometer 2.0 (a
> >>> javascript benchmark running in Chromium). Both cases are running on Ampere
> >>> Altra with 1 NUMA node enabled, Ubuntu 22.04 and XFS filesystem. Each benchmark
> >>> is repeated 15 times over 5 reboots and averaged.
> >>>
> >>> All improvements are relative to baseline-4k. 'anonfolio-basic' is this series.
> >>> 'anonfolio' is the full patch set similar to the RFC with the additional changes
> >>> to the extra 3 fault paths. The rest of the configs are described at [4].
> >>>
> >>> Kernel Compilation (smaller is better):
> >>>
> >>> | kernel          |   real-time |   kern-time |   user-time |
> >>> |:----------------|------------:|------------:|------------:|
> >>> | baseline-4k     |        0.0% |        0.0% |        0.0% |
> >>> | anonfolio-basic |       -5.3% |      -42.9% |       -0.6% |
> >>> | anonfolio       |       -5.4% |      -46.0% |       -0.3% |
> >>> | contpte         |       -6.8% |      -45.7% |       -2.1% |
> >>> | exefolio        |       -8.4% |      -46.4% |       -3.7% |
> >>> | baseline-16k    |       -8.7% |      -49.2% |       -3.7% |
> >>> | baseline-64k    |      -10.5% |      -66.0% |       -3.5% |
> >>>
> >>> Speedometer 2.0 (bigger is better):
> >>>
> >>> | kernel          |   runs_per_min |
> >>> |:----------------|---------------:|
> >>> | baseline-4k     |           0.0% |
> >>> | anonfolio-basic |           0.7% |
> >>> | anonfolio       |           1.2% |
> >>> | contpte         |           3.1% |
> >>> | exefolio        |           4.2% |
> >>> | baseline-16k    |           5.3% |
> >>
> >> Thanks for pushing this forward!
> >>
> >>> Changes since RFCv2
> >>> -------------------
> >>>
> >>>   - Simplified series to bare minimum (on David Hildenbrand's advice)
> >>
> >> My impression is that this series still includes many pieces that can
> >> be split out and discussed separately with followup series.
> >>
> >> (I skipped 04/10 and will look at it tomorrow.)
> >
> > I went through the series twice. Here what I think a bare minimum
> > series (easier to review/debug/land) would look like:

===

> > 1. a new arch specific function providing a preferred order within (0,
> > PMD_ORDER).
> > 2. an extended anon folio alloc API taking that order (02/10, partially).
> > 3. an updated folio_add_new_anon_rmap() covering the large() &&
> > !pmd_mappable() case (similar to 04/10).
> > 4. s/folio_test_pmd_mappable/folio_test_large/ in page_remove_rmap()
> > (06/10, reviewed-by provided).
> > 5. finally, use the extended anon folio alloc API with the arch
> > preferred order in do_anonymous_page() (10/10, partially).

===

> > The rest can be split out into separate series and move forward in
> > parallel with probably a long list of things we need/want to do.
>
> Thanks for the fast review - I really appreciate it!
>
> I've responded to many of your comments. I'd appreciate it if we can close those
> points; then I will work up a v2.

Thanks!

Based on the latest discussion here [1], my original list above can be
optionally reduced to 4 patches: item 2 can be squashed into item 5.

Also please make sure we have only one global (applying to all archs)
Kconfig option, and it should be added in item 5:

  if TRANSPARENT_HUGEPAGE
    config FLEXIBLE/VARIABLE_THP # or whatever name you see fit
  endif

(How many new Kconfig options added within arch/arm64/ is not a concern of MM.)
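
Spelled out in full Kconfig syntax, that single MM-level option might look
something like the below (the FLEXIBLE_THP name and help text are only
placeholders):

  if TRANSPARENT_HUGEPAGE

  config FLEXIBLE_THP
  	bool "Flexible-order (PTE-mapped) transparent hugepages"
  	default n
  	help
  	  Allocate large folios of an arch-preferred, smaller-than-PMD order
  	  to back anonymous memory where possible. This reduces page faults
  	  and other per-page overheads, but the feature is still missing
  	  pieces (e.g. mlock handling), so keep it disabled by default.

  endif # TRANSPARENT_HUGEPAGE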

And please make sure it's disabled by default, because we are still
missing many important functions, e.g., I don't think we can mlock()
when large() && !pmd_mappable(), see mlock_pte_range() and
mlock_vma_folio(). We can fix it along with many things later, but we
need to present a plan and a schedule now. Otherwise, there would be
pushback if we try to land the series without supporting mlock().

Do you or Fengwei plan to take on it? (I personally don't.) If not,
I'll try to find someone from our team to look at it. (It'd be more
scalable if we have a coordinated group of people individually solving
different problems.)

[1] https://lore.kernel.org/r/b2c81404-67df-f841-ef02-919e841f49f2@arm.com/

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 00/10] variable-order, large folios for anonymous memory
  2023-06-28 18:22         ` Yu Zhao
  (?)
@ 2023-06-28 23:59           ` Yin Fengwei
  -1 siblings, 0 replies; 148+ messages in thread
From: Yin Fengwei @ 2023-06-28 23:59 UTC (permalink / raw)
  To: Yu Zhao, Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, David Hildenbrand, Catalin Marinas,
	Will Deacon, Geert Uytterhoeven, Christian Borntraeger,
	Sven Schnelle, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-kernel, linux-mm, linux-alpha,
	linux-arm-kernel, linux-ia64, linux-m68k, linux-s390

Hi Yu,

On 6/29/23 02:22, Yu Zhao wrote:
> And please make sure it's disabled by default, because we are still
> missing many important functions, e.g., I don't think we can mlock()
> when large() && !pmd_mappable(), see mlock_pte_range() and
> mlock_vma_folio(). We can fix it along with many things later, but we
> need to present a plan and a schedule now. Otherwise, there would be
> pushback if we try to land the series without supporting mlock().
> 
> Do you or Fengwei plan to take on it? (I personally don't.) If not,
Do you mean the mlock() with large folio? Yes. I can work on it. Thanks.


Regards
Yin, Fengwei

> I'll try to find someone from our team to look at it. (It'd be more
> scalable if we have a coordinated group of people individually solving
> different problems.)

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 00/10] variable-order, large folios for anonymous memory
  2023-06-28 23:59           ` Yin Fengwei
  (?)
@ 2023-06-29  0:27             ` Yu Zhao
  -1 siblings, 0 replies; 148+ messages in thread
From: Yu Zhao @ 2023-06-29  0:27 UTC (permalink / raw)
  To: Yin Fengwei
  Cc: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, David Hildenbrand, Catalin Marinas,
	Will Deacon, Geert Uytterhoeven, Christian Borntraeger,
	Sven Schnelle, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-kernel, linux-mm, linux-alpha,
	linux-arm-kernel, linux-ia64, linux-m68k, linux-s390

On Wed, Jun 28, 2023 at 5:59 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>
> Hi Yu,
>
> On 6/29/23 02:22, Yu Zhao wrote:
> > And please make sure it's disabled by default, because we are still
> > missing many important functions, e.g., I don't think we can mlock()
> > when large() && !pmd_mappable(), see mlock_pte_range() and
> > mlock_vma_folio(). We can fix it along with many things later, but we
> > need to present a plan and a schedule now. Otherwise, there would be
> > pushback if we try to land the series without supporting mlock().
> >
> > Do you or Fengwei plan to take on it? (I personally don't.) If not,
> Do you mean the mlock() with large folio? Yes. I can work on it. Thanks.

Great. Thanks!

Other places that have a similar problem but are probably easier to
fix than the mlock() case:
* madvise_cold_or_pageout_pte_range()
* shrink_folio_list()

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 00/10] variable-order, large folios for anonymous memory
  2023-06-29  0:27             ` Yu Zhao
  (?)
@ 2023-06-29  0:31               ` Yin Fengwei
  -1 siblings, 0 replies; 148+ messages in thread
From: Yin Fengwei @ 2023-06-29  0:31 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, David Hildenbrand, Catalin Marinas,
	Will Deacon, Geert Uytterhoeven, Christian Borntraeger,
	Sven Schnelle, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-kernel, linux-mm, linux-alpha,
	linux-arm-kernel, linux-ia64, linux-m68k, linux-s390



On 6/29/23 08:27, Yu Zhao wrote:
> On Wed, Jun 28, 2023 at 5:59 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>>
>> Hi Yu,
>>
>> On 6/29/23 02:22, Yu Zhao wrote:
>>> And please make sure it's disabled by default, because we are still
>>> missing many important functions, e.g., I don't think we can mlock()
>>> when large() && !pmd_mappable(), see mlock_pte_range() and
>>> mlock_vma_folio(). We can fix it along with many things later, but we
>>> need to present a plan and a schedule now. Otherwise, there would be
>>> pushback if we try to land the series without supporting mlock().
>>>
>>> Do you or Fengwei plan to take on it? (I personally don't.) If not,
>> Do you mean the mlock() with large folio? Yes. I can work on it. Thanks.
> 
> Great. Thanks!
> 
> Other places that have a similar problem but are probably easier to
> fix than the mlock() case:
> * madvise_cold_or_pageout_pte_range()
This one was on my radar. :). 

Regards
Yin, Fengwei

> * shrink_folio_list()

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 08/10] mm: Kconfig hooks to determine max anon folio allocation order
  2023-06-26 17:14   ` Ryan Roberts
  (?)
@ 2023-06-29  1:38     ` Yang Shi
  -1 siblings, 0 replies; 148+ messages in thread
From: Yang Shi @ 2023-06-29  1:38 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Yu Zhao,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On Mon, Jun 26, 2023 at 10:15 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> For variable-order anonymous folios, we need to determine the order that
> we will allocate. From a SW perspective, the higher the order we
> allocate, the less overhead we will have; fewer faults, fewer folios in
> lists, etc. But of course there will also be more memory wastage as the
> order increases.
>
> From a HW perspective, there are memory block sizes that can be
> beneficial to reducing TLB pressure. arm64, for example, has the ability
> to map "contpte" sized chunks (64K for a 4K base page, 2M for 16K and
> 64K base pages) such that one of these chunks only uses a single TLB
> entry.
>
> So we let the architecture specify the order of the maximally beneficial
> mapping unit when PTE-mapped. Furthermore, because in some cases, this
> order may be quite big (and therefore potentially wasteful of memory),
> allow the arch to specify 2 values; One is the max order for a mapping
> that _would not_ use THP if all size and alignment constraints were met,
> and the other is the max order for a mapping that _would_ use THP if all
> those constraints were met.
>
> Implement this with Kconfig by introducing some new options to allow the
> architecture to declare that it supports large anonymous folios along
> with these 2 preferred max order values. Then introduce a user-facing
> option, LARGE_ANON_FOLIO, which defaults to disabled and can only be
> enabled if the architecture has declared its support. When disabled, it
> forces the max order values, LARGE_ANON_FOLIO_NOTHP_ORDER_MAX and
> LARGE_ANON_FOLIO_THP_ORDER_MAX to 0, meaning only a single page is ever
> allocated.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  mm/Kconfig  | 39 +++++++++++++++++++++++++++++++++++++++
>  mm/memory.c |  8 ++++++++
>  2 files changed, 47 insertions(+)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 7672a22647b4..f4ba48c37b75 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1208,4 +1208,43 @@ config PER_VMA_LOCK
>
>  source "mm/damon/Kconfig"
>
> +config ARCH_SUPPORTS_LARGE_ANON_FOLIO
> +       def_bool n
> +       help
> +         An arch should select this symbol if wants to allow LARGE_ANON_FOLIO
> +         to be enabled. It must also set the following integer values:
> +         - ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
> +         - ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
> +
> +config ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
> +       int
> +       help
> +         The maximum size of folio to allocate for an anonymous VMA PTE-mapping
> +         that does not have the MADV_HUGEPAGE hint set.
> +
> +config ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
> +       int
> +       help
> +         The maximum size of folio to allocate for an anonymous VMA PTE-mapping
> +         that has the MADV_HUGEPAGE hint set.
> +
> +config LARGE_ANON_FOLIO
> +       bool "Allocate large folios for anonymous memory"
> +       depends on ARCH_SUPPORTS_LARGE_ANON_FOLIO
> +       default n
> +       help
> +         Use large (bigger than order-0) folios to back anonymous memory where
> +         possible. This reduces the number of page faults, as well as other
> +         per-page overheads to improve performance for many workloads.
> +
> +config LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
> +       int
> +       default 0 if !LARGE_ANON_FOLIO
> +       default ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
> +
> +config LARGE_ANON_FOLIO_THP_ORDER_MAX
> +       int
> +       default 0 if !LARGE_ANON_FOLIO
> +       default ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
> +

IMHO I don't think we need all of the new kconfigs. Ideally the large
anon folios could be supported by all arches, although some of them
may not benefit from larger TLB entries due to lack of hardware
support.

For now with a minimum implementation, I think you could define a
macro or a function that returns the hardware preferred order.
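
For example (illustrative only; the arm64 value just follows the contpte
geometry described in the commit message above):

/* Generic fallback, e.g. in include/linux/pgtable.h: */
#ifndef arch_wants_pte_order
static inline int arch_wants_pte_order(struct vm_area_struct *vma)
{
	return 0;	/* order-0 only: no behaviour change */
}
#endif

/* arm64 override, e.g. in arch/arm64/include/asm/pgtable.h: */
static inline int arch_wants_pte_order(struct vm_area_struct *vma)
{
	/* contpte chunk: 64K with 4K pages, 2M with 16K/64K pages. */
	return CONT_PTE_SHIFT - PAGE_SHIFT;
}
#define arch_wants_pte_order arch_wants_pte_order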

>  endmenu
> diff --git a/mm/memory.c b/mm/memory.c
> index 9165ed1b9fc2..a8f7e2b28d7a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3153,6 +3153,14 @@ static struct folio *try_vma_alloc_movable_folio(struct vm_area_struct *vma,
>         return vma_alloc_movable_folio(vma, vaddr, 0, zeroed);
>  }
>
> +static inline int max_anon_folio_order(struct vm_area_struct *vma)
> +{
> +       if (hugepage_vma_check(vma, vma->vm_flags, false, true, true))
> +               return CONFIG_LARGE_ANON_FOLIO_THP_ORDER_MAX;
> +       else
> +               return CONFIG_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX;
> +}
> +
>  /*
>   * Handle write page faults for pages that can be reused in the current vma
>   *
> --
> 2.25.1
>
>

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 10/10] mm: Allocate large folios for anonymous memory
  2023-06-26 17:14   ` Ryan Roberts
  (?)
@ 2023-06-29  2:13     ` Yang Shi
  -1 siblings, 0 replies; 148+ messages in thread
From: Yang Shi @ 2023-06-29  2:13 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Yu Zhao,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On Mon, Jun 26, 2023 at 10:15 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> With all of the enabler patches in place, modify the anonymous memory
> write allocation path so that it opportunistically attempts to allocate
> a large folio up to `max_anon_folio_order()` size (This value is
> ultimately configured by the architecture). This reduces the number of
> page faults, reduces the size of (e.g. LRU) lists, and generally
> improves performance by batching what were per-page operations into
> per-(large)-folio operations.
>
> If CONFIG_LARGE_ANON_FOLIO is not enabled (the default) then
> `max_anon_folio_order()` always returns 0, meaning we get the existing
> allocation behaviour.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  mm/memory.c | 159 +++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 144 insertions(+), 15 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index a8f7e2b28d7a..d23c44cc5092 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3161,6 +3161,90 @@ static inline int max_anon_folio_order(struct vm_area_struct *vma)
>                 return CONFIG_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX;
>  }
>
> +/*
> + * Returns index of first pte that is not none, or nr if all are none.
> + */
> +static inline int check_ptes_none(pte_t *pte, int nr)
> +{
> +       int i;
> +
> +       for (i = 0; i < nr; i++) {
> +               if (!pte_none(ptep_get(pte++)))
> +                       return i;
> +       }
> +
> +       return nr;
> +}
> +
> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
> +{
> +       /*
> +        * The aim here is to determine what size of folio we should allocate
> +        * for this fault. Factors include:
> +        * - Order must not be higher than `order` upon entry
> +        * - Folio must be naturally aligned within VA space
> +        * - Folio must not breach boundaries of vma
> +        * - Folio must be fully contained inside one pmd entry
> +        * - Folio must not overlap any non-none ptes
> +        *
> +        * Additionally, we do not allow order-1 since this breaks assumptions
> +        * elsewhere in the mm; THP pages must be at least order-2 (since they
> +        * store state up to the 3rd struct page subpage), and these pages must
> +        * be THP in order to correctly use pre-existing THP infrastructure such
> +        * as folio_split().
> +        *
> +        * As a consequence of relying on the THP infrastructure, if the system
> +        * does not support THP, we always fall back to order-0.
> +        *
> +        * Note that the caller may or may not choose to lock the pte. If
> +        * unlocked, the calculation should be considered an estimate that will
> +        * need to be validated under the lock.
> +        */
> +
> +       struct vm_area_struct *vma = vmf->vma;
> +       int nr;
> +       unsigned long addr;
> +       pte_t *pte;
> +       pte_t *first_set = NULL;
> +       int ret;
> +
> +       if (has_transparent_hugepage()) {
> +               order = min(order, PMD_SHIFT - PAGE_SHIFT);
> +
> +               for (; order > 1; order--) {
> +                       nr = 1 << order;
> +                       addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
> +                       pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
> +
> +                       /* Check vma bounds. */
> +                       if (addr < vma->vm_start ||
> +                           addr + (nr << PAGE_SHIFT) > vma->vm_end)
> +                               continue;
> +
> +                       /* Ptes covered by order already known to be none. */
> +                       if (pte + nr <= first_set)
> +                               break;
> +
> +                       /* Already found set pte in range covered by order. */
> +                       if (pte <= first_set)
> +                               continue;
> +
> +                       /* Need to check if all the ptes are none. */
> +                       ret = check_ptes_none(pte, nr);
> +                       if (ret == nr)
> +                               break;
> +
> +                       first_set = pte + ret;
> +               }
> +
> +               if (order == 1)
> +                       order = 0;
> +       } else
> +               order = 0;
> +
> +       return order;
> +}
> +
>  /*
>   * Handle write page faults for pages that can be reused in the current vma
>   *
> @@ -4201,6 +4285,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>         struct folio *folio;
>         vm_fault_t ret = 0;
>         pte_t entry;
> +       unsigned long addr;
> +       int order = uffd_wp ? 0 : max_anon_folio_order(vma);
> +       int pgcount = BIT(order);
>
>         /* File mapping without ->vm_ops ? */
>         if (vma->vm_flags & VM_SHARED)
> @@ -4242,24 +4329,44 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>                         pte_unmap_unlock(vmf->pte, vmf->ptl);
>                         return handle_userfault(vmf, VM_UFFD_MISSING);
>                 }
> -               goto setpte;
> +               if (uffd_wp)
> +                       entry = pte_mkuffd_wp(entry);
> +               set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> +
> +               /* No need to invalidate - it was non-present before */
> +               update_mmu_cache(vma, vmf->address, vmf->pte);
> +               goto unlock;
>         }
>
> -       /* Allocate our own private page. */
> +retry:
> +       /*
> +        * Estimate the folio order to allocate. We are not under the ptl here
> +        * so this estimate needs to be re-checked later once we have the lock.
> +        */
> +       vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
> +       order = calc_anon_folio_order_alloc(vmf, order);
> +       pte_unmap(vmf->pte);
> +
> +       /* Allocate our own private folio. */
>         if (unlikely(anon_vma_prepare(vma)))
>                 goto oom;
> -       folio = vma_alloc_zeroed_movable_folio(vma, vmf->address, 0, 0);
> +       folio = try_vma_alloc_movable_folio(vma, vmf->address, order, true);
>         if (!folio)
>                 goto oom;
>
> +       /* We may have been granted less than we asked for. */
> +       order = folio_order(folio);
> +       pgcount = BIT(order);
> +       addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
> +
>         if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>                 goto oom_free_page;
>         folio_throttle_swaprate(folio, GFP_KERNEL);
>
>         /*
>          * The memory barrier inside __folio_mark_uptodate makes sure that
> -        * preceding stores to the page contents become visible before
> -        * the set_pte_at() write.
> +        * preceding stores to the folio contents become visible before
> +        * the set_ptes() write.
>          */
>         __folio_mark_uptodate(folio);
>
> @@ -4268,11 +4375,31 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>         if (vma->vm_flags & VM_WRITE)
>                 entry = pte_mkwrite(pte_mkdirty(entry));
>
> -       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> -                       &vmf->ptl);
> -       if (vmf_pte_changed(vmf)) {
> -               update_mmu_tlb(vma, vmf->address, vmf->pte);
> -               goto release;
> +       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
> +
> +       /*
> +        * Ensure our estimate above is still correct; we could have raced with
> +        * another thread to service a fault in the region.
> +        */
> +       if (order == 0) {
> +               if (vmf_pte_changed(vmf)) {
> +                       update_mmu_tlb(vma, vmf->address, vmf->pte);
> +                       goto release;
> +               }
> +       } else if (check_ptes_none(vmf->pte, pgcount) != pgcount) {
> +               pte_t *pte = vmf->pte + ((vmf->address - addr) >> PAGE_SHIFT);
> +
> +               /* If faulting pte was allocated by another, exit early. */
> +               if (!pte_none(ptep_get(pte))) {
> +                       update_mmu_tlb(vma, vmf->address, pte);
> +                       goto release;
> +               }
> +
> +               /* Else try again, with a lower order. */
> +               pte_unmap_unlock(vmf->pte, vmf->ptl);
> +               folio_put(folio);
> +               order--;
> +               goto retry;

I'm not sure whether this extra fallback logic is worth it. Do you have
any benchmark data, or is it just an arbitrary design choice? If it is
arbitrary, I'd prefer the simplest approach: just exit the page fault
handler, as the order-0 path does, IMHO.
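
For reference, the simpler variant I have in mind would be roughly the
below against the quoted hunk (untested sketch; keep the early-exit
check but drop the retry loop entirely):

	} else if (check_ptes_none(vmf->pte, pgcount) != pgcount) {
		pte_t *pte = vmf->pte + ((vmf->address - addr) >> PAGE_SHIFT);

		/*
		 * Someone else faulted in part of the range: drop the large
		 * folio and exit. If our pte is still none, the fault is
		 * simply retaken.
		 */
		if (!pte_none(ptep_get(pte)))
			update_mmu_tlb(vma, vmf->address, pte);
		goto release;
	}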

>         }
>
>         ret = check_stable_address_space(vma->vm_mm);
> @@ -4286,16 +4413,18 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>                 return handle_userfault(vmf, VM_UFFD_MISSING);
>         }
>
> -       inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> -       folio_add_new_anon_rmap(folio, vma, vmf->address);
> +       folio_ref_add(folio, pgcount - 1);
> +
> +       add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
> +       folio_add_new_anon_rmap_range(folio, &folio->page, pgcount, vma, addr);
>         folio_add_lru_vma(folio, vma);
> -setpte:
> +
>         if (uffd_wp)
>                 entry = pte_mkuffd_wp(entry);
> -       set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> +       set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
>
>         /* No need to invalidate - it was non-present before */
> -       update_mmu_cache(vma, vmf->address, vmf->pte);
> +       update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
>  unlock:
>         pte_unmap_unlock(vmf->pte, vmf->ptl);
>         return ret;
> --
> 2.25.1
>
>

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 00/10] variable-order, large folios for anonymous memory
  2023-06-27  7:49     ` Yu Zhao
  (?)
@ 2023-06-29  2:21       ` Yang Shi
  -1 siblings, 0 replies; 148+ messages in thread
From: Yang Shi @ 2023-06-29  2:21 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On Tue, Jun 27, 2023 at 12:49 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Mon, Jun 26, 2023 at 9:30 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> > >
> > > Hi All,
> > >
> > > Following on from the previous RFCv2 [1], this series implements variable order,
> > > large folios for anonymous memory. The objective of this is to improve
> > > performance by allocating larger chunks of memory during anonymous page faults:
> > >
> > >  - Since SW (the kernel) is dealing with larger chunks of memory than base
> > >    pages, there are efficiency savings to be had; fewer page faults, batched PTE
> > >    and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
> > >    overhead. This should benefit all architectures.
> > >  - Since we are now mapping physically contiguous chunks of memory, we can take
> > >    advantage of HW TLB compression techniques. A reduction in TLB pressure
> > >    speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
> > >    TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
> > >
> > > This patch set deals with the SW side of things only and based on feedback from
> > > the RFC, aims to be the most minimal initial change, upon which future
> > > incremental changes can be added. For this reason, the new behaviour is hidden
> > > behind a new Kconfig switch, CONFIG_LARGE_ANON_FOLIO, which is disabled by
> > > default. Although the code has been refactored to parameterize the desired order
> > > of the allocation, when the feature is disabled (by forcing the order to be
> > > always 0) my performance tests measure no regression. So I'm hoping this will be
> > > a suitable mechanism to allow incremental submissions to the kernel without
> > > affecting the rest of the world.
> > >
> > > The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
> > > [2], which is a hard dependency. I'm not sure of Matthew's exact plans for
> > > getting that series into the kernel, but I'm hoping we can start the review
> > > process on this patch set independently. I have a branch at [3].
> > >
> > > I've posted a separate series concerning the HW part (contpte mapping) for arm64
> > > at [4].
> > >
> > >
> > > Performance
> > > -----------
> > >
> > > Below results show 2 benchmarks; kernel compilation and speedometer 2.0 (a
> > > javascript benchmark running in Chromium). Both cases are running on Ampere
> > > Altra with 1 NUMA node enabled, Ubuntu 22.04 and XFS filesystem. Each benchmark
> > > is repeated 15 times over 5 reboots and averaged.
> > >
> > > All improvements are relative to baseline-4k. 'anonfolio-basic' is this series.
> > > 'anonfolio' is the full patch set similar to the RFC with the additional changes
> > > to the extra 3 fault paths. The rest of the configs are described at [4].
> > >
> > > Kernel Compilation (smaller is better):
> > >
> > > | kernel          |   real-time |   kern-time |   user-time |
> > > |:----------------|------------:|------------:|------------:|
> > > | baseline-4k     |        0.0% |        0.0% |        0.0% |
> > > | anonfolio-basic |       -5.3% |      -42.9% |       -0.6% |
> > > | anonfolio       |       -5.4% |      -46.0% |       -0.3% |
> > > | contpte         |       -6.8% |      -45.7% |       -2.1% |
> > > | exefolio        |       -8.4% |      -46.4% |       -3.7% |
> > > | baseline-16k    |       -8.7% |      -49.2% |       -3.7% |
> > > | baseline-64k    |      -10.5% |      -66.0% |       -3.5% |
> > >
> > > Speedometer 2.0 (bigger is better):
> > >
> > > | kernel          |   runs_per_min |
> > > |:----------------|---------------:|
> > > | baseline-4k     |           0.0% |
> > > | anonfolio-basic |           0.7% |
> > > | anonfolio       |           1.2% |
> > > | contpte         |           3.1% |
> > > | exefolio        |           4.2% |
> > > | baseline-16k    |           5.3% |
> >
> > Thanks for pushing this forward!
> >
> > > Changes since RFCv2
> > > -------------------
> > >
> > >   - Simplified series to bare minimum (on David Hildenbrand's advice)
> >
> > My impression is that this series still includes many pieces that can
> > be split out and discussed separately with followup series.
> >
> > (I skipped 04/10 and will look at it tomorrow.)
>
> I went through the series twice. Here is what I think a bare minimum
> series (easier to review/debug/land) would look like:
> 1. a new arch specific function providing a preferred order within (0,
> PMD_ORDER).
> 2. an extended anon folio alloc API taking that order (02/10, partially).
> 3. an updated folio_add_new_anon_rmap() covering the large() &&
> !pmd_mappable() case (similar to 04/10).
> 4. s/folio_test_pmd_mappable/folio_test_large/ in page_remove_rmap()
> (06/10, reviewed-by provided).
> 5. finally, use the extended anon folio alloc API with the arch
> preferred order in do_anonymous_page() (10/10, partially).
>
> The rest can be split out into separate series and move forward in
> parallel with probably a long list of things we need/want to do.

Yeah, the suggestion makes sense to me. For the time being I'd like to
go with the simplest approach unless there is strong justification for
the extra optimization, IMHO.

>

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 00/10] variable-order, large folios for anonymous memory
@ 2023-06-29  2:21       ` Yang Shi
  0 siblings, 0 replies; 148+ messages in thread
From: Yang Shi @ 2023-06-29  2:21 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On Tue, Jun 27, 2023 at 12:49 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Mon, Jun 26, 2023 at 9:30 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> > >
> > > Hi All,
> > >
> > > Following on from the previous RFCv2 [1], this series implements variable order,
> > > large folios for anonymous memory. The objective of this is to improve
> > > performance by allocating larger chunks of memory during anonymous page faults:
> > >
> > >  - Since SW (the kernel) is dealing with larger chunks of memory than base
> > >    pages, there are efficiency savings to be had; fewer page faults, batched PTE
> > >    and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
> > >    overhead. This should benefit all architectures.
> > >  - Since we are now mapping physically contiguous chunks of memory, we can take
> > >    advantage of HW TLB compression techniques. A reduction in TLB pressure
> > >    speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
> > >    TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
> > >
> > > This patch set deals with the SW side of things only and based on feedback from
> > > the RFC, aims to be the most minimal initial change, upon which future
> > > incremental changes can be added. For this reason, the new behaviour is hidden
> > > behind a new Kconfig switch, CONFIG_LARGE_ANON_FOLIO, which is disabled by
> > > default. Although the code has been refactored to parameterize the desired order
> > > of the allocation, when the feature is disabled (by forcing the order to be
> > > always 0) my performance tests measure no regression. So I'm hoping this will be
> > > a suitable mechanism to allow incremental submissions to the kernel without
> > > affecting the rest of the world.
> > >
> > > The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
> > > [2], which is a hard dependency. I'm not sure of Matthew's exact plans for
> > > getting that series into the kernel, but I'm hoping we can start the review
> > > process on this patch set independently. I have a branch at [3].
> > >
> > > I've posted a separate series concerning the HW part (contpte mapping) for arm64
> > > at [4].
> > >
> > >
> > > Performance
> > > -----------
> > >
> > > Below results show 2 benchmarks; kernel compilation and speedometer 2.0 (a
> > > javascript benchmark running in Chromium). Both cases are running on Ampere
> > > Altra with 1 NUMA node enabled, Ubuntu 22.04 and XFS filesystem. Each benchmark
> > > is repeated 15 times over 5 reboots and averaged.
> > >
> > > All improvements are relative to baseline-4k. 'anonfolio-basic' is this series.
> > > 'anonfolio' is the full patch set similar to the RFC with the additional changes
> > > to the extra 3 fault paths. The rest of the configs are described at [4].
> > >
> > > Kernel Compilation (smaller is better):
> > >
> > > | kernel          |   real-time |   kern-time |   user-time |
> > > |:----------------|------------:|------------:|------------:|
> > > | baseline-4k     |        0.0% |        0.0% |        0.0% |
> > > | anonfolio-basic |       -5.3% |      -42.9% |       -0.6% |
> > > | anonfolio       |       -5.4% |      -46.0% |       -0.3% |
> > > | contpte         |       -6.8% |      -45.7% |       -2.1% |
> > > | exefolio        |       -8.4% |      -46.4% |       -3.7% |
> > > | baseline-16k    |       -8.7% |      -49.2% |       -3.7% |
> > > | baseline-64k    |      -10.5% |      -66.0% |       -3.5% |
> > >
> > > Speedometer 2.0 (bigger is better):
> > >
> > > | kernel          |   runs_per_min |
> > > |:----------------|---------------:|
> > > | baseline-4k     |           0.0% |
> > > | anonfolio-basic |           0.7% |
> > > | anonfolio       |           1.2% |
> > > | contpte         |           3.1% |
> > > | exefolio        |           4.2% |
> > > | baseline-16k    |           5.3% |
> >
> > Thanks for pushing this forward!
> >
> > > Changes since RFCv2
> > > -------------------
> > >
> > >   - Simplified series to bare minimum (on David Hildenbrand's advice)
> >
> > My impression is that this series still includes many pieces that can
> > be split out and discussed separately with followup series.
> >
> > (I skipped 04/10 and will look at it tomorrow.)
>
> I went through the series twice. Here what I think a bare minimum
> series (easier to review/debug/land) would look like:
> 1. a new arch specific function providing a prefered order within (0,
> PMD_ORDER).
> 2. an extended anon folio alloc API taking that order (02/10, partially).
> 3. an updated folio_add_new_anon_rmap() covering the large() &&
> !pmd_mappable() case (similar to 04/10).
> 4. s/folio_test_pmd_mappable/folio_test_large/ in page_remove_rmap()
> (06/10, reviewed-by provided).
> 5. finally, use the extended anon folio alloc API with the arch
> preferred order in do_anonymous_page() (10/10, partially).
>
> The rest can be split out into separate series and move forward in
> parallel with probably a long list of things we need/want to do.

Yeah, the suggestion makes sense to me. And I'd like to go with the
simplest way unless there is strong justification for extra
optimization for the time being IMHO.

>

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 00/10] variable-order, large folios for anonymous memory
@ 2023-06-29  2:21       ` Yang Shi
  0 siblings, 0 replies; 148+ messages in thread
From: Yang Shi @ 2023-06-29  2:21 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel, linux-i

On Tue, Jun 27, 2023 at 12:49 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Mon, Jun 26, 2023 at 9:30 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> > >
> > > Hi All,
> > >
> > > Following on from the previous RFCv2 [1], this series implements variable order,
> > > large folios for anonymous memory. The objective of this is to improve
> > > performance by allocating larger chunks of memory during anonymous page faults:
> > >
> > >  - Since SW (the kernel) is dealing with larger chunks of memory than base
> > >    pages, there are efficiency savings to be had; fewer page faults, batched PTE
> > >    and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
> > >    overhead. This should benefit all architectures.
> > >  - Since we are now mapping physically contiguous chunks of memory, we can take
> > >    advantage of HW TLB compression techniques. A reduction in TLB pressure
> > >    speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
> > >    TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
> > >
> > > This patch set deals with the SW side of things only and based on feedback from
> > > the RFC, aims to be the most minimal initial change, upon which future
> > > incremental changes can be added. For this reason, the new behaviour is hidden
> > > behind a new Kconfig switch, CONFIG_LARGE_ANON_FOLIO, which is disabled by
> > > default. Although the code has been refactored to parameterize the desired order
> > > of the allocation, when the feature is disabled (by forcing the order to be
> > > always 0) my performance tests measure no regression. So I'm hoping this will be
> > > a suitable mechanism to allow incremental submissions to the kernel without
> > > affecting the rest of the world.
> > >
> > > The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
> > > [2], which is a hard dependency. I'm not sure of Matthew's exact plans for
> > > getting that series into the kernel, but I'm hoping we can start the review
> > > process on this patch set independently. I have a branch at [3].
> > >
> > > I've posted a separate series concerning the HW part (contpte mapping) for arm64
> > > at [4].
> > >
> > >
> > > Performance
> > > -----------
> > >
> > > Below results show 2 benchmarks; kernel compilation and speedometer 2.0 (a
> > > javascript benchmark running in Chromium). Both cases are running on Ampere
> > > Altra with 1 NUMA node enabled, Ubuntu 22.04 and XFS filesystem. Each benchmark
> > > is repeated 15 times over 5 reboots and averaged.
> > >
> > > All improvements are relative to baseline-4k. 'anonfolio-basic' is this series.
> > > 'anonfolio' is the full patch set similar to the RFC with the additional changes
> > > to the extra 3 fault paths. The rest of the configs are described at [4].
> > >
> > > Kernel Compilation (smaller is better):
> > >
> > > | kernel          |   real-time |   kern-time |   user-time |
> > > |:----------------|------------:|------------:|------------:|
> > > | baseline-4k     |        0.0% |        0.0% |        0.0% |
> > > | anonfolio-basic |       -5.3% |      -42.9% |       -0.6% |
> > > | anonfolio       |       -5.4% |      -46.0% |       -0.3% |
> > > | contpte         |       -6.8% |      -45.7% |       -2.1% |
> > > | exefolio        |       -8.4% |      -46.4% |       -3.7% |
> > > | baseline-16k    |       -8.7% |      -49.2% |       -3.7% |
> > > | baseline-64k    |      -10.5% |      -66.0% |       -3.5% |
> > >
> > > Speedometer 2.0 (bigger is better):
> > >
> > > | kernel          |   runs_per_min |
> > > |:----------------|---------------:|
> > > | baseline-4k     |           0.0% |
> > > | anonfolio-basic |           0.7% |
> > > | anonfolio       |           1.2% |
> > > | contpte         |           3.1% |
> > > | exefolio        |           4.2% |
> > > | baseline-16k    |           5.3% |
> >
> > Thanks for pushing this forward!
> >
> > > Changes since RFCv2
> > > -------------------
> > >
> > >   - Simplified series to bare minimum (on David Hildenbrand's advice)
> >
> > My impression is that this series still includes many pieces that can
> > be split out and discussed separately in follow-up series.
> >
> > (I skipped 04/10 and will look at it tomorrow.)
>
> I went through the series twice. Here is what I think a bare-minimum
> series (easier to review/debug/land) would look like:
> 1. a new arch-specific function providing a preferred order within (0,
> PMD_ORDER).
> 2. an extended anon folio alloc API taking that order (02/10, partially).
> 3. an updated folio_add_new_anon_rmap() covering the large() &&
> !pmd_mappable() case (similar to 04/10).
> 4. s/folio_test_pmd_mappable/folio_test_large/ in page_remove_rmap()
> (06/10, reviewed-by provided).
> 5. finally, use the extended anon folio alloc API with the arch
> preferred order in do_anonymous_page() (10/10, partially).
>
> The rest can be split out into separate series and move forward in
> parallel with probably a long list of things we need/want to do.

Yeah, the suggestion makes sense to me. For the time being I'd like to go
with the simplest approach unless there is strong justification for extra
optimization, IMHO.
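
As a rough illustration only (not code from this series): item 1 above could
be a hook such as arch_wants_pte_order() - the name that comes up later in
this thread - with a generic order-0 default and a per-arch override. The
placement and the arm64 value below are assumptions for discussion, not the
v2 implementation:

    /* Generic default (e.g. in include/linux/pgtable.h): no preference
     * beyond order-0 unless the arch overrides it. */
    #ifndef arch_wants_pte_order
    static inline int arch_wants_pte_order(void)
    {
            return 0;
    }
    #endif

    /* Possible arm64 override (arch/arm64/include/asm/pgtable.h): prefer
     * the contpte block size, i.e. 64K => order 4 with 4K base pages. */
    #define arch_wants_pte_order()  (CONT_PTE_SHIFT - PAGE_SHIFT)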

>

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 10/10] mm: Allocate large folios for anonymous memory
  2023-06-29  2:13     ` Yang Shi
  (?)
@ 2023-06-29 11:30       ` Ryan Roberts
  -1 siblings, 0 replies; 148+ messages in thread
From: Ryan Roberts @ 2023-06-29 11:30 UTC (permalink / raw)
  To: Yang Shi
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Yu Zhao,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On 29/06/2023 03:13, Yang Shi wrote:
> On Mon, Jun 26, 2023 at 10:15 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> With all of the enabler patches in place, modify the anonymous memory
>> write allocation path so that it opportunistically attempts to allocate
>> a large folio up to `max_anon_folio_order()` size (this value is
>> ultimately configured by the architecture). This reduces the number of
>> page faults, reduces the size of (e.g. LRU) lists, and generally
>> improves performance by batching what were per-page operations into
>> per-(large)-folio operations.
>>
>> If CONFIG_LARGE_ANON_FOLIO is not enabled (the default) then
>> `max_anon_folio_order()` always returns 0, meaning we get the existing
>> allocation behaviour.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  mm/memory.c | 159 +++++++++++++++++++++++++++++++++++++++++++++++-----
>>  1 file changed, 144 insertions(+), 15 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index a8f7e2b28d7a..d23c44cc5092 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3161,6 +3161,90 @@ static inline int max_anon_folio_order(struct vm_area_struct *vma)
>>                 return CONFIG_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX;
>>  }
>>
>> +/*
>> + * Returns index of first pte that is not none, or nr if all are none.
>> + */
>> +static inline int check_ptes_none(pte_t *pte, int nr)
>> +{
>> +       int i;
>> +
>> +       for (i = 0; i < nr; i++) {
>> +               if (!pte_none(ptep_get(pte++)))
>> +                       return i;
>> +       }
>> +
>> +       return nr;
>> +}
>> +
>> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
>> +{
>> +       /*
>> +        * The aim here is to determine what size of folio we should allocate
>> +        * for this fault. Factors include:
>> +        * - Order must not be higher than `order` upon entry
>> +        * - Folio must be naturally aligned within VA space
>> +        * - Folio must not breach boundaries of vma
>> +        * - Folio must be fully contained inside one pmd entry
>> +        * - Folio must not overlap any non-none ptes
>> +        *
>> +        * Additionally, we do not allow order-1 since this breaks assumptions
>> +        * elsewhere in the mm; THP pages must be at least order-2 (since they
>> +        * store state up to the 3rd struct page subpage), and these pages must
>> +        * be THP in order to correctly use pre-existing THP infrastructure such
>> +        * as folio_split().
>> +        *
>> +        * As a consequence of relying on the THP infrastructure, if the system
>> +        * does not support THP, we always fallback to order-0.
>> +        *
>> +        * Note that the caller may or may not choose to lock the pte. If
>> +        * unlocked, the calculation should be considered an estimate that will
>> +        * need to be validated under the lock.
>> +        */
>> +
>> +       struct vm_area_struct *vma = vmf->vma;
>> +       int nr;
>> +       unsigned long addr;
>> +       pte_t *pte;
>> +       pte_t *first_set = NULL;
>> +       int ret;
>> +
>> +       if (has_transparent_hugepage()) {
>> +               order = min(order, PMD_SHIFT - PAGE_SHIFT);
>> +
>> +               for (; order > 1; order--) {
>> +                       nr = 1 << order;
>> +                       addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
>> +                       pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
>> +
>> +                       /* Check vma bounds. */
>> +                       if (addr < vma->vm_start ||
>> +                           addr + (nr << PAGE_SHIFT) > vma->vm_end)
>> +                               continue;
>> +
>> +                       /* Ptes covered by order already known to be none. */
>> +                       if (pte + nr <= first_set)
>> +                               break;
>> +
>> +                       /* Already found set pte in range covered by order. */
>> +                       if (pte <= first_set)
>> +                               continue;
>> +
>> +                       /* Need to check if all the ptes are none. */
>> +                       ret = check_ptes_none(pte, nr);
>> +                       if (ret == nr)
>> +                               break;
>> +
>> +                       first_set = pte + ret;
>> +               }
>> +
>> +               if (order == 1)
>> +                       order = 0;
>> +       } else
>> +               order = 0;
>> +
>> +       return order;
>> +}
>> +
>>  /*
>>   * Handle write page faults for pages that can be reused in the current vma
>>   *
>> @@ -4201,6 +4285,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>         struct folio *folio;
>>         vm_fault_t ret = 0;
>>         pte_t entry;
>> +       unsigned long addr;
>> +       int order = uffd_wp ? 0 : max_anon_folio_order(vma);
>> +       int pgcount = BIT(order);
>>
>>         /* File mapping without ->vm_ops ? */
>>         if (vma->vm_flags & VM_SHARED)
>> @@ -4242,24 +4329,44 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>                         pte_unmap_unlock(vmf->pte, vmf->ptl);
>>                         return handle_userfault(vmf, VM_UFFD_MISSING);
>>                 }
>> -               goto setpte;
>> +               if (uffd_wp)
>> +                       entry = pte_mkuffd_wp(entry);
>> +               set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>> +
>> +               /* No need to invalidate - it was non-present before */
>> +               update_mmu_cache(vma, vmf->address, vmf->pte);
>> +               goto unlock;
>>         }
>>
>> -       /* Allocate our own private page. */
>> +retry:
>> +       /*
>> +        * Estimate the folio order to allocate. We are not under the ptl here
>> +        * so this estimate needs to be re-checked later once we have the lock.
>> +        */
>> +       vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>> +       order = calc_anon_folio_order_alloc(vmf, order);
>> +       pte_unmap(vmf->pte);
>> +
>> +       /* Allocate our own private folio. */
>>         if (unlikely(anon_vma_prepare(vma)))
>>                 goto oom;
>> -       folio = vma_alloc_zeroed_movable_folio(vma, vmf->address, 0, 0);
>> +       folio = try_vma_alloc_movable_folio(vma, vmf->address, order, true);
>>         if (!folio)
>>                 goto oom;
>>
>> +       /* We may have been granted less than we asked for. */
>> +       order = folio_order(folio);
>> +       pgcount = BIT(order);
>> +       addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
>> +
>>         if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>>                 goto oom_free_page;
>>         folio_throttle_swaprate(folio, GFP_KERNEL);
>>
>>         /*
>>          * The memory barrier inside __folio_mark_uptodate makes sure that
>> -        * preceding stores to the page contents become visible before
>> -        * the set_pte_at() write.
>> +        * preceding stores to the folio contents become visible before
>> +        * the set_ptes() write.
>>          */
>>         __folio_mark_uptodate(folio);
>>
>> @@ -4268,11 +4375,31 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>         if (vma->vm_flags & VM_WRITE)
>>                 entry = pte_mkwrite(pte_mkdirty(entry));
>>
>> -       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>> -                       &vmf->ptl);
>> -       if (vmf_pte_changed(vmf)) {
>> -               update_mmu_tlb(vma, vmf->address, vmf->pte);
>> -               goto release;
>> +       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>> +
>> +       /*
>> +        * Ensure our estimate above is still correct; we could have raced with
>> +        * another thread to service a fault in the region.
>> +        */
>> +       if (order == 0) {
>> +               if (vmf_pte_changed(vmf)) {
>> +                       update_mmu_tlb(vma, vmf->address, vmf->pte);
>> +                       goto release;
>> +               }
>> +       } else if (check_ptes_none(vmf->pte, pgcount) != pgcount) {
>> +               pte_t *pte = vmf->pte + ((vmf->address - addr) >> PAGE_SHIFT);
>> +
>> +               /* If faulting pte was allocated by another, exit early. */
>> +               if (!pte_none(ptep_get(pte))) {
>> +                       update_mmu_tlb(vma, vmf->address, pte);
>> +                       goto release;
>> +               }
>> +
>> +               /* Else try again, with a lower order. */
>> +               pte_unmap_unlock(vmf->pte, vmf->ptl);
>> +               folio_put(folio);
>> +               order--;
>> +               goto retry;
> 
> I'm not sure whether this extra fallback logic is worth it or not. Do
> you have any benchmark data or is it just an arbitrary design choice?
> If it is just an arbitrary design choice, I'd like to go with the
> simplest way by just exiting page fault handler, just like the
> order-0, IMHO.

Yes, it's an arbitrary design choice. Based on Yu Zhao's feedback, I'm already
reworking this so that we only try the preferred order and order-0, no longer
iterating through intermediate orders.

I think what you are suggesting is that if, while attempting to allocate the
preferred order, we find we have raced and the folio would now overlap
populated ptes (but the faulting pte is still empty), we should just exit and
rely on the page fault being re-triggered, rather than immediately falling
back to order-0?

The reason I didn't do that is that I wasn't sure whether the return path
assumes the faulting pte is valid when no error is returned. Another option
is to return VM_FAULT_RETRY, but doing the retry directly here seemed
cleaner. What do you suggest?

Thanks,
Ryan



> 
>>         }
>>
>>         ret = check_stable_address_space(vma->vm_mm);
>> @@ -4286,16 +4413,18 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>                 return handle_userfault(vmf, VM_UFFD_MISSING);
>>         }
>>
>> -       inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>> -       folio_add_new_anon_rmap(folio, vma, vmf->address);
>> +       folio_ref_add(folio, pgcount - 1);
>> +
>> +       add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
>> +       folio_add_new_anon_rmap_range(folio, &folio->page, pgcount, vma, addr);
>>         folio_add_lru_vma(folio, vma);
>> -setpte:
>> +
>>         if (uffd_wp)
>>                 entry = pte_mkuffd_wp(entry);
>> -       set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>> +       set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
>>
>>         /* No need to invalidate - it was non-present before */
>> -       update_mmu_cache(vma, vmf->address, vmf->pte);
>> +       update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
>>  unlock:
>>         pte_unmap_unlock(vmf->pte, vmf->ptl);
>>         return ret;
>> --
>> 2.25.1
>>
>>


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 08/10] mm: Kconfig hooks to determine max anon folio allocation order
  2023-06-29  1:38     ` Yang Shi
  (?)
@ 2023-06-29 11:31       ` Ryan Roberts
  -1 siblings, 0 replies; 148+ messages in thread
From: Ryan Roberts @ 2023-06-29 11:31 UTC (permalink / raw)
  To: Yang Shi
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Yu Zhao,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On 29/06/2023 02:38, Yang Shi wrote:
> On Mon, Jun 26, 2023 at 10:15 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> For variable-order anonymous folios, we need to determine the order that
>> we will allocate. From a SW perspective, the higher the order we
>> allocate, the less overhead we will have; fewer faults, fewer folios in
>> lists, etc. But of course there will also be more memory wastage as the
>> order increases.
>>
>> From a HW perspective, there are memory block sizes that can be
>> beneficial to reducing TLB pressure. arm64, for example, has the ability
>> to map "contpte" sized chunks (64K for a 4K base page, 2M for 16K and
>> 64K base pages) such that one of these chunks only uses a single TLB
>> entry.
>>
>> So we let the architecture specify the order of the maximally beneficial
>> mapping unit when PTE-mapped. Furthermore, because in some cases, this
>> order may be quite big (and therefore potentially wasteful of memory),
>> allow the arch to specify 2 values; One is the max order for a mapping
>> that _would not_ use THP if all size and alignment constraints were met,
>> and the other is the max order for a mapping that _would_ use THP if all
>> those constraints were met.
>>
>> Implement this with Kconfig by introducing some new options to allow the
>> architecture to declare that it supports large anonymous folios along
>> with these 2 preferred max order values. Then introduce a user-facing
>> option, LARGE_ANON_FOLIO, which defaults to disabled and can only be
>> enabled if the architecture has declared its support. When disabled, it
>> forces the max order values, LARGE_ANON_FOLIO_NOTHP_ORDER_MAX and
>> LARGE_ANON_FOLIO_THP_ORDER_MAX to 0, meaning only a single page is ever
>> allocated.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  mm/Kconfig  | 39 +++++++++++++++++++++++++++++++++++++++
>>  mm/memory.c |  8 ++++++++
>>  2 files changed, 47 insertions(+)
>>
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 7672a22647b4..f4ba48c37b75 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -1208,4 +1208,43 @@ config PER_VMA_LOCK
>>
>>  source "mm/damon/Kconfig"
>>
>> +config ARCH_SUPPORTS_LARGE_ANON_FOLIO
>> +       def_bool n
>> +       help
>> +         An arch should select this symbol if it wants to allow LARGE_ANON_FOLIO
>> +         to be enabled. It must also set the following integer values:
>> +         - ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
>> +         - ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
>> +
>> +config ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
>> +       int
>> +       help
>> +         The maximum size of folio to allocate for an anonymous VMA PTE-mapping
>> +         that does not have the MADV_HUGEPAGE hint set.
>> +
>> +config ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
>> +       int
>> +       help
>> +         The maximum size of folio to allocate for an anonymous VMA PTE-mapping
>> +         that has the MADV_HUGEPAGE hint set.
>> +
>> +config LARGE_ANON_FOLIO
>> +       bool "Allocate large folios for anonymous memory"
>> +       depends on ARCH_SUPPORTS_LARGE_ANON_FOLIO
>> +       default n
>> +       help
>> +         Use large (bigger than order-0) folios to back anonymous memory where
>> +         possible. This reduces the number of page faults, as well as other
>> +         per-page overheads to improve performance for many workloads.
>> +
>> +config LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
>> +       int
>> +       default 0 if !LARGE_ANON_FOLIO
>> +       default ARCH_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX
>> +
>> +config LARGE_ANON_FOLIO_THP_ORDER_MAX
>> +       int
>> +       default 0 if !LARGE_ANON_FOLIO
>> +       default ARCH_LARGE_ANON_FOLIO_THP_ORDER_MAX
>> +
> 
> IMHO I don't think we need all of the new kconfigs. Ideally the large
> anon folios could be supported by all arches, although some of them
> may not benefit from larger TLB entries due to lack of hardware
> support.
> 
> For now with a minimum implementation, I think you could define a
> macro or a function that returns the hardware preferred order.

Thanks for the feedback - that aligns with what Yu Zhao suggested. I'm
implementing it for v2.

Thanks,
Ryan
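
For reference, a minimal sketch of the direction being discussed, in place of
the Kconfig-based helper quoted further down (the hook name
arch_wants_pte_order() comes from later discussion in this thread; everything
else here is an assumption, not the actual v2 code):

    /* Sketch only: ask the arch for its preferred order directly and clamp
     * it to a PMD-sized folio, instead of reading Kconfig values. The vma
     * parameter is kept for parity with the helper below but is unused. */
    static inline int max_anon_folio_order(struct vm_area_struct *vma)
    {
            int order = arch_wants_pte_order();  /* assumed hook name */

            return min(order, PMD_SHIFT - PAGE_SHIFT);
    }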


> 
>>  endmenu
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 9165ed1b9fc2..a8f7e2b28d7a 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3153,6 +3153,14 @@ static struct folio *try_vma_alloc_movable_folio(struct vm_area_struct *vma,
>>         return vma_alloc_movable_folio(vma, vaddr, 0, zeroed);
>>  }
>>
>> +static inline int max_anon_folio_order(struct vm_area_struct *vma)
>> +{
>> +       if (hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>> +               return CONFIG_LARGE_ANON_FOLIO_THP_ORDER_MAX;
>> +       else
>> +               return CONFIG_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX;
>> +}
>> +
>>  /*
>>   * Handle write page faults for pages that can be reused in the current vma
>>   *
>> --
>> 2.25.1
>>
>>


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 00/10] variable-order, large folios for anonymous memory
  2023-06-28 18:22         ` Yu Zhao
@ 2023-06-29 15:28           ` Ryan Roberts
  -1 siblings, 0 replies; 148+ messages in thread
From: Ryan Roberts @ 2023-06-29 15:28 UTC (permalink / raw)
  To: Yu Zhao, Yin Fengwei
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, David Hildenbrand, Catalin Marinas,
	Will Deacon, Geert Uytterhoeven, Christian Borntraeger,
	Sven Schnelle, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-kernel, linux-mm, linux-alpha,
	linux-arm-kernel, linux-ia64, linux-m68k, linux-s390

On 28/06/2023 19:22, Yu Zhao wrote:
> On Tue, Jun 27, 2023 at 3:59 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 27/06/2023 08:49, Yu Zhao wrote:
>>> On Mon, Jun 26, 2023 at 9:30 PM Yu Zhao <yuzhao@google.com> wrote:
>>>>
>>>> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>
>>>>> Hi All,
>>>>>
>>>>> Following on from the previous RFCv2 [1], this series implements variable order,
>>>>> large folios for anonymous memory. The objective of this is to improve
>>>>> performance by allocating larger chunks of memory during anonymous page faults:
>>>>>
>>>>>  - Since SW (the kernel) is dealing with larger chunks of memory than base
>>>>>    pages, there are efficiency savings to be had; fewer page faults, batched PTE
>>>>>    and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
>>>>>    overhead. This should benefit all architectures.
>>>>>  - Since we are now mapping physically contiguous chunks of memory, we can take
>>>>>    advantage of HW TLB compression techniques. A reduction in TLB pressure
>>>>>    speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>>>>>    TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>>>>>
>>>>> This patch set deals with the SW side of things only and based on feedback from
>>>>> the RFC, aims to be the most minimal initial change, upon which future
>>>>> incremental changes can be added. For this reason, the new behaviour is hidden
>>>>> behind a new Kconfig switch, CONFIG_LARGE_ANON_FOLIO, which is disabled by
>>>>> default. Although the code has been refactored to parameterize the desired order
>>>>> of the allocation, when the feature is disabled (by forcing the order to be
>>>>> always 0) my performance tests measure no regression. So I'm hoping this will be
>>>>> a suitable mechanism to allow incremental submissions to the kernel without
>>>>> affecting the rest of the world.
>>>>>
>>>>> The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
>>>>> [2], which is a hard dependency. I'm not sure of Matthew's exact plans for
>>>>> getting that series into the kernel, but I'm hoping we can start the review
>>>>> process on this patch set independently. I have a branch at [3].
>>>>>
>>>>> I've posted a separate series concerning the HW part (contpte mapping) for arm64
>>>>> at [4].
>>>>>
>>>>>
>>>>> Performance
>>>>> -----------
>>>>>
>>>>> Below results show 2 benchmarks; kernel compilation and speedometer 2.0 (a
>>>>> javascript benchmark running in Chromium). Both cases are running on Ampere
>>>>> Altra with 1 NUMA node enabled, Ubuntu 22.04 and XFS filesystem. Each benchmark
>>>>> is repeated 15 times over 5 reboots and averaged.
>>>>>
>>>>> All improvements are relative to baseline-4k. 'anonfolio-basic' is this series.
>>>>> 'anonfolio' is the full patch set similar to the RFC with the additional changes
>>>>> to the extra 3 fault paths. The rest of the configs are described at [4].
>>>>>
>>>>> Kernel Compilation (smaller is better):
>>>>>
>>>>> | kernel          |   real-time |   kern-time |   user-time |
>>>>> |:----------------|------------:|------------:|------------:|
>>>>> | baseline-4k     |        0.0% |        0.0% |        0.0% |
>>>>> | anonfolio-basic |       -5.3% |      -42.9% |       -0.6% |
>>>>> | anonfolio       |       -5.4% |      -46.0% |       -0.3% |
>>>>> | contpte         |       -6.8% |      -45.7% |       -2.1% |
>>>>> | exefolio        |       -8.4% |      -46.4% |       -3.7% |
>>>>> | baseline-16k    |       -8.7% |      -49.2% |       -3.7% |
>>>>> | baseline-64k    |      -10.5% |      -66.0% |       -3.5% |
>>>>>
>>>>> Speedometer 2.0 (bigger is better):
>>>>>
>>>>> | kernel          |   runs_per_min |
>>>>> |:----------------|---------------:|
>>>>> | baseline-4k     |           0.0% |
>>>>> | anonfolio-basic |           0.7% |
>>>>> | anonfolio       |           1.2% |
>>>>> | contpte         |           3.1% |
>>>>> | exefolio        |           4.2% |
>>>>> | baseline-16k    |           5.3% |
>>>>
>>>> Thanks for pushing this forward!
>>>>
>>>>> Changes since RFCv2
>>>>> -------------------
>>>>>
>>>>>   - Simplified series to bare minimum (on David Hildenbrand's advice)
>>>>
>>>> My impression is that this series still includes many pieces that can
>>>> be split out and discussed separately with followup series.
>>>>
>>>> (I skipped 04/10 and will look at it tomorrow.)
>>>
>>> I went through the series twice. Here's what I think a bare minimum
>>> series (easier to review/debug/land) would look like:
> 
> ===
> 
>>> 1. a new arch-specific function providing a preferred order within (0,
>>> PMD_ORDER).
>>> 2. an extended anon folio alloc API taking that order (02/10, partially).
>>> 3. an updated folio_add_new_anon_rmap() covering the large() &&
>>> !pmd_mappable() case (similar to 04/10).
>>> 4. s/folio_test_pmd_mappable/folio_test_large/ in page_remove_rmap()
>>> (06/10, reviewed-by provided).
>>> 5. finally, use the extended anon folio alloc API with the arch
>>> preferred order in do_anonymous_page() (10/10, partially).
> 
> ===
> 
>>> The rest can be split out into separate series and move forward in
>>> parallel with probably a long list of things we need/want to do.
>>
>> Thanks for the fast review - I really appreciate it!
>>
>> I've responded to many of your comments. I'd appreciate it if we can close those
>> points out, then I will work up a v2.
> 
> Thanks!
> 
> Based on the latest discussion here [1], my original list above can be
> optionally reduced to 4 patches: item 2 can be squashed into item 5.
> 
> Also please make sure we have only one global (applying to all archs)
> Kconfig option, and it should be added in item 5:
> 
>   if TRANSPARENT_HUGEPAGE
>     config FLEXIBLE/VARIABLE_THP # or whatever name you see fit
>   end if

Naming is always the hardest part. I've been calling it LARGE_ANON_FOLIO up
until now. But I think you are right that we should show that it is related to
THP, so I'll go with FLEXIBLE_THP for v2, and let people shout if they hate it.

If we are not letting the arch declare that it supports FLEXIBLE_THP, then I
think we need the default version of arch_wants_pte_order() to return a value
higher than 0 (at the moment I have it returning 0), because otherwise, for an
arch that hasn't defined its own version of arch_wants_pte_order(), FLEXIBLE_THP
on vs off will give the same result. So I propose to set the default to
ilog2(SZ_64K >> PAGE_SHIFT). Shout if you have any concerns.
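
For illustration, a minimal sketch of such a generic default (the vma parameter
and exact placement are assumptions on my side; only the return value comes from
the proposal above, and it assumes ilog2()/SZ_64K from linux/log2.h and
linux/sizes.h are in scope):

  #ifndef arch_wants_pte_order
  /* Sketch only, not the posted patch: generic fallback, overridable per arch. */
  static inline int arch_wants_pte_order(struct vm_area_struct *vma)
  {
          /* 64K worth of base pages: order 4 with 4K pages, order 0 with 64K pages. */
          return ilog2(SZ_64K >> PAGE_SHIFT);
  }
  #endif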

> 
> (How many new Kconfig options are added within arch/arm64/ is not a concern of MM.)
> 
> And please make sure it's disabled by default,

Done

> because we are still
> missing many important functions, e.g., I don't think we can mlock()
> when large() && !pmd_mappable(), see mlock_pte_range() and
> mlock_vma_folio(). We can fix it along with many things later, but we
> need to present a plan and a schedule now. Otherwise, there would be
> pushback if we try to land the series without supporting mlock().

There are other areas that I'm aware of. I'll put together a table and send it
out once I have v2 out the door (hopefully tomorrow or Monday). Hopefully we can
work together to fill it in and figure out who can do what? I'm certainly
planning to continue to push this work forwards beyond this initial patch set.

Thanks,
Ryan

> 
> Do you or Fengwei plan to take it on? (I personally don't.) If not,
> I'll try to find someone from our team to look at it. (It'd be more
> scalable if we have a coordinated group of people individually solving
> different problems.)
> 
> [1] https://lore.kernel.org/r/b2c81404-67df-f841-ef02-919e841f49f2@arm.com/


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v1 10/10] mm: Allocate large folios for anonymous memory
  2023-06-29 11:30       ` Ryan Roberts
  (?)
@ 2023-06-29 17:05         ` Yang Shi
  -1 siblings, 0 replies; 148+ messages in thread
From: Yang Shi @ 2023-06-29 17:05 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Yu Zhao,
	Catalin Marinas, Will Deacon, Geert Uytterhoeven,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-kernel, linux-mm, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-s390

On Thu, Jun 29, 2023 at 4:30 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 29/06/2023 03:13, Yang Shi wrote:
> > On Mon, Jun 26, 2023 at 10:15 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> With all of the enabler patches in place, modify the anonymous memory
> >> write allocation path so that it opportunistically attempts to allocate
> >> a large folio up to `max_anon_folio_order()` size (This value is
> >> ultimately configured by the architecture). This reduces the number of
> >> page faults, reduces the size of (e.g. LRU) lists, and generally
> >> improves performance by batching what were per-page operations into
> >> per-(large)-folio operations.
> >>
> >> If CONFIG_LARGE_ANON_FOLIO is not enabled (the default) then
> >> `max_anon_folio_order()` always returns 0, meaning we get the existing
> >> allocation behaviour.
> >>
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >> ---
> >>  mm/memory.c | 159 +++++++++++++++++++++++++++++++++++++++++++++++-----
> >>  1 file changed, 144 insertions(+), 15 deletions(-)
> >>
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index a8f7e2b28d7a..d23c44cc5092 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -3161,6 +3161,90 @@ static inline int max_anon_folio_order(struct vm_area_struct *vma)
> >>                 return CONFIG_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX;
> >>  }
> >>
> >> +/*
> >> + * Returns index of first pte that is not none, or nr if all are none.
> >> + */
> >> +static inline int check_ptes_none(pte_t *pte, int nr)
> >> +{
> >> +       int i;
> >> +
> >> +       for (i = 0; i < nr; i++) {
> >> +               if (!pte_none(ptep_get(pte++)))
> >> +                       return i;
> >> +       }
> >> +
> >> +       return nr;
> >> +}
> >> +
> >> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
> >> +{
> >> +       /*
> >> +        * The aim here is to determine what size of folio we should allocate
> >> +        * for this fault. Factors include:
> >> +        * - Order must not be higher than `order` upon entry
> >> +        * - Folio must be naturally aligned within VA space
> >> +        * - Folio must not breach boundaries of vma
> >> +        * - Folio must be fully contained inside one pmd entry
> >> +        * - Folio must not overlap any non-none ptes
> >> +        *
> >> +        * Additionally, we do not allow order-1 since this breaks assumptions
> >> +        * elsewhere in the mm; THP pages must be at least order-2 (since they
> >> +        * store state up to the 3rd struct page subpage), and these pages must
> >> +        * be THP in order to correctly use pre-existing THP infrastructure such
> >> +        * as folio_split().
> >> +        *
> >> +        * As a consequence of relying on the THP infrastructure, if the system
> >> +        * does not support THP, we always fallback to order-0.
> >> +        *
> >> +        * Note that the caller may or may not choose to lock the pte. If
> >> +        * unlocked, the calculation should be considered an estimate that will
> >> +        * need to be validated under the lock.
> >> +        */
> >> +
> >> +       struct vm_area_struct *vma = vmf->vma;
> >> +       int nr;
> >> +       unsigned long addr;
> >> +       pte_t *pte;
> >> +       pte_t *first_set = NULL;
> >> +       int ret;
> >> +
> >> +       if (has_transparent_hugepage()) {
> >> +               order = min(order, PMD_SHIFT - PAGE_SHIFT);
> >> +
> >> +               for (; order > 1; order--) {
> >> +                       nr = 1 << order;
> >> +                       addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
> >> +                       pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
> >> +
> >> +                       /* Check vma bounds. */
> >> +                       if (addr < vma->vm_start ||
> >> +                           addr + (nr << PAGE_SHIFT) > vma->vm_end)
> >> +                               continue;
> >> +
> >> +                       /* Ptes covered by order already known to be none. */
> >> +                       if (pte + nr <= first_set)
> >> +                               break;
> >> +
> >> +                       /* Already found set pte in range covered by order. */
> >> +                       if (pte <= first_set)
> >> +                               continue;
> >> +
> >> +                       /* Need to check if all the ptes are none. */
> >> +                       ret = check_ptes_none(pte, nr);
> >> +                       if (ret == nr)
> >> +                               break;
> >> +
> >> +                       first_set = pte + ret;
> >> +               }
> >> +
> >> +               if (order == 1)
> >> +                       order = 0;
> >> +       } else
> >> +               order = 0;
> >> +
> >> +       return order;
> >> +}
> >> +
> >>  /*
> >>   * Handle write page faults for pages that can be reused in the current vma
> >>   *
> >> @@ -4201,6 +4285,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >>         struct folio *folio;
> >>         vm_fault_t ret = 0;
> >>         pte_t entry;
> >> +       unsigned long addr;
> >> +       int order = uffd_wp ? 0 : max_anon_folio_order(vma);
> >> +       int pgcount = BIT(order);
> >>
> >>         /* File mapping without ->vm_ops ? */
> >>         if (vma->vm_flags & VM_SHARED)
> >> @@ -4242,24 +4329,44 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >>                         pte_unmap_unlock(vmf->pte, vmf->ptl);
> >>                         return handle_userfault(vmf, VM_UFFD_MISSING);
> >>                 }
> >> -               goto setpte;
> >> +               if (uffd_wp)
> >> +                       entry = pte_mkuffd_wp(entry);
> >> +               set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> >> +
> >> +               /* No need to invalidate - it was non-present before */
> >> +               update_mmu_cache(vma, vmf->address, vmf->pte);
> >> +               goto unlock;
> >>         }
> >>
> >> -       /* Allocate our own private page. */
> >> +retry:
> >> +       /*
> >> +        * Estimate the folio order to allocate. We are not under the ptl here
> >> +        * so this estimate needs to be re-checked later once we have the lock.
> >> +        */
> >> +       vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
> >> +       order = calc_anon_folio_order_alloc(vmf, order);
> >> +       pte_unmap(vmf->pte);
> >> +
> >> +       /* Allocate our own private folio. */
> >>         if (unlikely(anon_vma_prepare(vma)))
> >>                 goto oom;
> >> -       folio = vma_alloc_zeroed_movable_folio(vma, vmf->address, 0, 0);
> >> +       folio = try_vma_alloc_movable_folio(vma, vmf->address, order, true);
> >>         if (!folio)
> >>                 goto oom;
> >>
> >> +       /* We may have been granted less than we asked for. */
> >> +       order = folio_order(folio);
> >> +       pgcount = BIT(order);
> >> +       addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
> >> +
> >>         if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
> >>                 goto oom_free_page;
> >>         folio_throttle_swaprate(folio, GFP_KERNEL);
> >>
> >>         /*
> >>          * The memory barrier inside __folio_mark_uptodate makes sure that
> >> -        * preceding stores to the page contents become visible before
> >> -        * the set_pte_at() write.
> >> +        * preceding stores to the folio contents become visible before
> >> +        * the set_ptes() write.
> >>          */
> >>         __folio_mark_uptodate(folio);
> >>
> >> @@ -4268,11 +4375,31 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >>         if (vma->vm_flags & VM_WRITE)
> >>                 entry = pte_mkwrite(pte_mkdirty(entry));
> >>
> >> -       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> >> -                       &vmf->ptl);
> >> -       if (vmf_pte_changed(vmf)) {
> >> -               update_mmu_tlb(vma, vmf->address, vmf->pte);
> >> -               goto release;
> >> +       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
> >> +
> >> +       /*
> >> +        * Ensure our estimate above is still correct; we could have raced with
> >> +        * another thread to service a fault in the region.
> >> +        */
> >> +       if (order == 0) {
> >> +               if (vmf_pte_changed(vmf)) {
> >> +                       update_mmu_tlb(vma, vmf->address, vmf->pte);
> >> +                       goto release;
> >> +               }
> >> +       } else if (check_ptes_none(vmf->pte, pgcount) != pgcount) {
> >> +               pte_t *pte = vmf->pte + ((vmf->address - addr) >> PAGE_SHIFT);
> >> +
> >> +               /* If faulting pte was allocated by another, exit early. */
> >> +               if (!pte_none(ptep_get(pte))) {
> >> +                       update_mmu_tlb(vma, vmf->address, pte);
> >> +                       goto release;
> >> +               }
> >> +
> >> +               /* Else try again, with a lower order. */
> >> +               pte_unmap_unlock(vmf->pte, vmf->ptl);
> >> +               folio_put(folio);
> >> +               order--;
> >> +               goto retry;
> >
> > I'm not sure whether this extra fallback logic is worth it or not. Do
> > you have any benchmark data or is it just an arbitrary design choice?
> > If it is just an arbitrary design choice, I'd like to go with the
> > simplest way by just exiting the page fault handler, just like the
> > order-0 case, IMHO.
>
> Yes, it's an arbitrary design choice. Based on Yu Zhao's feedback, I'm already
> reworking this so that we only try the preferred order and order-0, no longer
> iterating through intermediate orders.
>
> I think what you are suggesting is that, if we attempt to allocate the preferred
> order and find there was a race meaning the folio would now overlap populated
> ptes (but the faulting pte is still empty), we should just exit and rely on the
> page fault being re-triggered, rather than immediately falling back to order-0?

The faulting PTE might be filled too. Yes, just exit and rely on the
CPU to re-trigger the page fault.

>
> The reason I didn't do that was that I wasn't sure whether the return path might
> have assumptions that the faulting pte is now valid if no error was returned. I guess
> another option is to return VM_FAULT_RETRY but then it seemed cleaner to do the
> retry directly here. What do you suggest?

IIRC as long as the page fault handler doesn't return an error, it is
safe to rely on the CPU re-triggering the page fault when the PTE has not
been installed.

VM_FAULT_RETRY means the page fault handler released mmap_lock (or the
per-VMA lock, with per-VMA locking enabled) because it had to wait for the
page lock. TBH I really don't want to make that semantic any more complicated
or overloaded. And I don't see any fundamental difference between
vmf_pte_changed() for an order-0 folio and overlapping PTEs having been
installed for a large folio, so I'd like to follow the same behavior.
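
To make that concrete, a rough sketch (illustration only, not the posted patch;
it reuses check_ptes_none(), pgcount and the other names from the v1 hunk quoted
above) of what the post-lock check could look like if we simply bail out and let
the CPU re-fault rather than retrying at a lower order:

  /* Sketch only: after taking the PTL over the whole large-folio range. */
  if (check_ptes_none(vmf->pte, pgcount) != pgcount) {
          /*
           * Raced with another fault in the range: drop the folio and
           * return 0. If the faulting PTE is still none, the CPU simply
           * re-faults; no VM_FAULT_RETRY, no lower-order retry loop.
           */
          pte_unmap_unlock(vmf->pte, vmf->ptl);
          folio_put(folio);
          return 0;
  }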

>
> Thanks,
> Ryan
>
>
>
> >
> >>         }
> >>
> >>         ret = check_stable_address_space(vma->vm_mm);
> >> @@ -4286,16 +4413,18 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >>                 return handle_userfault(vmf, VM_UFFD_MISSING);
> >>         }
> >>
> >> -       inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> >> -       folio_add_new_anon_rmap(folio, vma, vmf->address);
> >> +       folio_ref_add(folio, pgcount - 1);
> >> +
> >> +       add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
> >> +       folio_add_new_anon_rmap_range(folio, &folio->page, pgcount, vma, addr);
> >>         folio_add_lru_vma(folio, vma);
> >> -setpte:
> >> +
> >>         if (uffd_wp)
> >>                 entry = pte_mkuffd_wp(entry);
> >> -       set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> >> +       set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
> >>
> >>         /* No need to invalidate - it was non-present before */
> >> -       update_mmu_cache(vma, vmf->address, vmf->pte);
> >> +       update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
> >>  unlock:
> >>         pte_unmap_unlock(vmf->pte, vmf->ptl);
> >>         return ret;
> >> --
> >> 2.25.1
> >>
> >>
>

^ permalink raw reply	[flat|nested] 148+ messages in thread

end of thread, other threads:[~2023-06-29 17:06 UTC | newest]

Thread overview: 148+ messages
2023-06-26 17:14 [PATCH v1 00/10] variable-order, large folios for anonymous memory Ryan Roberts
2023-06-26 17:14 ` Ryan Roberts
2023-06-26 17:14 ` [PATCH v1 01/10] mm: Expose clear_huge_page() unconditionally Ryan Roberts
2023-06-26 17:14   ` Ryan Roberts
2023-06-27  1:55   ` Yu Zhao
2023-06-27  1:55     ` Yu Zhao
2023-06-27  1:55     ` Yu Zhao
2023-06-27  7:21     ` Ryan Roberts
2023-06-27  7:21       ` Ryan Roberts
2023-06-27  7:21       ` Ryan Roberts
2023-06-27  8:29       ` Yu Zhao
2023-06-27  8:29         ` Yu Zhao
2023-06-27  8:29         ` Yu Zhao
2023-06-27  9:41         ` Ryan Roberts
2023-06-27  9:41           ` Ryan Roberts
2023-06-27  9:41           ` Ryan Roberts
2023-06-27 18:26           ` Yu Zhao
2023-06-27 18:26             ` Yu Zhao
2023-06-27 18:26             ` Yu Zhao
2023-06-28 10:56             ` Ryan Roberts
2023-06-28 10:56               ` Ryan Roberts
2023-06-28 10:56               ` Ryan Roberts
2023-06-26 17:14 ` [PATCH v1 02/10] mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio() Ryan Roberts
2023-06-26 17:14   ` Ryan Roberts
2023-06-27  2:27   ` Yu Zhao
2023-06-27  2:27     ` Yu Zhao
2023-06-27  2:27     ` Yu Zhao
2023-06-27  7:27     ` Ryan Roberts
2023-06-27  7:27       ` Ryan Roberts
2023-06-27  7:27       ` Ryan Roberts
2023-06-26 17:14 ` [PATCH v1 03/10] mm: Introduce try_vma_alloc_movable_folio() Ryan Roberts
2023-06-26 17:14   ` Ryan Roberts
2023-06-27  2:34   ` Yu Zhao
2023-06-27  2:34     ` Yu Zhao
2023-06-27  2:34     ` Yu Zhao
2023-06-27  5:29     ` Yu Zhao
2023-06-27  5:29       ` Yu Zhao
2023-06-27  5:29       ` Yu Zhao
2023-06-27  7:56       ` Ryan Roberts
2023-06-27  7:56         ` Ryan Roberts
2023-06-27  7:56         ` Ryan Roberts
2023-06-28  2:32         ` Yin Fengwei
2023-06-28  2:32           ` Yin Fengwei
2023-06-28  2:32           ` Yin Fengwei
2023-06-28 11:06           ` Ryan Roberts
2023-06-28 11:06             ` Ryan Roberts
2023-06-26 17:14 ` [PATCH v1 04/10] mm: Implement folio_add_new_anon_rmap_range() Ryan Roberts
2023-06-26 17:14   ` Ryan Roberts
2023-06-27  7:08   ` Yu Zhao
2023-06-27  7:08     ` Yu Zhao
2023-06-27  7:08     ` Yu Zhao
2023-06-27  8:09     ` Ryan Roberts
2023-06-27  8:09       ` Ryan Roberts
2023-06-27  8:09       ` Ryan Roberts
2023-06-28  2:20       ` Yin Fengwei
2023-06-28  2:20         ` Yin Fengwei
2023-06-28  2:20         ` Yin Fengwei
2023-06-28 11:09         ` Ryan Roberts
2023-06-28 11:09           ` Ryan Roberts
2023-06-28 11:09           ` Ryan Roberts
2023-06-28  2:17     ` Yin Fengwei
2023-06-28  2:17       ` Yin Fengwei
2023-06-28  2:17       ` Yin Fengwei
2023-06-26 17:14 ` [PATCH v1 05/10] mm: Implement folio_remove_rmap_range() Ryan Roberts
2023-06-26 17:14   ` Ryan Roberts
2023-06-27  3:06   ` Yu Zhao
2023-06-27  3:06     ` Yu Zhao
2023-06-27  3:06     ` Yu Zhao
2023-06-26 17:14 ` [PATCH v1 06/10] mm: Allow deferred splitting of arbitrary large anon folios Ryan Roberts
2023-06-26 17:14   ` Ryan Roberts
2023-06-27  2:54   ` Yu Zhao
2023-06-27  2:54     ` Yu Zhao
2023-06-27  2:54     ` Yu Zhao
2023-06-28  2:43   ` Yin Fengwei
2023-06-28  2:43     ` Yin Fengwei
2023-06-28  2:43     ` Yin Fengwei
2023-06-26 17:14 ` [PATCH v1 07/10] mm: Batch-zap large anonymous folio PTE mappings Ryan Roberts
2023-06-26 17:14   ` Ryan Roberts
2023-06-27  3:04   ` Yu Zhao
2023-06-27  3:04     ` Yu Zhao
2023-06-27  3:04     ` Yu Zhao
2023-06-27  9:46     ` Ryan Roberts
2023-06-27  9:46       ` Ryan Roberts
2023-06-27  9:46       ` Ryan Roberts
2023-06-26 17:14 ` [PATCH v1 08/10] mm: Kconfig hooks to determine max anon folio allocation order Ryan Roberts
2023-06-26 17:14   ` Ryan Roberts
2023-06-27  2:47   ` Yu Zhao
2023-06-27  2:47     ` Yu Zhao
2023-06-27  2:47     ` Yu Zhao
2023-06-27  9:54     ` Ryan Roberts
2023-06-27  9:54       ` Ryan Roberts
2023-06-27  9:54       ` Ryan Roberts
2023-06-29  1:38   ` Yang Shi
2023-06-29  1:38     ` Yang Shi
2023-06-29  1:38     ` Yang Shi
2023-06-29 11:31     ` Ryan Roberts
2023-06-29 11:31       ` Ryan Roberts
2023-06-29 11:31       ` Ryan Roberts
2023-06-26 17:14 ` [PATCH v1 09/10] arm64: mm: Declare support for large anonymous folios Ryan Roberts
2023-06-26 17:14   ` Ryan Roberts
2023-06-27  2:53   ` Yu Zhao
2023-06-27  2:53     ` Yu Zhao
2023-06-27  2:53     ` Yu Zhao
2023-06-26 17:14 ` [PATCH v1 10/10] mm: Allocate large folios for anonymous memory Ryan Roberts
2023-06-26 17:14   ` Ryan Roberts
2023-06-27  3:01   ` Yu Zhao
2023-06-27  3:01     ` Yu Zhao
2023-06-27  3:01     ` Yu Zhao
2023-06-27  9:57     ` Ryan Roberts
2023-06-27  9:57       ` Ryan Roberts
2023-06-27  9:57       ` Ryan Roberts
2023-06-27 18:33       ` Yu Zhao
2023-06-27 18:33         ` Yu Zhao
2023-06-27 18:33         ` Yu Zhao
2023-06-29  2:13   ` Yang Shi
2023-06-29  2:13     ` Yang Shi
2023-06-29  2:13     ` Yang Shi
2023-06-29 11:30     ` Ryan Roberts
2023-06-29 11:30       ` Ryan Roberts
2023-06-29 11:30       ` Ryan Roberts
2023-06-29 17:05       ` Yang Shi
2023-06-29 17:05         ` Yang Shi
2023-06-29 17:05         ` Yang Shi
2023-06-27  3:30 ` [PATCH v1 00/10] variable-order, " Yu Zhao
2023-06-27  3:30   ` Yu Zhao
2023-06-27  3:30   ` Yu Zhao
2023-06-27  7:49   ` Yu Zhao
2023-06-27  7:49     ` Yu Zhao
2023-06-27  7:49     ` Yu Zhao
2023-06-27  9:59     ` Ryan Roberts
2023-06-27  9:59       ` Ryan Roberts
2023-06-27  9:59       ` Ryan Roberts
2023-06-28 18:22       ` Yu Zhao
2023-06-28 18:22         ` Yu Zhao
2023-06-28 23:59         ` Yin Fengwei
2023-06-28 23:59           ` Yin Fengwei
2023-06-28 23:59           ` Yin Fengwei
2023-06-29  0:27           ` Yu Zhao
2023-06-29  0:27             ` Yu Zhao
2023-06-29  0:27             ` Yu Zhao
2023-06-29  0:31             ` Yin Fengwei
2023-06-29  0:31               ` Yin Fengwei
2023-06-29  0:31               ` Yin Fengwei
2023-06-29 15:28         ` Ryan Roberts
2023-06-29 15:28           ` Ryan Roberts
2023-06-29  2:21     ` Yang Shi
2023-06-29  2:21       ` Yang Shi
2023-06-29  2:21       ` Yang Shi
