* [PATCH v2 0/5] variable-order, large folios for anonymous memory
@ 2023-07-03 13:53 ` Ryan Roberts
From: Ryan Roberts @ 2023-07-03 13:53 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm

Hi All,

This is v2 of a series to implement variable order, large folios for anonymous
memory. The objective of this is to improve performance by allocating larger
chunks of memory during anonymous page faults. See [1] for background.

I've significantly reworked and simplified the patch set based on comments from
Yu Zhao (thanks for all your feedback!). I've also renamed the feature to
FLEXIBLE_THP, on Yu's advice.

The last patch is for arm64 to explicitly override the default
arch_wants_pte_order() and is intended as an example. If this series is accepted
I suggest taking the first 4 patches through the mm tree and the arm64 change
could be handled through the arm64 tree separately. Neither has any build
dependency on the other.

The one area where I haven't followed Yu's advice is in the determination of the
size of folio to use. It was suggested that I have a single preferred large
order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or
overlapping existing populated PTEs, etc.) then fall back immediately to
order-0. It turned out that this approach caused a performance regression in the
Speedometer benchmark. With my v1 patch, there were significant quantities of
memory which could not be placed in the 64K bucket and were instead being
allocated for the 32K and 16K buckets. With the proposed simplification, that
memory ended up using the 4K bucket, so page faults increased by 2.75x compared
to the v1 patch (although due to the 64K bucket, this number is still a bit
lower than the baseline). So instead, I continue to calculate a folio order that
is somewhere between the preferred order and 0 (see below for more details).
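
For anyone skimming, a rough userspace-only sketch of the selection idea
follows (purely illustrative and not part of the series; pick_order() is a
made-up name, and the real logic is calc_anon_folio_order_alloc() in patch 4,
which walks actual ptes and re-checks under the ptl):

  #include <stdbool.h>
  #include <stdio.h>

  /*
   * Illustrative only: choose the largest order <= preferred_order whose
   * naturally aligned block around 'idx' fits inside [0, nr_slots) and
   * contains no populated slots. Order-1 is skipped, as in the series.
   */
  static int pick_order(const bool *populated, int nr_slots, int idx,
                        int preferred_order)
  {
          for (int order = preferred_order; order > 1; order--) {
                  int nr = 1 << order;
                  int start = idx & ~(nr - 1);    /* natural alignment */
                  bool clear = (start + nr <= nr_slots);

                  for (int i = start; clear && i < start + nr; i++)
                          clear = !populated[i];

                  if (clear)
                          return order;
          }
          return 0;
  }

  int main(void)
  {
          bool ptes[16] = { [3] = true };         /* one pre-populated pte */

          /* Fault at index 8: slots 8-15 are all free, so order-3 wins. */
          printf("chosen order: %d\n", pick_order(ptes, 16, 8, 4));
          return 0;
  }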

The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
[2], which is a hard dependency. I have a branch at [3].


Changes since v1 [1]
--------------------

  - removed changes to arch-dependent vma_alloc_zeroed_movable_folio()
  - replaced with arch-independent alloc_anon_folio()
      - follows THP allocation approach
  - no longer retry with intermediate orders if allocation fails
      - fall back directly to order-0
  - remove folio_add_new_anon_rmap_range() patch
      - instead add its new functionality to folio_add_new_anon_rmap()
  - remove batch-zap pte mappings optimization patch
      - remove enabler folio_remove_rmap_range() patch too
      - these offer a real perf improvement so will be submitted separately
  - simplify Kconfig
      - single FLEXIBLE_THP option, which is independent of arch
      - depends on TRANSPARENT_HUGEPAGE
      - when enabled default to max anon folio size of 64K unless arch
        explicitly overrides
  - simplify changes to do_anonymous_page():
      - no more retry loop


Performance
-----------

The results below cover 3 benchmarks: kernel compilation with 8 jobs, kernel
compilation with 80 jobs, and Speedometer 2.0 (a JavaScript benchmark running in
Chromium). All cases are running on Ampere Altra with 1 NUMA node enabled,
Ubuntu 22.04 and XFS filesystem. Each benchmark is repeated 15 times over 5
reboots and averaged.

'anonfolio-lkml-v1' is the v1 patchset at [1]. 'anonfolio-lkml-v2' is this v2
patchset. 'anonfolio-lkml-v2-simple-order' is anonfolio-lkml-v2 but with the
order selection simplification that Yu Zhao suggested - it is included to
justify why I did not follow that advice.


Kernel compilation with 8 jobs:

| kernel                         |   real-time |   kern-time |   user-time |
|:-------------------------------|------------:|------------:|------------:|
| baseline-4k                    |        0.0% |        0.0% |        0.0% |
| anonfolio-lkml-v1              |       -5.3% |      -42.9% |       -0.6% |
| anonfolio-lkml-v2-simple-order |       -4.4% |      -36.5% |       -0.4% |
| anonfolio-lkml-v2              |       -4.8% |      -38.6% |       -0.6% |

We can see that the simple-order approach is responsible for a 0.4% real-time
regression relative to v2.


Kernel compilation with 80 jobs:

| kernel                         |   real-time |   kern-time |   user-time |
|:-------------------------------|------------:|------------:|------------:|
| baseline-4k                    |        0.0% |        0.0% |        0.0% |
| anonfolio-lkml-v1              |       -4.6% |      -45.7% |        1.4% |
| anonfolio-lkml-v2-simple-order |       -4.7% |      -40.2% |       -0.1% |
| anonfolio-lkml-v2              |       -5.0% |      -42.6% |       -0.3% |

simple-order costs 0.3% of real-time here. v2 actually performs better than v1
on real-time, thanks to fixing the v1 user-time regression.


Speedometer 2.0:

| kernel                         |   runs_per_min |
|:-------------------------------|---------------:|
| baseline-4k                    |           0.0% |
| anonfolio-lkml-v1              |           0.7% |
| anonfolio-lkml-v2-simple-order |          -0.9% |
| anonfolio-lkml-v2              |           0.5% |

simple-order regresses performance by 0.9% vs the baseline, for a total negative
swing of 1.6% vs v1. This is fixed by keeping the more complex order selection
mechanism from v1.


The remaining (kernel time) performance gap between v1 and v2 for the above
benchmarks is due to the removal of the "batch zap" patch in v2. Adding that
back in gives us the performance back. I intend to submit that as a separate
series once this series is accepted.


[1] https://lore.kernel.org/linux-mm/20230626171430.3167004-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
[3] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anonfolio-lkml_v2

Thanks,
Ryan


Ryan Roberts (5):
  mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
  mm: Allow deferred splitting of arbitrary large anon folios
  mm: Default implementation of arch_wants_pte_order()
  mm: FLEXIBLE_THP for improved performance
  arm64: mm: Override arch_wants_pte_order()

 arch/arm64/Kconfig               |  12 +++
 arch/arm64/include/asm/pgtable.h |   4 +
 arch/arm64/mm/mmu.c              |   8 ++
 include/linux/pgtable.h          |  13 +++
 mm/Kconfig                       |  10 ++
 mm/memory.c                      | 168 ++++++++++++++++++++++++++++---
 mm/rmap.c                        |  28 ++++--
 7 files changed, 222 insertions(+), 21 deletions(-)

--
2.25.1



* [PATCH v2 1/5] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
  2023-07-03 13:53 ` Ryan Roberts
@ 2023-07-03 13:53   ` Ryan Roberts
From: Ryan Roberts @ 2023-07-03 13:53 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm

In preparation for FLEXIBLE_THP support, improve
folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be
passed to it. In this case, all contained pages are accounted using the
"small" pages scheme.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/rmap.c | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 1d8369549424..82ef5ba363d1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1278,31 +1278,43 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
  * This means the inc-and-test can be bypassed.
  * The folio does not have to be locked.
  *
- * If the folio is large, it is accounted as a THP.  As the folio
+ * If the folio is pmd-mappable, it is accounted as a THP.  As the folio
  * is new, it's assumed to be mapped exclusively by a single process.
  */
 void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 		unsigned long address)
 {
-	int nr;
+	int nr = folio_nr_pages(folio);
+	int i;
+	struct page *page;
 
-	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+	VM_BUG_ON_VMA(address < vma->vm_start ||
+			address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
 	__folio_set_swapbacked(folio);
 
-	if (likely(!folio_test_pmd_mappable(folio))) {
+	if (!folio_test_large(folio)) {
 		/* increment count (starts at -1) */
 		atomic_set(&folio->_mapcount, 0);
-		nr = 1;
+		__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
+	} else if (!folio_test_pmd_mappable(folio)) {
+		/* increment count (starts at 0) */
+		atomic_set(&folio->_nr_pages_mapped, nr);
+
+		page = &folio->page;
+		for (i = 0; i < nr; i++, page++, address += PAGE_SIZE) {
+			/* increment count (starts at -1) */
+			atomic_set(&page->_mapcount, 0);
+			__page_set_anon_rmap(folio, page, vma, address, 1);
+		}
 	} else {
 		/* increment count (starts at -1) */
 		atomic_set(&folio->_entire_mapcount, 0);
 		atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);
-		nr = folio_nr_pages(folio);
 		__lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
+		__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
 	}
 
 	__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
-	__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
 }
 
 /**
-- 
2.25.1



* [PATCH v2 2/5] mm: Allow deferred splitting of arbitrary large anon folios
  2023-07-03 13:53 ` Ryan Roberts
@ 2023-07-03 13:53   ` Ryan Roberts
From: Ryan Roberts @ 2023-07-03 13:53 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm

With the introduction of large folios for anonymous memory, we would
like to be able to split them when they have unmapped subpages, in order
to free those unused pages under memory pressure. So remove the
artificial requirement that the large folio be at least PMD-sized.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
---
 mm/rmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 82ef5ba363d1..bbcb2308a1c5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1474,7 +1474,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
 		 * page of the folio is unmapped and at least one page
 		 * is still mapped.
 		 */
-		if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
+		if (folio_test_large(folio) && folio_test_anon(folio))
 			if (!compound || nr < nr_pmdmapped)
 				deferred_split_folio(folio);
 	}
-- 
2.25.1



* [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
  2023-07-03 13:53 ` Ryan Roberts
@ 2023-07-03 13:53   ` Ryan Roberts
From: Ryan Roberts @ 2023-07-03 13:53 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm

arch_wants_pte_order() can be overridden by the arch to return the
preferred folio order for pte-mapped memory. This is useful as some
architectures (e.g. arm64) can coalesce TLB entries when the physical
memory is suitably contiguous.

The first user for this hint will be FLEXIBLE_THP, which aims to
allocate large folios for anonymous memory to reduce page faults and
other per-page operation costs.

Here we add the default implementation of the function, used when the
architecture does not define it, which returns the order corresponding
to 64K.
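
As a quick sanity check of that default (not part of the patch; the arithmetic
simply plugs in the usual page sizes):

  ilog2(SZ_64K >> PAGE_SHIFT) == ilog2(65536 / 4096)  == ilog2(16) == 4   /* 4K pages  */
  ilog2(SZ_64K >> PAGE_SHIFT) == ilog2(65536 / 16384) == ilog2(4)  == 2   /* 16K pages */
  ilog2(SZ_64K >> PAGE_SHIFT) == ilog2(65536 / 65536) == ilog2(1)  == 0   /* 64K pages */

In other words, the default always corresponds to a 64K folio regardless of the
base page size, which matches the arm64 "NOTHP" orders chosen in patch 5.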

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/pgtable.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index a661a17173fa..f7e38598f20b 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -13,6 +13,7 @@
 #include <linux/errno.h>
 #include <asm-generic/pgtable_uffd.h>
 #include <linux/page_table_check.h>
+#include <linux/sizes.h>
 
 #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
 	defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
@@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void)
 }
 #endif
 
+#ifndef arch_wants_pte_order
+/*
+ * Returns preferred folio order for pte-mapped memory. Must be in range [0,
+ * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
+ * to be at least order-2.
+ */
+static inline int arch_wants_pte_order(struct vm_area_struct *vma)
+{
+	return ilog2(SZ_64K >> PAGE_SHIFT);
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				       unsigned long address,
-- 
2.25.1



* [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-03 13:53 ` Ryan Roberts
@ 2023-07-03 13:53   ` Ryan Roberts
From: Ryan Roberts @ 2023-07-03 13:53 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm

Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
allocated in large folios of a specified order. All pages of the large
folio are pte-mapped during the same page fault, significantly reducing
the number of page faults. The number of per-page operations (e.g. ref
counting, rmap management, lru list management) is also significantly
reduced, since those ops now become per-folio.

The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
defaults to disabled for now; there is a long list of todos to make
FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some
madvise ops, etc). These items will be tackled in subsequent patches.

When enabled, the preferred folio order is as returned by
arch_wants_pte_order(), which may be overridden by the arch as it sees
fit. Some architectures (e.g. arm64) can coalesce TLB entries if a
contiguous set of ptes maps physically contiguous, naturally aligned
memory, so this mechanism allows the architecture to optimize as
required.

If the preferred order can't be used (e.g. because the folio would
breach the bounds of the vma, or because ptes in the region are already
mapped) then we fall back to a suitable lower order.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/Kconfig  |  10 ++++
 mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 165 insertions(+), 13 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 7672a22647b4..1c06b2c0a24e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS
 	  support of file THPs will be developed in the next few release
 	  cycles.
 
+config FLEXIBLE_THP
+	bool "Flexible order THP"
+	depends on TRANSPARENT_HUGEPAGE
+	default n
+	help
+	  Use large (bigger than order-0) folios to back anonymous memory where
+	  possible, even if the order of the folio is smaller than the PMD
+	  order. This reduces the number of page faults, as well as other
+	  per-page overheads to improve performance for many workloads.
+
 endif # TRANSPARENT_HUGEPAGE
 
 #
diff --git a/mm/memory.c b/mm/memory.c
index fb30f7523550..abe2ea94f3f5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
 	return 0;
 }
 
+#ifdef CONFIG_FLEXIBLE_THP
+/*
+ * Allocates, zeros and returns a folio of the requested order for use as
+ * anonymous memory.
+ */
+static struct folio *alloc_anon_folio(struct vm_area_struct *vma,
+				      unsigned long addr, int order)
+{
+	gfp_t gfp;
+	struct folio *folio;
+
+	if (order == 0)
+		return vma_alloc_zeroed_movable_folio(vma, addr);
+
+	gfp = vma_thp_gfp_mask(vma);
+	folio = vma_alloc_folio(gfp, order, vma, addr, true);
+	if (folio)
+		clear_huge_page(&folio->page, addr, folio_nr_pages(folio));
+
+	return folio;
+}
+
+/*
+ * Preferred folio order to allocate for anonymous memory.
+ */
+#define max_anon_folio_order(vma)	arch_wants_pte_order(vma)
+#else
+#define alloc_anon_folio(vma, addr, order) \
+				vma_alloc_zeroed_movable_folio(vma, addr)
+#define max_anon_folio_order(vma)	0
+#endif
+
+/*
+ * Returns index of first pte that is not none, or nr if all are none.
+ */
+static inline int check_ptes_none(pte_t *pte, int nr)
+{
+	int i;
+
+	for (i = 0; i < nr; i++) {
+		if (!pte_none(ptep_get(pte++)))
+			return i;
+	}
+
+	return nr;
+}
+
+static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
+{
+	/*
+	 * The aim here is to determine what size of folio we should allocate
+	 * for this fault. Factors include:
+	 * - Order must not be higher than `order` upon entry
+	 * - Folio must be naturally aligned within VA space
+	 * - Folio must be fully contained inside one pmd entry
+	 * - Folio must not breach boundaries of vma
+	 * - Folio must not overlap any non-none ptes
+	 *
+	 * Additionally, we do not allow order-1 since this breaks assumptions
+	 * elsewhere in the mm; THP pages must be at least order-2 (since they
+	 * store state up to the 3rd struct page subpage), and these pages must
+	 * be THP in order to correctly use pre-existing THP infrastructure such
+	 * as folio_split().
+	 *
+	 * Note that the caller may or may not choose to lock the pte. If
+	 * unlocked, the result is racy and the user must re-check any overlap
+	 * with non-none ptes under the lock.
+	 */
+
+	struct vm_area_struct *vma = vmf->vma;
+	int nr;
+	unsigned long addr;
+	pte_t *pte;
+	pte_t *first_set = NULL;
+	int ret;
+
+	order = min(order, PMD_SHIFT - PAGE_SHIFT);
+
+	for (; order > 1; order--) {
+		nr = 1 << order;
+		addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
+		pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
+
+		/* Check vma bounds. */
+		if (addr < vma->vm_start ||
+		    addr + (nr << PAGE_SHIFT) > vma->vm_end)
+			continue;
+
+		/* Ptes covered by order already known to be none. */
+		if (pte + nr <= first_set)
+			break;
+
+		/* Already found set pte in range covered by order. */
+		if (pte <= first_set)
+			continue;
+
+		/* Need to check if all the ptes are none. */
+		ret = check_ptes_none(pte, nr);
+		if (ret == nr)
+			break;
+
+		first_set = pte + ret;
+	}
+
+	if (order == 1)
+		order = 0;
+
+	return order;
+}
+
 /*
  * Handle write page faults for pages that can be reused in the current vma
  *
@@ -3073,7 +3183,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		goto oom;
 
 	if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
-		new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+		new_folio = alloc_anon_folio(vma, vmf->address, 0);
 		if (!new_folio)
 			goto oom;
 	} else {
@@ -4040,6 +4150,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	struct folio *folio;
 	vm_fault_t ret = 0;
 	pte_t entry;
+	int order;
+	int pgcount;
+	unsigned long addr;
 
 	/* File mapping without ->vm_ops ? */
 	if (vma->vm_flags & VM_SHARED)
@@ -4081,24 +4194,51 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
 			return handle_userfault(vmf, VM_UFFD_MISSING);
 		}
-		goto setpte;
+		if (uffd_wp)
+			entry = pte_mkuffd_wp(entry);
+		set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+
+		/* No need to invalidate - it was non-present before */
+		update_mmu_cache(vma, vmf->address, vmf->pte);
+		goto unlock;
+	}
+
+	/*
+	 * If allocating a large folio, determine the biggest suitable order for
+	 * the VMA (e.g. it must not exceed the VMA's bounds, it must not
+	 * overlap with any populated PTEs, etc). We are not under the ptl here
+	 * so we will need to re-check that we are not overlapping any populated
+	 * PTEs once we have the lock.
+	 */
+	order = uffd_wp ? 0 : max_anon_folio_order(vma);
+	if (order > 0) {
+		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+		order = calc_anon_folio_order_alloc(vmf, order);
+		pte_unmap(vmf->pte);
 	}
 
-	/* Allocate our own private page. */
+	/* Allocate our own private folio. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+	folio = alloc_anon_folio(vma, vmf->address, order);
+	if (!folio && order > 0) {
+		order = 0;
+		folio = alloc_anon_folio(vma, vmf->address, order);
+	}
 	if (!folio)
 		goto oom;
 
+	pgcount = 1 << order;
+	addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
+
 	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
 		goto oom_free_page;
 	folio_throttle_swaprate(folio, GFP_KERNEL);
 
 	/*
 	 * The memory barrier inside __folio_mark_uptodate makes sure that
-	 * preceding stores to the page contents become visible before
-	 * the set_pte_at() write.
+	 * preceding stores to the folio contents become visible before
+	 * the set_ptes() write.
 	 */
 	__folio_mark_uptodate(folio);
 
@@ -4107,11 +4247,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	if (vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry));
 
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
+	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
 	if (vmf_pte_changed(vmf)) {
 		update_mmu_tlb(vma, vmf->address, vmf->pte);
 		goto release;
+	} else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) {
+		goto release;
 	}
 
 	ret = check_stable_address_space(vma->vm_mm);
@@ -4125,16 +4266,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		return handle_userfault(vmf, VM_UFFD_MISSING);
 	}
 
-	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
-	folio_add_new_anon_rmap(folio, vma, vmf->address);
+	folio_ref_add(folio, pgcount - 1);
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
+	folio_add_new_anon_rmap(folio, vma, addr);
 	folio_add_lru_vma(folio, vma);
-setpte:
+
 	if (uffd_wp)
 		entry = pte_mkuffd_wp(entry);
-	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+	set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
 
 	/* No need to invalidate - it was non-present before */
-	update_mmu_cache(vma, vmf->address, vmf->pte);
+	update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
 unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	return ret;
-- 
2.25.1



* [PATCH v2 5/5] arm64: mm: Override arch_wants_pte_order()
  2023-07-03 13:53 ` Ryan Roberts
@ 2023-07-03 13:53   ` Ryan Roberts
From: Ryan Roberts @ 2023-07-03 13:53 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm

Define an arch-specific override of arch_wants_pte_order() so that when
FLEXIBLE_THP is enabled, large folios will be allocated for anonymous
memory with an order that is compatible with arm64's contpte mappings.

arch_wants_pte_order() returns an order according to the following
policy: For the unhinted case, when THP is not requested for the vma,
don't allow anything bigger than 64K. This means we don't waste too much
memory. Additionally, for 4K pages this is the contpte size, and for
16K, this is (usually) the HPA size when the uarch feature is
implemented. For the hinted case, when THP is requested for the vma,
allow the contpte size for all page size configurations: 64K for 4K, 2M
for 16K, and 2M for 64K.

Additionally, the THP and NOTHP order constants are defined using
Kconfig so it is possible to override them at build time.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/Kconfig               | 12 ++++++++++++
 arch/arm64/include/asm/pgtable.h |  4 ++++
 arch/arm64/mm/mmu.c              |  8 ++++++++
 3 files changed, 24 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 343e1e1cae10..689c5bf13dc1 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -281,6 +281,18 @@ config ARM64_CONT_PMD_SHIFT
 	default 5 if ARM64_16K_PAGES
 	default 4
 
+config ARM64_PTE_ORDER_NOTHP
+	int
+	default 0 if ARM64_64K_PAGES	# 64K (1 page)
+	default 2 if ARM64_16K_PAGES	# 64K (4 pages; benefits from HPA where HW supports it)
+	default 4 if ARM64_4K_PAGES	# 64K (16 pages; eligible for contpte-mapping)
+
+config ARM64_PTE_ORDER_THP
+	int
+	default 5 if ARM64_64K_PAGES	# 2M  (32 pages; eligible for contpte-mapping)
+	default 7 if ARM64_16K_PAGES	# 2M  (128 pages; eligible for contpte-mapping)
+	default 4 if ARM64_4K_PAGES	# 64K (16 pages; eligible for contpte-mapping)
+
 config ARCH_MMAP_RND_BITS_MIN
 	default 14 if ARM64_64K_PAGES
 	default 16 if ARM64_16K_PAGES
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 6fd012663a01..8463d5f9f307 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1117,6 +1117,10 @@ extern pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
 extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
 				    unsigned long addr, pte_t *ptep,
 				    pte_t old_pte, pte_t new_pte);
+
+#define arch_wants_pte_order arch_wants_pte_order
+extern int arch_wants_pte_order(struct vm_area_struct *vma);
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index af6bc8403ee4..8556c4a9b507 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1481,3 +1481,11 @@ void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte
 {
 	set_pte_at(vma->vm_mm, addr, ptep, pte);
 }
+
+int arch_wants_pte_order(struct vm_area_struct *vma)
+{
+	if (hugepage_vma_check(vma, vma->vm_flags, false, true, true))
+		return CONFIG_ARM64_PTE_ORDER_THP;
+	else
+		return CONFIG_ARM64_PTE_ORDER_NOTHP;
+}
-- 
2.25.1



* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-03 13:53   ` Ryan Roberts
@ 2023-07-03 15:51     ` kernel test robot
From: kernel test robot @ 2023-07-03 15:51 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov,
	Yin Fengwei, David Hildenbrand, Yu Zhao, Catalin Marinas,
	Will Deacon, Anshuman Khandual, Yang Shi
  Cc: llvm, oe-kbuild-all, Linux Memory Management List, Ryan Roberts,
	linux-arm-kernel, linux-kernel

Hi Ryan,

kernel test robot noticed the following build errors:

[auto build test ERROR on arm64/for-next/core]
[also build test ERROR on v6.4]
[cannot apply to akpm-mm/mm-everything linus/master next-20230703]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Non-pmd-mappable-large-folios-for-folio_add_new_anon_rmap/20230703-215627
base:   https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core
patch link:    https://lore.kernel.org/r/20230703135330.1865927-5-ryan.roberts%40arm.com
patch subject: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
config: um-allyesconfig (https://download.01.org/0day-ci/archive/20230703/202307032325.u93xmWbG-lkp@intel.com/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project.git f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce: (https://download.01.org/0day-ci/archive/20230703/202307032325.u93xmWbG-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202307032325.u93xmWbG-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from mm/memory.c:42:
   In file included from include/linux/kernel_stat.h:9:
   In file included from include/linux/interrupt.h:11:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/um/include/asm/hardirq.h:5:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/um/include/asm/io.h:24:
   include/asm-generic/io.h:547:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __raw_readb(PCI_IOBASE + addr);
                             ~~~~~~~~~~ ^
   include/asm-generic/io.h:560:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
                                                           ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu'
   #define __le16_to_cpu(x) ((__force __u16)(__le16)(x))
                                                     ^
   In file included from mm/memory.c:42:
   In file included from include/linux/kernel_stat.h:9:
   In file included from include/linux/interrupt.h:11:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/um/include/asm/hardirq.h:5:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/um/include/asm/io.h:24:
   include/asm-generic/io.h:573:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
                                                           ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from macro '__le32_to_cpu'
   #define __le32_to_cpu(x) ((__force __u32)(__le32)(x))
                                                     ^
   In file included from mm/memory.c:42:
   In file included from include/linux/kernel_stat.h:9:
   In file included from include/linux/interrupt.h:11:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/um/include/asm/hardirq.h:5:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/um/include/asm/io.h:24:
   include/asm-generic/io.h:584:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writeb(value, PCI_IOBASE + addr);
                               ~~~~~~~~~~ ^
   include/asm-generic/io.h:594:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
                                                         ~~~~~~~~~~ ^
   include/asm-generic/io.h:604:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
                                                         ~~~~~~~~~~ ^
   include/asm-generic/io.h:692:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           readsb(PCI_IOBASE + addr, buffer, count);
                  ~~~~~~~~~~ ^
   include/asm-generic/io.h:700:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           readsw(PCI_IOBASE + addr, buffer, count);
                  ~~~~~~~~~~ ^
   include/asm-generic/io.h:708:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           readsl(PCI_IOBASE + addr, buffer, count);
                  ~~~~~~~~~~ ^
   include/asm-generic/io.h:717:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           writesb(PCI_IOBASE + addr, buffer, count);
                   ~~~~~~~~~~ ^
   include/asm-generic/io.h:726:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           writesw(PCI_IOBASE + addr, buffer, count);
                   ~~~~~~~~~~ ^
   include/asm-generic/io.h:735:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           writesl(PCI_IOBASE + addr, buffer, count);
                   ~~~~~~~~~~ ^
>> mm/memory.c:4271:2: error: implicit declaration of function 'set_ptes' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
           set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
           ^
   mm/memory.c:4271:2: note: did you mean 'set_pte'?
   arch/um/include/asm/pgtable.h:232:20: note: 'set_pte' declared here
   static inline void set_pte(pte_t *pteptr, pte_t pteval)
                      ^
>> mm/memory.c:4274:2: error: implicit declaration of function 'update_mmu_cache_range' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
           update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
           ^
   12 warnings and 2 errors generated.


vim +/set_ptes +4271 mm/memory.c

  4135	
  4136	/*
  4137	 * We enter with non-exclusive mmap_lock (to exclude vma changes,
  4138	 * but allow concurrent faults), and pte mapped but not yet locked.
  4139	 * We return with mmap_lock still held, but pte unmapped and unlocked.
  4140	 */
  4141	static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
  4142	{
  4143		bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
  4144		struct vm_area_struct *vma = vmf->vma;
  4145		struct folio *folio;
  4146		vm_fault_t ret = 0;
  4147		pte_t entry;
  4148		int order;
  4149		int pgcount;
  4150		unsigned long addr;
  4151	
  4152		/* File mapping without ->vm_ops ? */
  4153		if (vma->vm_flags & VM_SHARED)
  4154			return VM_FAULT_SIGBUS;
  4155	
  4156		/*
  4157		 * Use pte_alloc() instead of pte_alloc_map().  We can't run
  4158		 * pte_offset_map() on pmds where a huge pmd might be created
  4159		 * from a different thread.
  4160		 *
  4161		 * pte_alloc_map() is safe to use under mmap_write_lock(mm) or when
  4162		 * parallel threads are excluded by other means.
  4163		 *
  4164		 * Here we only have mmap_read_lock(mm).
  4165		 */
  4166		if (pte_alloc(vma->vm_mm, vmf->pmd))
  4167			return VM_FAULT_OOM;
  4168	
  4169		/* See comment in handle_pte_fault() */
  4170		if (unlikely(pmd_trans_unstable(vmf->pmd)))
  4171			return 0;
  4172	
  4173		/* Use the zero-page for reads */
  4174		if (!(vmf->flags & FAULT_FLAG_WRITE) &&
  4175				!mm_forbids_zeropage(vma->vm_mm)) {
  4176			entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
  4177							vma->vm_page_prot));
  4178			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
  4179					vmf->address, &vmf->ptl);
  4180			if (vmf_pte_changed(vmf)) {
  4181				update_mmu_tlb(vma, vmf->address, vmf->pte);
  4182				goto unlock;
  4183			}
  4184			ret = check_stable_address_space(vma->vm_mm);
  4185			if (ret)
  4186				goto unlock;
  4187			/* Deliver the page fault to userland, check inside PT lock */
  4188			if (userfaultfd_missing(vma)) {
  4189				pte_unmap_unlock(vmf->pte, vmf->ptl);
  4190				return handle_userfault(vmf, VM_UFFD_MISSING);
  4191			}
  4192			if (uffd_wp)
  4193				entry = pte_mkuffd_wp(entry);
  4194			set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
  4195	
  4196			/* No need to invalidate - it was non-present before */
  4197			update_mmu_cache(vma, vmf->address, vmf->pte);
  4198			goto unlock;
  4199		}
  4200	
  4201		/*
  4202		 * If allocating a large folio, determine the biggest suitable order for
  4203		 * the VMA (e.g. it must not exceed the VMA's bounds, it must not
  4204		 * overlap with any populated PTEs, etc). We are not under the ptl here
  4205		 * so we will need to re-check that we are not overlapping any populated
  4206		 * PTEs once we have the lock.
  4207		 */
  4208		order = uffd_wp ? 0 : max_anon_folio_order(vma);
  4209		if (order > 0) {
  4210			vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
  4211			order = calc_anon_folio_order_alloc(vmf, order);
  4212			pte_unmap(vmf->pte);
  4213		}
  4214	
  4215		/* Allocate our own private folio. */
  4216		if (unlikely(anon_vma_prepare(vma)))
  4217			goto oom;
  4218		folio = alloc_anon_folio(vma, vmf->address, order);
  4219		if (!folio && order > 0) {
  4220			order = 0;
  4221			folio = alloc_anon_folio(vma, vmf->address, order);
  4222		}
  4223		if (!folio)
  4224			goto oom;
  4225	
  4226		pgcount = 1 << order;
  4227		addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
  4228	
  4229		if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
  4230			goto oom_free_page;
  4231		folio_throttle_swaprate(folio, GFP_KERNEL);
  4232	
  4233		/*
  4234		 * The memory barrier inside __folio_mark_uptodate makes sure that
  4235		 * preceding stores to the folio contents become visible before
  4236		 * the set_ptes() write.
  4237		 */
  4238		__folio_mark_uptodate(folio);
  4239	
  4240		entry = mk_pte(&folio->page, vma->vm_page_prot);
  4241		entry = pte_sw_mkyoung(entry);
  4242		if (vma->vm_flags & VM_WRITE)
  4243			entry = pte_mkwrite(pte_mkdirty(entry));
  4244	
  4245		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
  4246		if (vmf_pte_changed(vmf)) {
  4247			update_mmu_tlb(vma, vmf->address, vmf->pte);
  4248			goto release;
  4249		} else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) {
  4250			goto release;
  4251		}
  4252	
  4253		ret = check_stable_address_space(vma->vm_mm);
  4254		if (ret)
  4255			goto release;
  4256	
  4257		/* Deliver the page fault to userland, check inside PT lock */
  4258		if (userfaultfd_missing(vma)) {
  4259			pte_unmap_unlock(vmf->pte, vmf->ptl);
  4260			folio_put(folio);
  4261			return handle_userfault(vmf, VM_UFFD_MISSING);
  4262		}
  4263	
  4264		folio_ref_add(folio, pgcount - 1);
  4265		add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
  4266		folio_add_new_anon_rmap(folio, vma, addr);
  4267		folio_add_lru_vma(folio, vma);
  4268	
  4269		if (uffd_wp)
  4270			entry = pte_mkuffd_wp(entry);
> 4271		set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
  4272	
  4273		/* No need to invalidate - it was non-present before */
> 4274		update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
  4275	unlock:
  4276		pte_unmap_unlock(vmf->pte, vmf->ptl);
  4277		return ret;
  4278	release:
  4279		folio_put(folio);
  4280		goto unlock;
  4281	oom_free_page:
  4282		folio_put(folio);
  4283	oom:
  4284		return VM_FAULT_OOM;
  4285	}
  4286	
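The two errors above come down to the batch helpers not existing on this base:
set_ptes() and update_mmu_cache_range() are not declared for arch/um in the
tree the robot built against, so the calls at mm/memory.c:4271 and :4274 fail.
As a purely illustrative sketch of what a generic set_ptes() fallback does
(not the real definition; the way the pfn is stepped below is an assumption):

	static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
				    pte_t *ptep, pte_t pte, unsigned int nr)
	{
		for (;;) {
			set_pte_at(mm, addr, ptep, pte);
			if (--nr == 0)
				break;
			ptep++;
			addr += PAGE_SIZE;
			/*
			 * Advance to the next page of the physically
			 * contiguous folio; pte_pgprot() is assumed to be
			 * available here, which is not true for every arch.
			 */
			pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
		}
	}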

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-03 13:53   ` Ryan Roberts
@ 2023-07-03 16:01     ` kernel test robot
  -1 siblings, 0 replies; 167+ messages in thread
From: kernel test robot @ 2023-07-03 16:01 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov,
	Yin Fengwei, David Hildenbrand, Yu Zhao, Catalin Marinas,
	Will Deacon, Anshuman Khandual, Yang Shi
  Cc: llvm, oe-kbuild-all, Linux Memory Management List, Ryan Roberts,
	linux-arm-kernel, linux-kernel

Hi Ryan,

kernel test robot noticed the following build errors:

[auto build test ERROR on arm64/for-next/core]
[also build test ERROR on v6.4]
[cannot apply to akpm-mm/mm-everything linus/master next-20230703]
[If your patch is applied to the wrong git tree, kindly drop us a note.
When submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Non-pmd-mappable-large-folios-for-folio_add_new_anon_rmap/20230703-215627
base:   https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core
patch link:    https://lore.kernel.org/r/20230703135330.1865927-5-ryan.roberts%40arm.com
patch subject: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
config: um-allnoconfig (https://download.01.org/0day-ci/archive/20230703/202307032330.TguyNttt-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
reproduce: (https://download.01.org/0day-ci/archive/20230703/202307032330.TguyNttt-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags:
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202307032330.TguyNttt-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from mm/memory.c:42:
   In file included from include/linux/kernel_stat.h:9:
   In file included from include/linux/interrupt.h:11:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/um/include/asm/hardirq.h:5:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/um/include/asm/io.h:24:
   include/asm-generic/io.h:547:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     547 |         val = __raw_readb(PCI_IOBASE + addr);
         |                           ~~~~~~~~~~ ^
   include/asm-generic/io.h:560:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     560 |         val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
         |                                                         ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu'
      37 | #define __le16_to_cpu(x) ((__force __u16)(__le16)(x))
         |                                                   ^
   In file included from mm/memory.c:42:
   In file included from include/linux/kernel_stat.h:9:
   In file included from include/linux/interrupt.h:11:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/um/include/asm/hardirq.h:5:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/um/include/asm/io.h:24:
   include/asm-generic/io.h:573:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     573 |         val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
         |                                                         ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from macro '__le32_to_cpu'
      35 | #define __le32_to_cpu(x) ((__force __u32)(__le32)(x))
         |                                                   ^
   In file included from mm/memory.c:42:
   In file included from include/linux/kernel_stat.h:9:
   In file included from include/linux/interrupt.h:11:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/um/include/asm/hardirq.h:5:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/um/include/asm/io.h:24:
   include/asm-generic/io.h:584:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     584 |         __raw_writeb(value, PCI_IOBASE + addr);
         |                             ~~~~~~~~~~ ^
   include/asm-generic/io.h:594:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     594 |         __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
         |                                                       ~~~~~~~~~~ ^
   include/asm-generic/io.h:604:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     604 |         __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
         |                                                       ~~~~~~~~~~ ^
   include/asm-generic/io.h:692:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     692 |         readsb(PCI_IOBASE + addr, buffer, count);
         |                ~~~~~~~~~~ ^
   include/asm-generic/io.h:700:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     700 |         readsw(PCI_IOBASE + addr, buffer, count);
         |                ~~~~~~~~~~ ^
   include/asm-generic/io.h:708:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     708 |         readsl(PCI_IOBASE + addr, buffer, count);
         |                ~~~~~~~~~~ ^
   include/asm-generic/io.h:717:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     717 |         writesb(PCI_IOBASE + addr, buffer, count);
         |                 ~~~~~~~~~~ ^
   include/asm-generic/io.h:726:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     726 |         writesw(PCI_IOBASE + addr, buffer, count);
         |                 ~~~~~~~~~~ ^
   include/asm-generic/io.h:735:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     735 |         writesl(PCI_IOBASE + addr, buffer, count);
         |                 ~~~~~~~~~~ ^
>> mm/memory.c:4271:2: error: call to undeclared function 'set_ptes'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    4271 |         set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
         |         ^
   mm/memory.c:4271:2: note: did you mean 'set_pte'?
   arch/um/include/asm/pgtable.h:232:20: note: 'set_pte' declared here
     232 | static inline void set_pte(pte_t *pteptr, pte_t pteval)
         |                    ^
>> mm/memory.c:4274:2: error: call to undeclared function 'update_mmu_cache_range'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    4274 |         update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
         |         ^
   12 warnings and 2 errors generated.


vim +/set_ptes +4271 mm/memory.c

  4135	
  4136	/*
  4137	 * We enter with non-exclusive mmap_lock (to exclude vma changes,
  4138	 * but allow concurrent faults), and pte mapped but not yet locked.
  4139	 * We return with mmap_lock still held, but pte unmapped and unlocked.
  4140	 */
  4141	static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
  4142	{
  4143		bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
  4144		struct vm_area_struct *vma = vmf->vma;
  4145		struct folio *folio;
  4146		vm_fault_t ret = 0;
  4147		pte_t entry;
  4148		int order;
  4149		int pgcount;
  4150		unsigned long addr;
  4151	
  4152		/* File mapping without ->vm_ops ? */
  4153		if (vma->vm_flags & VM_SHARED)
  4154			return VM_FAULT_SIGBUS;
  4155	
  4156		/*
  4157		 * Use pte_alloc() instead of pte_alloc_map().  We can't run
  4158		 * pte_offset_map() on pmds where a huge pmd might be created
  4159		 * from a different thread.
  4160		 *
  4161		 * pte_alloc_map() is safe to use under mmap_write_lock(mm) or when
  4162		 * parallel threads are excluded by other means.
  4163		 *
  4164		 * Here we only have mmap_read_lock(mm).
  4165		 */
  4166		if (pte_alloc(vma->vm_mm, vmf->pmd))
  4167			return VM_FAULT_OOM;
  4168	
  4169		/* See comment in handle_pte_fault() */
  4170		if (unlikely(pmd_trans_unstable(vmf->pmd)))
  4171			return 0;
  4172	
  4173		/* Use the zero-page for reads */
  4174		if (!(vmf->flags & FAULT_FLAG_WRITE) &&
  4175				!mm_forbids_zeropage(vma->vm_mm)) {
  4176			entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
  4177							vma->vm_page_prot));
  4178			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
  4179					vmf->address, &vmf->ptl);
  4180			if (vmf_pte_changed(vmf)) {
  4181				update_mmu_tlb(vma, vmf->address, vmf->pte);
  4182				goto unlock;
  4183			}
  4184			ret = check_stable_address_space(vma->vm_mm);
  4185			if (ret)
  4186				goto unlock;
  4187			/* Deliver the page fault to userland, check inside PT lock */
  4188			if (userfaultfd_missing(vma)) {
  4189				pte_unmap_unlock(vmf->pte, vmf->ptl);
  4190				return handle_userfault(vmf, VM_UFFD_MISSING);
  4191			}
  4192			if (uffd_wp)
  4193				entry = pte_mkuffd_wp(entry);
  4194			set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
  4195	
  4196			/* No need to invalidate - it was non-present before */
  4197			update_mmu_cache(vma, vmf->address, vmf->pte);
  4198			goto unlock;
  4199		}
  4200	
  4201		/*
  4202		 * If allocating a large folio, determine the biggest suitable order for
  4203		 * the VMA (e.g. it must not exceed the VMA's bounds, it must not
  4204		 * overlap with any populated PTEs, etc). We are not under the ptl here
  4205		 * so we will need to re-check that we are not overlapping any populated
  4206		 * PTEs once we have the lock.
  4207		 */
  4208		order = uffd_wp ? 0 : max_anon_folio_order(vma);
  4209		if (order > 0) {
  4210			vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
  4211			order = calc_anon_folio_order_alloc(vmf, order);
  4212			pte_unmap(vmf->pte);
  4213		}
  4214	
  4215		/* Allocate our own private folio. */
  4216		if (unlikely(anon_vma_prepare(vma)))
  4217			goto oom;
  4218		folio = alloc_anon_folio(vma, vmf->address, order);
  4219		if (!folio && order > 0) {
  4220			order = 0;
  4221			folio = alloc_anon_folio(vma, vmf->address, order);
  4222		}
  4223		if (!folio)
  4224			goto oom;
  4225	
  4226		pgcount = 1 << order;
  4227		addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
  4228	
  4229		if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
  4230			goto oom_free_page;
  4231		folio_throttle_swaprate(folio, GFP_KERNEL);
  4232	
  4233		/*
  4234		 * The memory barrier inside __folio_mark_uptodate makes sure that
  4235		 * preceding stores to the folio contents become visible before
  4236		 * the set_ptes() write.
  4237		 */
  4238		__folio_mark_uptodate(folio);
  4239	
  4240		entry = mk_pte(&folio->page, vma->vm_page_prot);
  4241		entry = pte_sw_mkyoung(entry);
  4242		if (vma->vm_flags & VM_WRITE)
  4243			entry = pte_mkwrite(pte_mkdirty(entry));
  4244	
  4245		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
  4246		if (vmf_pte_changed(vmf)) {
  4247			update_mmu_tlb(vma, vmf->address, vmf->pte);
  4248			goto release;
  4249		} else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) {
  4250			goto release;
  4251		}
  4252	
  4253		ret = check_stable_address_space(vma->vm_mm);
  4254		if (ret)
  4255			goto release;
  4256	
  4257		/* Deliver the page fault to userland, check inside PT lock */
  4258		if (userfaultfd_missing(vma)) {
  4259			pte_unmap_unlock(vmf->pte, vmf->ptl);
  4260			folio_put(folio);
  4261			return handle_userfault(vmf, VM_UFFD_MISSING);
  4262		}
  4263	
  4264		folio_ref_add(folio, pgcount - 1);
  4265		add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
  4266		folio_add_new_anon_rmap(folio, vma, addr);
  4267		folio_add_lru_vma(folio, vma);
  4268	
  4269		if (uffd_wp)
  4270			entry = pte_mkuffd_wp(entry);
> 4271		set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
  4272	
  4273		/* No need to invalidate - it was non-present before */
> 4274		update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
  4275	unlock:
  4276		pte_unmap_unlock(vmf->pte, vmf->ptl);
  4277		return ret;
  4278	release:
  4279		folio_put(folio);
  4280		goto unlock;
  4281	oom_free_page:
  4282		folio_put(folio);
  4283	oom:
  4284		return VM_FAULT_OOM;
  4285	}
  4286	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 1/5] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
  2023-07-03 13:53   ` Ryan Roberts
@ 2023-07-03 19:05     ` Yu Zhao
  -1 siblings, 0 replies; 167+ messages in thread
From: Yu Zhao @ 2023-07-03 19:05 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> In preparation for FLEXIBLE_THP support, improve
> folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be
> passed to it. In this case, all contained pages are accounted using the
> "small" pages scheme.

Nit: In this case, all *subpages* are accounted using the *order-0
folio* (or base page) scheme.

> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>

Reviewed-by: Yu Zhao <yuzhao@google.com>

>  mm/rmap.c | 26 +++++++++++++++++++-------
>  1 file changed, 19 insertions(+), 7 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 1d8369549424..82ef5ba363d1 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1278,31 +1278,43 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
>   * This means the inc-and-test can be bypassed.
>   * The folio does not have to be locked.
>   *
> - * If the folio is large, it is accounted as a THP.  As the folio
> + * If the folio is pmd-mappable, it is accounted as a THP.  As the folio
>   * is new, it's assumed to be mapped exclusively by a single process.
>   */
>  void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>                 unsigned long address)
>  {
> -       int nr;
> +       int nr = folio_nr_pages(folio);
> +       int i;
> +       struct page *page;
>
> -       VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
> +       VM_BUG_ON_VMA(address < vma->vm_start ||
> +                       address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
>         __folio_set_swapbacked(folio);
>
> -       if (likely(!folio_test_pmd_mappable(folio))) {
> +       if (!folio_test_large(folio)) {
>                 /* increment count (starts at -1) */
>                 atomic_set(&folio->_mapcount, 0);
> -               nr = 1;
> +               __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
> +       } else if (!folio_test_pmd_mappable(folio)) {
> +               /* increment count (starts at 0) */
> +               atomic_set(&folio->_nr_pages_mapped, nr);
> +
> +               page = &folio->page;
> +               for (i = 0; i < nr; i++, page++, address += PAGE_SIZE) {
> +                       /* increment count (starts at -1) */
> +                       atomic_set(&page->_mapcount, 0);
> +                       __page_set_anon_rmap(folio, page, vma, address, 1);
> +               }

Nit: use folio_page(), e.g.,

  } else if (!folio_test_pmd_mappable(folio)) {
    int i;

    for (i = 0; i < nr; i++) {
      struct page *page = folio_page(folio, i);

      /* increment count (starts at -1) */
      atomic_set(&page->_mapcount, 0);
      __page_set_anon_rmap(folio, page, vma, address + PAGE_SIZE * i, 1);
    }
    /* increment count (starts at 0) */
    atomic_set(&folio->_nr_pages_mapped, nr);
  } else {

>         } else {
>                 /* increment count (starts at -1) */
>                 atomic_set(&folio->_entire_mapcount, 0);
>                 atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);
> -               nr = folio_nr_pages(folio);
>                 __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
> +               __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>         }
>
>         __lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
> -       __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>  }

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
  2023-07-03 13:53   ` Ryan Roberts
@ 2023-07-03 19:50     ` Yu Zhao
  -1 siblings, 0 replies; 167+ messages in thread
From: Yu Zhao @ 2023-07-03 19:50 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> arch_wants_pte_order() can be overridden by the arch to return the
> preferred folio order for pte-mapped memory. This is useful as some
> architectures (e.g. arm64) can coalesce TLB entries when the physical
> memory is suitably contiguous.
>
> The first user for this hint will be FLEXIBLE_THP, which aims to
> allocate large folios for anonymous memory to reduce page faults and
> other per-page operation costs.
>
> Here we add the default implementation of the function, used when the
> architecture does not define it, which returns the order corresponding
> to 64K.

I don't really mind a non-zero default value. But people would ask why
non-zero and why 64KB. Probably you could argue this is the largest size
all known archs support if they have TLB coalescing. For x86, AMD CPUs
would want to override this. I'll leave it to Fengwei to decide
whether Intel wants a different default value.

Also I don't like the vma parameter because it makes
arch_wants_pte_order() a mix of hw preference and vma policy. From my
POV, the function should be only about the former; the latter should
be decided by arch-independent MM code. However, I can live with it if
ARM MM people think this is really what you want. ATM, I'm skeptical
they do.
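
For illustration, a minimal sketch of the split described above -- the arch
hook reporting only the hardware preference, with the hinting policy kept in
arch-independent MM code. The anon_folio_order() name is invented for this
sketch and is not part of the posted series:

static inline int arch_wants_pte_order(void)
{
        /* h/w preference only: the largest size TLB coalescing can exploit */
        return ilog2(SZ_64K >> PAGE_SHIFT);
}

static int anon_folio_order(struct vm_area_struct *vma)
{
        /* s/w policy, decided by core MM rather than the arch */
        if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
                return 0;

        return arch_wants_pte_order();
}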

> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>

After another CPU vendor, e.g., Fengwei, and an ARM MM person, e.g.,
Will, give the green light:
Reviewed-by: Yu Zhao <yuzhao@google.com>

> ---
>  include/linux/pgtable.h | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index a661a17173fa..f7e38598f20b 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -13,6 +13,7 @@
>  #include <linux/errno.h>
>  #include <asm-generic/pgtable_uffd.h>
>  #include <linux/page_table_check.h>
> +#include <linux/sizes.h>
>
>  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
>         defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void)
>  }
>  #endif
>
> +#ifndef arch_wants_pte_order
> +/*
> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios

The warning is helpful.

> + * to be at least order-2.
> + */
> +static inline int arch_wants_pte_order(struct vm_area_struct *vma)
> +{
> +       return ilog2(SZ_64K >> PAGE_SHIFT);
> +}
> +#endif
> +
>  #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>                                        unsigned long address,

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 5/5] arm64: mm: Override arch_wants_pte_order()
  2023-07-03 13:53   ` Ryan Roberts
@ 2023-07-03 20:02     ` Yu Zhao
  -1 siblings, 0 replies; 167+ messages in thread
From: Yu Zhao @ 2023-07-03 20:02 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Define an arch-specific override of arch_wants_pte_order() so that when
> FLEXIBLE_THP is enabled, large folios will be allocated for anonymous
> memory with an order that is compatible with arm64's contpte mappings.
>
> arch_wants_pte_order() returns an order according to the following
> policy: For the unhinted case, when THP is not requested for the vma,
> don't allow anything bigger than 64K. This means we don't waste too much
> memory. Additionally, for 4K pages this is the contpte size, and for
> 16K, this is (usually) the HPA size when the uarch feature is
> implemented. For the hinted case, when THP is requested for the vma,
> allow the contpte size for all page size configurations; 64K for 4K, 2M
> for 16K and 2M for 64K.
>
> Additionally, the THP and NOTHP order constants are defined using
> Kconfig so it is possible to override them at build time.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  arch/arm64/Kconfig               | 12 ++++++++++++
>  arch/arm64/include/asm/pgtable.h |  4 ++++
>  arch/arm64/mm/mmu.c              |  8 ++++++++
>  3 files changed, 24 insertions(+)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 343e1e1cae10..689c5bf13dc1 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -281,6 +281,18 @@ config ARM64_CONT_PMD_SHIFT
>         default 5 if ARM64_16K_PAGES
>         default 4
>
> +config ARM64_PTE_ORDER_NOTHP
> +       int
> +       default 0 if ARM64_64K_PAGES    # 64K (1 page)
> +       default 2 if ARM64_16K_PAGES    # 64K (4 pages; benefits from HPA where HW supports it)
> +       default 4 if ARM64_4K_PAGES     # 64K (16 pages; eligible for contpte-mapping)
> +
> +config ARM64_PTE_ORDER_THP
> +       int
> +       default 5 if ARM64_64K_PAGES    # 2M  (32 pages; eligible for contpte-mapping)
> +       default 7 if ARM64_16K_PAGES    # 2M  (128 pages; eligible for contpte-mapping)
> +       default 4 if ARM64_4K_PAGES     # 64K (16 pages; eligible for contpte-mapping)
> +
>  config ARCH_MMAP_RND_BITS_MIN
>         default 14 if ARM64_64K_PAGES
>         default 16 if ARM64_16K_PAGES
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 6fd012663a01..8463d5f9f307 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1117,6 +1117,10 @@ extern pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
>  extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
>                                     unsigned long addr, pte_t *ptep,
>                                     pte_t old_pte, pte_t new_pte);
> +
> +#define arch_wants_pte_order arch_wants_pte_order
> +extern int arch_wants_pte_order(struct vm_area_struct *vma);
> +
>  #endif /* !__ASSEMBLY__ */
>
>  #endif /* __ASM_PGTABLE_H */
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index af6bc8403ee4..8556c4a9b507 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1481,3 +1481,11 @@ void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte
>  {
>         set_pte_at(vma->vm_mm, addr, ptep, pte);
>  }
> +
> +int arch_wants_pte_order(struct vm_area_struct *vma)
> +{
> +       if (hugepage_vma_check(vma, vma->vm_flags, false, true, true))
> +               return CONFIG_ARM64_PTE_ORDER_THP;
> +       else
> +               return CONFIG_ARM64_PTE_ORDER_NOTHP;
> +}

I don't really like this because it's a mix of h/w preference and s/w
policy -- from my POV, it's supposed to be the former only. The policy
part should be left to core MM (arch-independent).

That being said, no objection if ARM MM people think this is really
what they want.
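
As a quick sanity check of the Kconfig orders quoted above (a folio of order
N spans PAGE_SIZE << N bytes), here is a standalone userspace sketch -- not
kernel code -- that prints the resulting folio sizes:

#include <stdio.h>

int main(void)
{
        unsigned long page_size[] = { 4096, 16384, 65536 };
        int nothp[] = { 4, 2, 0 };      /* ARM64_PTE_ORDER_NOTHP per page size */
        int thp[]   = { 4, 7, 5 };      /* ARM64_PTE_ORDER_THP per page size */

        for (int i = 0; i < 3; i++)
                printf("%luK pages: NOTHP folio %luK, THP folio %luK\n",
                       page_size[i] >> 10,
                       (page_size[i] << nothp[i]) >> 10,
                       (page_size[i] << thp[i]) >> 10);
        return 0;
}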

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-03 13:53   ` Ryan Roberts
@ 2023-07-04  1:35     ` Yu Zhao
  -1 siblings, 0 replies; 167+ messages in thread
From: Yu Zhao @ 2023-07-04  1:35 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

[-- Attachment #1: Type: text/plain, Size: 13988 bytes --]

On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
> allocated in large folios of a specified order. All pages of the large
> folio are pte-mapped during the same page fault, significantly reducing
> the number of page faults. The number of per-page operations (e.g. ref
> counting, rmap management lru list management) are also significantly
> reduced since those ops now become per-folio.
>
> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
> defaults to disabled for now; there is a long list of todos to make
> FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some
> madvise ops, etc). These items will be tackled in subsequent patches.
>
> When enabled, the preferred folio order is as returned by
> arch_wants_pte_order(), which may be overridden by the arch as it sees
> fit. Some architectures (e.g. arm64) can coalsece TLB entries if a

coalesce

> contiguous set of ptes map physically contigious, naturally aligned

contiguous

> memory, so this mechanism allows the architecture to optimize as
> required.
>
> If the preferred order can't be used (e.g. because the folio would
> breach the bounds of the vma, or because ptes in the region are already
> mapped) then we fall back to a suitable lower order.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  mm/Kconfig  |  10 ++++
>  mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 165 insertions(+), 13 deletions(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 7672a22647b4..1c06b2c0a24e 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS
>           support of file THPs will be developed in the next few release
>           cycles.
>
> +config FLEXIBLE_THP
> +       bool "Flexible order THP"
> +       depends on TRANSPARENT_HUGEPAGE
> +       default n

The default value is already N.

> +       help
> +         Use large (bigger than order-0) folios to back anonymous memory where
> +         possible, even if the order of the folio is smaller than the PMD
> +         order. This reduces the number of page faults, as well as other
> +         per-page overheads to improve performance for many workloads.
> +
>  endif # TRANSPARENT_HUGEPAGE
>
>  #
> diff --git a/mm/memory.c b/mm/memory.c
> index fb30f7523550..abe2ea94f3f5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
>         return 0;
>  }
>
> +#ifdef CONFIG_FLEXIBLE_THP
> +/*
> + * Allocates, zeros and returns a folio of the requested order for use as
> + * anonymous memory.
> + */
> +static struct folio *alloc_anon_folio(struct vm_area_struct *vma,
> +                                     unsigned long addr, int order)
> +{
> +       gfp_t gfp;
> +       struct folio *folio;
> +
> +       if (order == 0)
> +               return vma_alloc_zeroed_movable_folio(vma, addr);
> +
> +       gfp = vma_thp_gfp_mask(vma);
> +       folio = vma_alloc_folio(gfp, order, vma, addr, true);
> +       if (folio)
> +               clear_huge_page(&folio->page, addr, folio_nr_pages(folio));
> +
> +       return folio;
> +}
> +
> +/*
> + * Preferred folio order to allocate for anonymous memory.
> + */
> +#define max_anon_folio_order(vma)      arch_wants_pte_order(vma)
> +#else
> +#define alloc_anon_folio(vma, addr, order) \
> +                               vma_alloc_zeroed_movable_folio(vma, addr)
> +#define max_anon_folio_order(vma)      0
> +#endif
> +
> +/*
> + * Returns index of first pte that is not none, or nr if all are none.
> + */
> +static inline int check_ptes_none(pte_t *pte, int nr)
> +{
> +       int i;
> +
> +       for (i = 0; i < nr; i++) {
> +               if (!pte_none(ptep_get(pte++)))
> +                       return i;
> +       }
> +
> +       return nr;
> +}
> +
> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
> +{
> +       /*
> +        * The aim here is to determine what size of folio we should allocate
> +        * for this fault. Factors include:
> +        * - Order must not be higher than `order` upon entry
> +        * - Folio must be naturally aligned within VA space
> +        * - Folio must be fully contained inside one pmd entry
> +        * - Folio must not breach boundaries of vma
> +        * - Folio must not overlap any non-none ptes
> +        *
> +        * Additionally, we do not allow order-1 since this breaks assumptions
> +        * elsewhere in the mm; THP pages must be at least order-2 (since they
> +        * store state up to the 3rd struct page subpage), and these pages must
> +        * be THP in order to correctly use pre-existing THP infrastructure such
> +        * as folio_split().
> +        *
> +        * Note that the caller may or may not choose to lock the pte. If
> +        * unlocked, the result is racy and the user must re-check any overlap
> +        * with non-none ptes under the lock.
> +        */
> +
> +       struct vm_area_struct *vma = vmf->vma;
> +       int nr;
> +       unsigned long addr;
> +       pte_t *pte;
> +       pte_t *first_set = NULL;
> +       int ret;
> +
> +       order = min(order, PMD_SHIFT - PAGE_SHIFT);
> +
> +       for (; order > 1; order--) {

I'm not sure how we can justify this policy. As an initial step, it'd
be a lot easier to sell if we only considered the order of
arch_wants_pte_order() and the order 0.

> +               nr = 1 << order;
> +               addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
> +               pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
> +
> +               /* Check vma bounds. */
> +               if (addr < vma->vm_start ||
> +                   addr + (nr << PAGE_SHIFT) > vma->vm_end)
> +                       continue;
> +
> +               /* Ptes covered by order already known to be none. */
> +               if (pte + nr <= first_set)
> +                       break;
> +
> +               /* Already found set pte in range covered by order. */
> +               if (pte <= first_set)
> +                       continue;
> +
> +               /* Need to check if all the ptes are none. */
> +               ret = check_ptes_none(pte, nr);
> +               if (ret == nr)
> +                       break;
> +
> +               first_set = pte + ret;
> +       }
> +
> +       if (order == 1)
> +               order = 0;
> +
> +       return order;
> +}

Everything above can be simplified into two helpers:
vmf_pte_range_changed() and alloc_anon_folio() (or whatever names you
prefer). Details below.
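
One possible shape for the vmf_pte_range_changed() half of that suggestion,
as a sketch only: it assumes vmf->pte already points at the first pte of the
candidate range and that nr_pages == 1 keeps the existing vmf_pte_changed()
behaviour:

static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
{
        int i;

        if (nr_pages == 1)
                return vmf_pte_changed(vmf);

        /* For a large folio, any populated pte means the range changed. */
        for (i = 0; i < nr_pages; i++) {
                if (!pte_none(ptep_get(vmf->pte + i)))
                        return true;
        }

        return false;
}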

>  /*
>   * Handle write page faults for pages that can be reused in the current vma
>   *
> @@ -3073,7 +3183,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>                 goto oom;
>
>         if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
> -               new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> +               new_folio = alloc_anon_folio(vma, vmf->address, 0);

This seems unnecessary for now. Later on, we could fill in an aligned
area with multiple write-protected zero pages during a read fault and
then replace them with a large folio here.

>                 if (!new_folio)
>                         goto oom;
>         } else {
> @@ -4040,6 +4150,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>         struct folio *folio;
>         vm_fault_t ret = 0;
>         pte_t entry;
> +       int order;
> +       int pgcount;
> +       unsigned long addr;
>
>         /* File mapping without ->vm_ops ? */
>         if (vma->vm_flags & VM_SHARED)
> @@ -4081,24 +4194,51 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>                         pte_unmap_unlock(vmf->pte, vmf->ptl);
>                         return handle_userfault(vmf, VM_UFFD_MISSING);
>                 }
> -               goto setpte;
> +               if (uffd_wp)
> +                       entry = pte_mkuffd_wp(entry);
> +               set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> +
> +               /* No need to invalidate - it was non-present before */
> +               update_mmu_cache(vma, vmf->address, vmf->pte);
> +               goto unlock;
> +       }

Not really needed IMO. Details below.

===

> +       /*
> +        * If allocating a large folio, determine the biggest suitable order for
> +        * the VMA (e.g. it must not exceed the VMA's bounds, it must not
> +        * overlap with any populated PTEs, etc). We are not under the ptl here
> +        * so we will need to re-check that we are not overlapping any populated
> +        * PTEs once we have the lock.
> +        */
> +       order = uffd_wp ? 0 : max_anon_folio_order(vma);
> +       if (order > 0) {
> +               vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
> +               order = calc_anon_folio_order_alloc(vmf, order);
> +               pte_unmap(vmf->pte);
>         }

===

The section above together with the section below should be wrapped in a helper.

> -       /* Allocate our own private page. */
> +       /* Allocate our own private folio. */
>         if (unlikely(anon_vma_prepare(vma)))
>                 goto oom;

===

> -       folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> +       folio = alloc_anon_folio(vma, vmf->address, order);
> +       if (!folio && order > 0) {
> +               order = 0;
> +               folio = alloc_anon_folio(vma, vmf->address, order);
> +       }

===

One helper returns a folio of order arch_wants_pte_order(), or order 0
if it fails to allocate that order, e.g.,

folio = alloc_anon_folio(vmf);

And if vmf_orig_pte_uffd_wp(vmf) is true, the helper allocates order 0
regardless of arch_wants_pte_order(). Upon success, it can update
vmf->address, since if we run into a race with another PF, we exit the
fault handler and retry anyway.
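
A sketch of how such a helper could look. Assumptions: it only tries
arch_wants_pte_order() or order 0 (as suggested earlier), it skips the
VMA-bounds and pte-none checks for brevity, and it rounds vmf->address down
to the folio boundary on success; this is not the exact code from the series:

static struct folio *alloc_anon_folio(struct vm_fault *vmf)
{
        struct vm_area_struct *vma = vmf->vma;
        struct folio *folio;
        int order = 0;

        /* uffd-wp wants per-pte markers, so stick to a single page. */
        if (!vmf_orig_pte_uffd_wp(vmf))
                order = arch_wants_pte_order(vma);

        if (order > 0) {
                gfp_t gfp = vma_thp_gfp_mask(vma);
                unsigned long addr = ALIGN_DOWN(vmf->address,
                                                PAGE_SIZE << order);

                folio = vma_alloc_folio(gfp, order, vma, addr, true);
                if (folio) {
                        clear_huge_page(&folio->page, addr, 1 << order);
                        vmf->address = addr;
                        return folio;
                }
        }

        /* Fall back directly to order 0. */
        return vma_alloc_zeroed_movable_folio(vma, vmf->address);
}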

>         if (!folio)
>                 goto oom;
>
> +       pgcount = 1 << order;
> +       addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);

As shown above, the helper already updates vmf->address. And mm/ never
used pgcount before -- the convention is nr_pages = folio_nr_pages().

>         if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>                 goto oom_free_page;
>         folio_throttle_swaprate(folio, GFP_KERNEL);
>
>         /*
>          * The memory barrier inside __folio_mark_uptodate makes sure that
> -        * preceding stores to the page contents become visible before
> -        * the set_pte_at() write.
> +        * preceding stores to the folio contents become visible before
> +        * the set_ptes() write.

We don't have set_ptes() yet.

>          */
>         __folio_mark_uptodate(folio);
>
> @@ -4107,11 +4247,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>         if (vma->vm_flags & VM_WRITE)
>                 entry = pte_mkwrite(pte_mkdirty(entry));
>
> -       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> -                       &vmf->ptl);
> +       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>         if (vmf_pte_changed(vmf)) {
>                 update_mmu_tlb(vma, vmf->address, vmf->pte);
>                 goto release;
> +       } else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) {
> +               goto release;
>         }

Need new helper:

  if (vmf_pte_range_changed(vmf, nr_pages)) {
    for (i = 0; i < nr_pages; i++)
      update_mmu_tlb(vma, vmf->address + PAGE_SIZE * i, vmf->pte + i);
    goto release;
  }

(It should be fine to call update_mmu_tlb() even if it's not really necessary.)

>         ret = check_stable_address_space(vma->vm_mm);
> @@ -4125,16 +4266,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>                 return handle_userfault(vmf, VM_UFFD_MISSING);
>         }
>
> -       inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> -       folio_add_new_anon_rmap(folio, vma, vmf->address);
> +       folio_ref_add(folio, pgcount - 1);
> +       add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
> +       folio_add_new_anon_rmap(folio, vma, addr);
>         folio_add_lru_vma(folio, vma);
> -setpte:
> +
>         if (uffd_wp)
>                 entry = pte_mkuffd_wp(entry);
> -       set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> +       set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);

We would have to do it one by one for now.

>         /* No need to invalidate - it was non-present before */
> -       update_mmu_cache(vma, vmf->address, vmf->pte);
> +       update_mmu_cache_range(vma, addr, vmf->pte, pgcount);

Ditto.

How about this (by moving mk_pte() and its friends here):
...
        folio_add_lru_vma(folio, vma);

        for (i = 0; i < nr_pages; i++) {
                entry = mk_pte(folio_page(folio, i), vma->vm_page_prot);
                entry = pte_sw_mkyoung(entry);
                if (vma->vm_flags & VM_WRITE)
                        entry = pte_mkwrite(pte_mkdirty(entry));
setpte:
                if (uffd_wp)
                        entry = pte_mkuffd_wp(entry);
                set_pte_at(vma->vm_mm, vmf->address + PAGE_SIZE * i,
                           vmf->pte + i, entry);

                /* No need to invalidate - it was non-present before */
                update_mmu_cache(vma, vmf->address + PAGE_SIZE * i,
                                 vmf->pte + i);
        }

>  unlock:
>         pte_unmap_unlock(vmf->pte, vmf->ptl);
>         return ret;

Attaching a small patch in case anything above is not clear. Please
take a look. Thanks.

[-- Attachment #2: anon_folios.patch --]
[-- Type: text/x-patch, Size: 2658 bytes --]

diff --git a/mm/memory.c b/mm/memory.c
index 40a269457c8b..04fdb8529f68 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4063,6 +4063,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
  */
 static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 {
+	int i = 0;
+	int nr_pages = 1;
 	bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio;
@@ -4107,10 +4109,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	/* Allocate our own private page. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+	folio = alloc_anon_folio(vmf); // updates vmf->address accordingly
 	if (!folio)
 		goto oom;
 
+	nr_pages = folio_nr_pages(folio);
+
 	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
 		goto oom_free_page;
 	folio_throttle_swaprate(folio, GFP_KERNEL);
@@ -4122,17 +4126,13 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	 */
 	__folio_mark_uptodate(folio);
 
-	entry = mk_pte(&folio->page, vma->vm_page_prot);
-	entry = pte_sw_mkyoung(entry);
-	if (vma->vm_flags & VM_WRITE)
-		entry = pte_mkwrite(pte_mkdirty(entry));
-
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
 			&vmf->ptl);
 	if (!vmf->pte)
 		goto release;
-	if (vmf_pte_changed(vmf)) {
-		update_mmu_tlb(vma, vmf->address, vmf->pte);
+	if (vmf_pte_range_changed(vmf, nr_pages)) {
+		for (i = 0; i < nr_pages; i++)
+			update_mmu_tlb(vma, vmf->address + PAGE_SIZE * i, vmf->pte + i);
 		goto release;
 	}
 
@@ -4147,16 +4147,24 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		return handle_userfault(vmf, VM_UFFD_MISSING);
 	}
 
-	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
+	folio_ref_add(folio, nr_pages - 1);
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
 	folio_add_new_anon_rmap(folio, vma, vmf->address);
 	folio_add_lru_vma(folio, vma);
+
+	for (i = 0; i < nr_pages; i++) {
+		entry = mk_pte(folio_page(folio, i), vma->vm_page_prot);
+		entry = pte_sw_mkyoung(entry);
+		if (vma->vm_flags & VM_WRITE)
+			entry = pte_mkwrite(pte_mkdirty(entry));
 setpte:
-	if (uffd_wp)
-		entry = pte_mkuffd_wp(entry);
-	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+		if (uffd_wp)
+			entry = pte_mkuffd_wp(entry);
+		set_pte_at(vma->vm_mm, vmf->address + PAGE_SIZE * i, vmf->pte + i, entry);
 
-	/* No need to invalidate - it was non-present before */
-	update_mmu_cache(vma, vmf->address, vmf->pte);
+		/* No need to invalidate - it was non-present before */
+		update_mmu_cache(vma, vmf->address + PAGE_SIZE * i, vmf->pte + i);
+	}
 unlock:
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);

^ permalink raw reply related	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 1/5] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
  2023-07-03 19:05     ` Yu Zhao
@ 2023-07-04  2:13       ` Yin, Fengwei
  -1 siblings, 0 replies; 167+ messages in thread
From: Yin, Fengwei @ 2023-07-04  2:13 UTC (permalink / raw)
  To: Yu Zhao, Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm



On 7/4/2023 3:05 AM, Yu Zhao wrote:
> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> In preparation for FLEXIBLE_THP support, improve
>> folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be
>> passed to it. In this case, all contained pages are accounted using the
>> "small" pages scheme.
> 
> Nit: In this case, all *subpages*  are accounted using the *order-0
> folio* (or base page) scheme.
Matthew suggested not to use "subpage" with folio; use "page" with folio instead:
https://lore.kernel.org/linux-mm/Y9qiS%2FIxZOMx62t6@casper.infradead.org/

> 
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> 
> Reviewed-by: Yu Zhao <yuzhao@google.com>
> 
>>  mm/rmap.c | 26 +++++++++++++++++++-------
>>  1 file changed, 19 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 1d8369549424..82ef5ba363d1 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1278,31 +1278,43 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
>>   * This means the inc-and-test can be bypassed.
>>   * The folio does not have to be locked.
>>   *
>> - * If the folio is large, it is accounted as a THP.  As the folio
>> + * If the folio is pmd-mappable, it is accounted as a THP.  As the folio
>>   * is new, it's assumed to be mapped exclusively by a single process.
>>   */
>>  void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>>                 unsigned long address)
>>  {
>> -       int nr;
>> +       int nr = folio_nr_pages(folio);
>> +       int i;
>> +       struct page *page;
>>
>> -       VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
>> +       VM_BUG_ON_VMA(address < vma->vm_start ||
>> +                       address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
>>         __folio_set_swapbacked(folio);
>>
>> -       if (likely(!folio_test_pmd_mappable(folio))) {
>> +       if (!folio_test_large(folio)) {
>>                 /* increment count (starts at -1) */
>>                 atomic_set(&folio->_mapcount, 0);
>> -               nr = 1;
>> +               __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>> +       } else if (!folio_test_pmd_mappable(folio)) {
>> +               /* increment count (starts at 0) */
>> +               atomic_set(&folio->_nr_pages_mapped, nr);
>> +
>> +               page = &folio->page;
>> +               for (i = 0; i < nr; i++, page++, address += PAGE_SIZE) {
>> +                       /* increment count (starts at -1) */
>> +                       atomic_set(&page->_mapcount, 0);
>> +                       __page_set_anon_rmap(folio, page, vma, address, 1);
>> +               }
> 
> Nit: use folio_page(), e.g.,
> 
>   } else if (!folio_test_pmd_mappable(folio)) {
>     int i;
> 
>     for (i = 0; i < nr; i++) {
>       struct page *page = folio_page(folio, i);
> 
>       /* increment count (starts at -1) */
>       atomic_set(&page->_mapcount, 0);
>       __page_set_anon_rmap(folio, page, vma, address + PAGE_SIZE * i, 1);
>     }
>     /* increment count (starts at 0) */
>     atomic_set(&folio->_nr_pages_mapped, nr);
>   } else {
> 
>>         } else {
>>                 /* increment count (starts at -1) */
>>                 atomic_set(&folio->_entire_mapcount, 0);
>>                 atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);
>> -               nr = folio_nr_pages(folio);
>>                 __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
>> +               __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>>         }
>>
>>         __lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
>> -       __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>>  }

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 1/5] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
  2023-07-03 13:53   ` Ryan Roberts
@ 2023-07-04  2:14     ` Yin, Fengwei
  -1 siblings, 0 replies; 167+ messages in thread
From: Yin, Fengwei @ 2023-07-04  2:14 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi
  Cc: linux-arm-kernel, linux-kernel, linux-mm



On 7/3/2023 9:53 PM, Ryan Roberts wrote:
> In preparation for FLEXIBLE_THP support, improve
> folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be
> passed to it. In this case, all contained pages are accounted using the
> "small" pages scheme.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Yin, Fengwei <fengwei.yin@intel.com>

> ---
>  mm/rmap.c | 26 +++++++++++++++++++-------
>  1 file changed, 19 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 1d8369549424..82ef5ba363d1 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1278,31 +1278,43 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
>   * This means the inc-and-test can be bypassed.
>   * The folio does not have to be locked.
>   *
> - * If the folio is large, it is accounted as a THP.  As the folio
> + * If the folio is pmd-mappable, it is accounted as a THP.  As the folio
>   * is new, it's assumed to be mapped exclusively by a single process.
>   */
>  void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>  		unsigned long address)
>  {
> -	int nr;
> +	int nr = folio_nr_pages(folio);
> +	int i;
> +	struct page *page;
>  
> -	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
> +	VM_BUG_ON_VMA(address < vma->vm_start ||
> +			address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
>  	__folio_set_swapbacked(folio);
>  
> -	if (likely(!folio_test_pmd_mappable(folio))) {
> +	if (!folio_test_large(folio)) {
>  		/* increment count (starts at -1) */
>  		atomic_set(&folio->_mapcount, 0);
> -		nr = 1;
> +		__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
> +	} else if (!folio_test_pmd_mappable(folio)) {
> +		/* increment count (starts at 0) */
> +		atomic_set(&folio->_nr_pages_mapped, nr);
> +
> +		page = &folio->page;
> +		for (i = 0; i < nr; i++, page++, address += PAGE_SIZE) {
> +			/* increment count (starts at -1) */
> +			atomic_set(&page->_mapcount, 0);
> +			__page_set_anon_rmap(folio, page, vma, address, 1);
> +		}
>  	} else {
>  		/* increment count (starts at -1) */
>  		atomic_set(&folio->_entire_mapcount, 0);
>  		atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);
> -		nr = folio_nr_pages(folio);
>  		__lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
> +		__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>  	}
>  
>  	__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
> -	__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>  }
>  
>  /**

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
  2023-07-03 13:53 ` Ryan Roberts
@ 2023-07-04  2:18   ` Yu Zhao
  -1 siblings, 0 replies; 167+ messages in thread
From: Yu Zhao @ 2023-07-04  2:18 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi All,
>
> This is v2 of a series to implement variable order, large folios for anonymous
> memory. The objective of this is to improve performance by allocating larger
> chunks of memory during anonymous page faults. See [1] for background.

Thanks for the quick response!

> I've significantly reworked and simplified the patch set based on comments from
> Yu Zhao (thanks for all your feedback!). I've also renamed the feature to
> VARIABLE_THP, on Yu's advice.
>
> The last patch is for arm64 to explicitly override the default
> arch_wants_pte_order() and is intended as an example. If this series is accepted
> I suggest taking the first 4 patches through the mm tree and the arm64 change
> could be handled through the arm64 tree separately. Neither has any build
> dependency on the other.
>
> The one area where I haven't followed Yu's advice is in the determination of the
> size of folio to use. It was suggested that I have a single preferred large
> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there
> being existing overlapping populated PTEs, etc) then fallback immediately to
> order-0. It turned out that this approach caused a performance regression in the
> Speedometer benchmark.

I suppose it's a regression against v1, not against the unpatched kernel.

> With my v1 patch, there were significant quantities of
> memory which could not be placed in the 64K bucket and were instead being
> allocated for the 32K and 16K buckets. With the proposed simplification, that
> memory ended up using the 4K bucket, so page faults increased by 2.75x compared
> to the v1 patch (although due to the 64K bucket, this number is still a bit
> lower than the baseline). So instead, I continue to calculate a folio order that
> is somewhere between the preferred order and 0. (See below for more details).

I suppose the benchmark wasn't running under memory pressure, which is
uncommon for client devices. It could easily be the other way around:
using 32/16KB shows a regression whereas order-0 shows better
performance under memory pressure.

I'm not sure we should use v1 as the baseline; the unpatched kernel
sounds more reasonable at this point. If 32/16KB is proven to be better
in most scenarios, including under memory pressure, we can reintroduce
that policy. I highly doubt this is the case: we tried a 16KB base page
size on client devices, and overall the regressions outweigh the
benefits.

> The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
> [2], which is a hard dependency. I have a branch at [3].

It's not clear to me why [2] is a hard dependency.

It seems to me we are getting close and I was hoping we could get into
mm-unstable soon without depending on other series...

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
  2023-07-03 13:53   ` Ryan Roberts
@ 2023-07-04  2:22     ` Yin, Fengwei
  -1 siblings, 0 replies; 167+ messages in thread
From: Yin, Fengwei @ 2023-07-04  2:22 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi
  Cc: linux-arm-kernel, linux-kernel, linux-mm



On 7/3/2023 9:53 PM, Ryan Roberts wrote:
> arch_wants_pte_order() can be overridden by the arch to return the
> preferred folio order for pte-mapped memory. This is useful as some
> architectures (e.g. arm64) can coalesce TLB entries when the physical
> memory is suitably contiguous.
> 
> The first user for this hint will be FLEXIBLE_THP, which aims to
> allocate large folios for anonymous memory to reduce page faults and
> other per-page operation costs.
> 
> Here we add the default implementation of the function, used when the
> architecture does not define it, which returns the order corresponding
> to 64K.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  include/linux/pgtable.h | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index a661a17173fa..f7e38598f20b 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -13,6 +13,7 @@
>  #include <linux/errno.h>
>  #include <asm-generic/pgtable_uffd.h>
>  #include <linux/page_table_check.h>
> +#include <linux/sizes.h>
>  
>  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
>  	defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void)
>  }
>  #endif
>  
> +#ifndef arch_wants_pte_order
> +/*
> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
> + * to be at least order-2.
> + */
> +static inline int arch_wants_pte_order(struct vm_area_struct *vma)
> +{
> +	return ilog2(SZ_64K >> PAGE_SHIFT);
A default value that is not tied to any particular silicon could be
PAGE_ALLOC_COSTLY_ORDER?

Also, the current pcp lists cache pages of order 0...PAGE_ALLOC_COSTLY_ORDER,
plus order 9. If the pcp lists can serve the allocation, the pressure on the
zone lock is reduced.
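
For illustration, a minimal sketch of that alternative default (it assumes
PAGE_ALLOC_COSTLY_ORDER would be visible from pgtable.h, which the reply
below questions):

static inline int arch_wants_pte_order(struct vm_area_struct *vma)
{
	/* Sketch only: software default rather than a hard-coded 64K. */
	return PAGE_ALLOC_COSTLY_ORDER;
}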


Regards
Yin, Fengwei

> +}
> +#endif
> +
>  #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>  				       unsigned long address,

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
  2023-07-04  2:22     ` Yin, Fengwei
@ 2023-07-04  3:02       ` Yu Zhao
  -1 siblings, 0 replies; 167+ messages in thread
From: Yu Zhao @ 2023-07-04  3:02 UTC (permalink / raw)
  To: Yin, Fengwei
  Cc: Ryan Roberts, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On Mon, Jul 3, 2023 at 8:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>
>
>
> On 7/3/2023 9:53 PM, Ryan Roberts wrote:
> > arch_wants_pte_order() can be overridden by the arch to return the
> > preferred folio order for pte-mapped memory. This is useful as some
> > architectures (e.g. arm64) can coalesce TLB entries when the physical
> > memory is suitably contiguous.
> >
> > The first user for this hint will be FLEXIBLE_THP, which aims to
> > allocate large folios for anonymous memory to reduce page faults and
> > other per-page operation costs.
> >
> > Here we add the default implementation of the function, used when the
> > architecture does not define it, which returns the order corresponding
> > to 64K.
> >
> > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> > ---
> >  include/linux/pgtable.h | 13 +++++++++++++
> >  1 file changed, 13 insertions(+)
> >
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index a661a17173fa..f7e38598f20b 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -13,6 +13,7 @@
> >  #include <linux/errno.h>
> >  #include <asm-generic/pgtable_uffd.h>
> >  #include <linux/page_table_check.h>
> > +#include <linux/sizes.h>
> >
> >  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
> >       defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
> > @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void)
> >  }
> >  #endif
> >
> > +#ifndef arch_wants_pte_order
> > +/*
> > + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
> > + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
> > + * to be at least order-2.
> > + */
> > +static inline int arch_wants_pte_order(struct vm_area_struct *vma)
> > +{
> > +     return ilog2(SZ_64K >> PAGE_SHIFT);
> Default value which is not related with any silicon may be: PAGE_ALLOC_COSTLY_ORDER?
>
> Also, current pcp list support cache page with order 0...PAGE_ALLOC_COSTLY_ORDER, 9.
> If the pcp could cover the page, the pressure to zone lock will be reduced by pcp.

The value of PAGE_ALLOC_COSTLY_ORDER is reasonable, but again it's a
s/w policy, not a h/w preference. Besides, I don't think we can include
mmzone.h in pgtable.h.

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-03 13:53   ` Ryan Roberts
@ 2023-07-04  3:45     ` Yin, Fengwei
  -1 siblings, 0 replies; 167+ messages in thread
From: Yin, Fengwei @ 2023-07-04  3:45 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi
  Cc: linux-arm-kernel, linux-kernel, linux-mm


On 7/3/2023 9:53 PM, Ryan Roberts wrote:
> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
THP refers to huge pages, which are 2M in size; we are not dealing with
huge pages here. But I don't have a better name either.

> allocated in large folios of a specified order. All pages of the large
> folio are pte-mapped during the same page fault, significantly reducing
> the number of page faults. The number of per-page operations (e.g. ref
> counting, rmap management, lru list management) is also significantly
> reduced since those ops now become per-folio.
> 
> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
> defaults to disabled for now; there is a long list of todos to make
> FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some
> madvise ops, etc). These items will be tackled in subsequent patches.
> 
> When enabled, the preferred folio order is as returned by
> arch_wants_pte_order(), which may be overridden by the arch as it sees
> fit. Some architectures (e.g. arm64) can coalesce TLB entries if a
> contiguous set of ptes map physically contiguous, naturally aligned
> memory, so this mechanism allows the architecture to optimize as
> required.
> 
> If the preferred order can't be used (e.g. because the folio would
> breach the bounds of the vma, or because ptes in the region are already
> mapped) then we fall back to a suitable lower order.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  mm/Kconfig  |  10 ++++
>  mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 165 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 7672a22647b4..1c06b2c0a24e 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS
>  	  support of file THPs will be developed in the next few release
>  	  cycles.
>  
> +config FLEXIBLE_THP
> +	bool "Flexible order THP"
> +	depends on TRANSPARENT_HUGEPAGE
> +	default n
> +	help
> +	  Use large (bigger than order-0) folios to back anonymous memory where
> +	  possible, even if the order of the folio is smaller than the PMD
> +	  order. This reduces the number of page faults, as well as other
> +	  per-page overheads to improve performance for many workloads.
> +
>  endif # TRANSPARENT_HUGEPAGE
>  
>  #
> diff --git a/mm/memory.c b/mm/memory.c
> index fb30f7523550..abe2ea94f3f5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
>  	return 0;
>  }
>  
> +#ifdef CONFIG_FLEXIBLE_THP
> +/*
> + * Allocates, zeros and returns a folio of the requested order for use as
> + * anonymous memory.
> + */
> +static struct folio *alloc_anon_folio(struct vm_area_struct *vma,
> +				      unsigned long addr, int order)
> +{
> +	gfp_t gfp;
> +	struct folio *folio;
> +
> +	if (order == 0)
> +		return vma_alloc_zeroed_movable_folio(vma, addr);
> +
> +	gfp = vma_thp_gfp_mask(vma);
> +	folio = vma_alloc_folio(gfp, order, vma, addr, true);
> +	if (folio)
> +		clear_huge_page(&folio->page, addr, folio_nr_pages(folio));
> +
> +	return folio;
> +}
> +
> +/*
> + * Preferred folio order to allocate for anonymous memory.
> + */
> +#define max_anon_folio_order(vma)	arch_wants_pte_order(vma)
> +#else
> +#define alloc_anon_folio(vma, addr, order) \
> +				vma_alloc_zeroed_movable_folio(vma, addr)
> +#define max_anon_folio_order(vma)	0
> +#endif
> +
> +/*
> + * Returns index of first pte that is not none, or nr if all are none.
> + */
> +static inline int check_ptes_none(pte_t *pte, int nr)
> +{
> +	int i;
> +
> +	for (i = 0; i < nr; i++) {
> +		if (!pte_none(ptep_get(pte++)))
> +			return i;
> +	}
> +
> +	return nr;
> +}
> +
> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
> +{
> +	/*
> +	 * The aim here is to determine what size of folio we should allocate
> +	 * for this fault. Factors include:
> +	 * - Order must not be higher than `order` upon entry
> +	 * - Folio must be naturally aligned within VA space
> +	 * - Folio must be fully contained inside one pmd entry
> +	 * - Folio must not breach boundaries of vma
> +	 * - Folio must not overlap any non-none ptes
> +	 *
> +	 * Additionally, we do not allow order-1 since this breaks assumptions
> +	 * elsewhere in the mm; THP pages must be at least order-2 (since they
> +	 * store state up to the 3rd struct page subpage), and these pages must
> +	 * be THP in order to correctly use pre-existing THP infrastructure such
> +	 * as folio_split().
> +	 *
> +	 * Note that the caller may or may not choose to lock the pte. If
> +	 * unlocked, the result is racy and the user must re-check any overlap
> +	 * with non-none ptes under the lock.
> +	 */
> +
> +	struct vm_area_struct *vma = vmf->vma;
> +	int nr;
> +	unsigned long addr;
> +	pte_t *pte;
> +	pte_t *first_set = NULL;
> +	int ret;
> +
> +	order = min(order, PMD_SHIFT - PAGE_SHIFT);
> +
> +	for (; order > 1; order--) {
> +		nr = 1 << order;
> +		addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
> +		pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
> +
> +		/* Check vma bounds. */
> +		if (addr < vma->vm_start ||
> +		    addr + (nr << PAGE_SHIFT) > vma->vm_end)
> +			continue;
> +
> +		/* Ptes covered by order already known to be none. */
> +		if (pte + nr <= first_set)
> +			break;
> +
> +		/* Already found set pte in range covered by order. */
> +		if (pte <= first_set)
> +			continue;
> +
> +		/* Need to check if all the ptes are none. */
> +		ret = check_ptes_none(pte, nr);
> +		if (ret == nr)
> +			break;
> +
> +		first_set = pte + ret;
> +	}
> +
> +	if (order == 1)
> +		order = 0;
> +
> +	return order;
> +}
The only logic in the above function that should be kept is whether the
order fits in the VMA range.

check_ptes_none() is not accurate here because the page table lock is not
held and a concurrent fault could happen. So maybe just drop the check here?
check_ptes_none() is done again after taking the page table lock.

We would then pick either the arch-preferred order or order 0.
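
A minimal sketch of the suggested simplification (hypothetical shape: keep
only the VMA-fit clamping here and leave the none-pte check to the locked
re-check in the caller):

static int calc_anon_folio_order(struct vm_fault *vmf, int order)
{
	struct vm_area_struct *vma = vmf->vma;

	order = min(order, PMD_SHIFT - PAGE_SHIFT);

	/* Largest order whose naturally aligned range fits the VMA. */
	for (; order > 1; order--) {
		int nr = 1 << order;
		unsigned long addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);

		if (addr >= vma->vm_start &&
		    addr + (nr << PAGE_SHIFT) <= vma->vm_end)
			break;
	}

	/* Order-1 is not allowed for THPs; fall back to order-0. */
	return order > 1 ? order : 0;
}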

> +
>  /*
>   * Handle write page faults for pages that can be reused in the current vma
>   *
> @@ -3073,7 +3183,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>  		goto oom;
>  
>  	if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
> -		new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> +		new_folio = alloc_anon_folio(vma, vmf->address, 0);
>  		if (!new_folio)
>  			goto oom;
>  	} else {
> @@ -4040,6 +4150,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  	struct folio *folio;
>  	vm_fault_t ret = 0;
>  	pte_t entry;
> +	int order;
> +	int pgcount;
> +	unsigned long addr;
>  
>  	/* File mapping without ->vm_ops ? */
>  	if (vma->vm_flags & VM_SHARED)
> @@ -4081,24 +4194,51 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  			pte_unmap_unlock(vmf->pte, vmf->ptl);
>  			return handle_userfault(vmf, VM_UFFD_MISSING);
>  		}
> -		goto setpte;
> +		if (uffd_wp)
> +			entry = pte_mkuffd_wp(entry);
> +		set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> +
> +		/* No need to invalidate - it was non-present before */
> +		update_mmu_cache(vma, vmf->address, vmf->pte);
> +		goto unlock;
> +	}
> +
> +	/*
> +	 * If allocating a large folio, determine the biggest suitable order for
> +	 * the VMA (e.g. it must not exceed the VMA's bounds, it must not
> +	 * overlap with any populated PTEs, etc). We are not under the ptl here
> +	 * so we will need to re-check that we are not overlapping any populated
> +	 * PTEs once we have the lock.
> +	 */
> +	order = uffd_wp ? 0 : max_anon_folio_order(vma);
> +	if (order > 0) {
> +		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
> +		order = calc_anon_folio_order_alloc(vmf, order);
> +		pte_unmap(vmf->pte);
>  	}
>  
> -	/* Allocate our own private page. */
> +	/* Allocate our own private folio. */
>  	if (unlikely(anon_vma_prepare(vma)))
>  		goto oom;
> -	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> +	folio = alloc_anon_folio(vma, vmf->address, order);
> +	if (!folio && order > 0) {
> +		order = 0;
> +		folio = alloc_anon_folio(vma, vmf->address, order);
> +	}
>  	if (!folio)
>  		goto oom;
>  
> +	pgcount = 1 << order;
> +	addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
> +
>  	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>  		goto oom_free_page;
>  	folio_throttle_swaprate(folio, GFP_KERNEL);
>  
>  	/*
>  	 * The memory barrier inside __folio_mark_uptodate makes sure that
> -	 * preceding stores to the page contents become visible before
> -	 * the set_pte_at() write.
> +	 * preceding stores to the folio contents become visible before
> +	 * the set_ptes() write.
>  	 */
>  	__folio_mark_uptodate(folio);
>  
> @@ -4107,11 +4247,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  	if (vma->vm_flags & VM_WRITE)
>  		entry = pte_mkwrite(pte_mkdirty(entry));
>  
> -	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> -			&vmf->ptl);
> +	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>  	if (vmf_pte_changed(vmf)) {
>  		update_mmu_tlb(vma, vmf->address, vmf->pte);
>  		goto release;
> +	} else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) {
This could be the case where we allocated an order-4 folio and then find a
neighboring PTE was filled by a concurrent fault. Should we put the current
folio and fall back to order 0 and retry immediately (goto the order-0
allocation instead of returning from this function, which would take the
page fault path again)?
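
One possible shape for that immediate fallback (a sketch only, using a
hypothetical retry label; the patch as posted just releases the folio and
takes the fault again):

	else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) {
		/* Raced with another fault in the range: retry as order-0. */
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		folio_put(folio);
		order = 0;
		goto retry_alloc;	/* hypothetical label before alloc_anon_folio() */
	}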


Regards
Yin, Fengwei

> +		goto release;
>  	}
>  
>  	ret = check_stable_address_space(vma->vm_mm);
> @@ -4125,16 +4266,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  		return handle_userfault(vmf, VM_UFFD_MISSING);
>  	}
>  
> -	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> -	folio_add_new_anon_rmap(folio, vma, vmf->address);
> +	folio_ref_add(folio, pgcount - 1);
> +	add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
> +	folio_add_new_anon_rmap(folio, vma, addr);
>  	folio_add_lru_vma(folio, vma);
> -setpte:
> +
>  	if (uffd_wp)
>  		entry = pte_mkuffd_wp(entry);
> -	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> +	set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
>  
>  	/* No need to invalidate - it was non-present before */
> -	update_mmu_cache(vma, vmf->address, vmf->pte);
> +	update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
>  unlock:
>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
>  	return ret;

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
  2023-07-04  3:02       ` Yu Zhao
@ 2023-07-04  3:59         ` Yu Zhao
  -1 siblings, 0 replies; 167+ messages in thread
From: Yu Zhao @ 2023-07-04  3:59 UTC (permalink / raw)
  To: Yin, Fengwei, Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On Mon, Jul 3, 2023 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Mon, Jul 3, 2023 at 8:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
> >
> >
> >
> > On 7/3/2023 9:53 PM, Ryan Roberts wrote:
> > > arch_wants_pte_order() can be overridden by the arch to return the
> > > preferred folio order for pte-mapped memory. This is useful as some
> > > architectures (e.g. arm64) can coalesce TLB entries when the physical
> > > memory is suitably contiguous.
> > >
> > > The first user for this hint will be FLEXIBLE_THP, which aims to
> > > allocate large folios for anonymous memory to reduce page faults and
> > > other per-page operation costs.
> > >
> > > Here we add the default implementation of the function, used when the
> > > architecture does not define it, which returns the order corresponding
> > > to 64K.
> > >
> > > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> > > ---
> > >  include/linux/pgtable.h | 13 +++++++++++++
> > >  1 file changed, 13 insertions(+)
> > >
> > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > > index a661a17173fa..f7e38598f20b 100644
> > > --- a/include/linux/pgtable.h
> > > +++ b/include/linux/pgtable.h
> > > @@ -13,6 +13,7 @@
> > >  #include <linux/errno.h>
> > >  #include <asm-generic/pgtable_uffd.h>
> > >  #include <linux/page_table_check.h>
> > > +#include <linux/sizes.h>
> > >
> > >  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
> > >       defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
> > > @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void)
> > >  }
> > >  #endif
> > >
> > > +#ifndef arch_wants_pte_order
> > > +/*
> > > + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
> > > + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
> > > + * to be at least order-2.
> > > + */
> > > +static inline int arch_wants_pte_order(struct vm_area_struct *vma)
> > > +{
> > > +     return ilog2(SZ_64K >> PAGE_SHIFT);
> > Default value which is not related with any silicon may be: PAGE_ALLOC_COSTLY_ORDER?
> >
> > Also, current pcp list support cache page with order 0...PAGE_ALLOC_COSTLY_ORDER, 9.
> > If the pcp could cover the page, the pressure to zone lock will be reduced by pcp.
>
> The value of PAGE_ALLOC_COSTLY_ORDER is reasonable but again it's a
> s/w policy not a h/w preference. Besides, I don't think we can include
> mmzone.h in pgtable.h.

I think we can make a compromise:
1. change the default implementation of arch_has_hw_pte_young() to return 0, and
2. in memory.c, we can try PAGE_ALLOC_COSTLY_ORDER for archs that
don't override arch_has_hw_pte_young(), or if its return value is too
large to fit.
This should also take care of the regression, right?
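
Reading arch_has_hw_pte_young() above as arch_wants_pte_order() (see the
follow-up below), a rough sketch of the memory.c side, with a hypothetical
helper name:

static int anon_folio_order(struct vm_area_struct *vma)
{
	int order = arch_wants_pte_order(vma);

	/*
	 * Sketch only: the arch default would now return 0 (no h/w
	 * preference), so fall back to the software default; an over-large
	 * arch value is still clamped by the existing VMA/PMD checks.
	 */
	if (order <= 0)
		order = PAGE_ALLOC_COSTLY_ORDER;

	return order;
}

#define max_anon_folio_order(vma)	anon_folio_order(vma)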

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
  2023-07-04  3:59         ` Yu Zhao
@ 2023-07-04  5:22           ` Yin, Fengwei
  -1 siblings, 0 replies; 167+ messages in thread
From: Yin, Fengwei @ 2023-07-04  5:22 UTC (permalink / raw)
  To: Yu Zhao, Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm



On 7/4/2023 11:59 AM, Yu Zhao wrote:
> On Mon, Jul 3, 2023 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote:
>>
>> On Mon, Jul 3, 2023 at 8:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>
>>>
>>>
>>> On 7/3/2023 9:53 PM, Ryan Roberts wrote:
>>>> arch_wants_pte_order() can be overridden by the arch to return the
>>>> preferred folio order for pte-mapped memory. This is useful as some
>>>> architectures (e.g. arm64) can coalesce TLB entries when the physical
>>>> memory is suitably contiguous.
>>>>
>>>> The first user for this hint will be FLEXIBLE_THP, which aims to
>>>> allocate large folios for anonymous memory to reduce page faults and
>>>> other per-page operation costs.
>>>>
>>>> Here we add the default implementation of the function, used when the
>>>> architecture does not define it, which returns the order corresponding
>>>> to 64K.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>>  include/linux/pgtable.h | 13 +++++++++++++
>>>>  1 file changed, 13 insertions(+)
>>>>
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index a661a17173fa..f7e38598f20b 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -13,6 +13,7 @@
>>>>  #include <linux/errno.h>
>>>>  #include <asm-generic/pgtable_uffd.h>
>>>>  #include <linux/page_table_check.h>
>>>> +#include <linux/sizes.h>
>>>>
>>>>  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
>>>>       defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
>>>> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void)
>>>>  }
>>>>  #endif
>>>>
>>>> +#ifndef arch_wants_pte_order
>>>> +/*
>>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
>>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
>>>> + * to be at least order-2.
>>>> + */
>>>> +static inline int arch_wants_pte_order(struct vm_area_struct *vma)
>>>> +{
>>>> +     return ilog2(SZ_64K >> PAGE_SHIFT);
>>> Default value which is not related with any silicon may be: PAGE_ALLOC_COSTLY_ORDER?
>>>
>>> Also, current pcp list support cache page with order 0...PAGE_ALLOC_COSTLY_ORDER, 9.
>>> If the pcp could cover the page, the pressure to zone lock will be reduced by pcp.
>>
>> The value of PAGE_ALLOC_COSTLY_ORDER is reasonable but again it's a
>> s/w policy not a h/w preference. Besides, I don't think we can include
>> mmzone.h in pgtable.h.
> 
> I think we can make a compromise:
> 1. change the default implementation of arch_has_hw_pte_young() to return 0, and
> 2. in memory.c, we can try PAGE_ALLOC_COSTLY_ORDER for archs that
> don't override arch_has_hw_pte_young(), or if its return value is too
> large to fit.
Do you mean arch_wants_pte_order()? Yes. This looks good to me. Thanks.


Regards
Yin, Fengwei

> This should also take care of the regression, right?

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
  2023-07-04  5:22           ` Yin, Fengwei
@ 2023-07-04  5:42             ` Yu Zhao
  -1 siblings, 0 replies; 167+ messages in thread
From: Yu Zhao @ 2023-07-04  5:42 UTC (permalink / raw)
  To: Yin, Fengwei
  Cc: Ryan Roberts, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On Mon, Jul 3, 2023 at 11:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>
>
>
> On 7/4/2023 11:59 AM, Yu Zhao wrote:
> > On Mon, Jul 3, 2023 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote:
> >>
> >> On Mon, Jul 3, 2023 at 8:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
> >>>
> >>>
> >>>
> >>> On 7/3/2023 9:53 PM, Ryan Roberts wrote:
> >>>> arch_wants_pte_order() can be overridden by the arch to return the
> >>>> preferred folio order for pte-mapped memory. This is useful as some
> >>>> architectures (e.g. arm64) can coalesce TLB entries when the physical
> >>>> memory is suitably contiguous.
> >>>>
> >>>> The first user for this hint will be FLEXIBLE_THP, which aims to
> >>>> allocate large folios for anonymous memory to reduce page faults and
> >>>> other per-page operation costs.
> >>>>
> >>>> Here we add the default implementation of the function, used when the
> >>>> architecture does not define it, which returns the order corresponding
> >>>> to 64K.
> >>>>
> >>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >>>> ---
> >>>>  include/linux/pgtable.h | 13 +++++++++++++
> >>>>  1 file changed, 13 insertions(+)
> >>>>
> >>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >>>> index a661a17173fa..f7e38598f20b 100644
> >>>> --- a/include/linux/pgtable.h
> >>>> +++ b/include/linux/pgtable.h
> >>>> @@ -13,6 +13,7 @@
> >>>>  #include <linux/errno.h>
> >>>>  #include <asm-generic/pgtable_uffd.h>
> >>>>  #include <linux/page_table_check.h>
> >>>> +#include <linux/sizes.h>
> >>>>
> >>>>  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
> >>>>       defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
> >>>> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void)
> >>>>  }
> >>>>  #endif
> >>>>
> >>>> +#ifndef arch_wants_pte_order
> >>>> +/*
> >>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
> >>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
> >>>> + * to be at least order-2.
> >>>> + */
> >>>> +static inline int arch_wants_pte_order(struct vm_area_struct *vma)
> >>>> +{
> >>>> +     return ilog2(SZ_64K >> PAGE_SHIFT);
> >>> Default value which is not related with any silicon may be: PAGE_ALLOC_COSTLY_ORDER?
> >>>
> >>> Also, current pcp list support cache page with order 0...PAGE_ALLOC_COSTLY_ORDER, 9.
> >>> If the pcp could cover the page, the pressure to zone lock will be reduced by pcp.
> >>
> >> The value of PAGE_ALLOC_COSTLY_ORDER is reasonable but again it's a
> >> s/w policy not a h/w preference. Besides, I don't think we can include
> >> mmzone.h in pgtable.h.
> >
> > I think we can make a compromise:
> > 1. change the default implementation of arch_has_hw_pte_young() to return 0, and
> > 2. in memory.c, we can try PAGE_ALLOC_COSTLY_ORDER for archs that
> > don't override arch_has_hw_pte_young(), or if its return value is too
> > large to fit.
> Do you mean arch_wants_pte_order()? Yes. This looks good to me. Thanks.

Sorry, copied the wrong function from above and pasted without looking...

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
  2023-07-04  2:18   ` Yu Zhao
@ 2023-07-04  6:22     ` Yin, Fengwei
  -1 siblings, 0 replies; 167+ messages in thread
From: Yin, Fengwei @ 2023-07-04  6:22 UTC (permalink / raw)
  To: Yu Zhao, Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm



On 7/4/2023 10:18 AM, Yu Zhao wrote:
> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Hi All,
>>
>> This is v2 of a series to implement variable order, large folios for anonymous
>> memory. The objective of this is to improve performance by allocating larger
>> chunks of memory during anonymous page faults. See [1] for background.
> 
> Thanks for the quick response!
> 
>> I've significantly reworked and simplified the patch set based on comments from
>> Yu Zhao (thanks for all your feedback!). I've also renamed the feature to
>> VARIABLE_THP, on Yu's advice.
>>
>> The last patch is for arm64 to explicitly override the default
>> arch_wants_pte_order() and is intended as an example. If this series is accepted
>> I suggest taking the first 4 patches through the mm tree and the arm64 change
>> could be handled through the arm64 tree separately. Neither has any build
>> dependency on the other.
>>
>> The one area where I haven't followed Yu's advice is in the determination of the
>> size of folio to use. It was suggested that I have a single preferred large
>> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there
>> being existing overlapping populated PTEs, etc) then fallback immediately to
>> order-0. It turned out that this approach caused a performance regression in the
>> Speedometer benchmark.
> 
> I suppose it's regression against the v1, not the unpatched kernel.
From the performance data Ryan shared, it's against the unpatched kernel:

Speedometer 2.0:

| kernel                         |   runs_per_min |
|:-------------------------------|---------------:|
| baseline-4k                    |           0.0% |
| anonfolio-lkml-v1              |           0.7% |
| anonfolio-lkml-v2-simple-order |          -0.9% |
| anonfolio-lkml-v2              |           0.5% |


What if we use 32K or 16K instead of 64K as the default anonymous folio size? I suspect
this app's sweet spot may be 32K or 16K anon folios.
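
Purely as an illustration (nothing measured here), that would just mean a
smaller constant in the default implementation from the other patch, e.g.:

static inline int arch_wants_pte_order(struct vm_area_struct *vma)
{
	return ilog2(SZ_32K >> PAGE_SHIFT);	/* or SZ_16K */
}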


Regards
Yin, Fengwei

> 
>> With my v1 patch, there were significant quantities of
>> memory which could not be placed in the 64K bucket and were instead being
>> allocated for the 32K and 16K buckets. With the proposed simplification, that
>> memory ended up using the 4K bucket, so page faults increased by 2.75x compared
>> to the v1 patch (although due to the 64K bucket, this number is still a bit
>> lower than the baseline). So instead, I continue to calculate a folio order that
>> is somewhere between the preferred order and 0. (See below for more details).
> 
> I suppose the benchmark wasn't running under memory pressure, which is
> uncommon for client devices. It could be easier the other way around:
> using 32/16KB shows regression whereas order-0 shows better
> performance under memory pressure.
> 
> I'm not sure we should use v1 as the baseline. Unpatched kernel sounds
> more reasonable at this point. If 32/16KB is proven to be better in
> most scenarios including under memory pressure, we can reintroduce
> that policy. I highly doubt this is the case: we tried 16KB base page
> size on client devices, and overall, the regressions outweighs the
> benefits.
> 
>> The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
>> [2], which is a hard dependency. I have a branch at [3].
> 
> It's not clear to me why [2] is a hard dependency.
> 
> It seems to me we are getting close and I was hoping we could get into
> mm-unstable soon without depending on other series...
> 

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
  2023-07-04  6:22     ` Yin, Fengwei
@ 2023-07-04  7:11       ` Yu Zhao
  -1 siblings, 0 replies; 167+ messages in thread
From: Yu Zhao @ 2023-07-04  7:11 UTC (permalink / raw)
  To: Yin, Fengwei, Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On Tue, Jul 4, 2023 at 12:22 AM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>
> On 7/4/2023 10:18 AM, Yu Zhao wrote:
> > On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> Hi All,
> >>
> >> This is v2 of a series to implement variable order, large folios for anonymous
> >> memory. The objective of this is to improve performance by allocating larger
> >> chunks of memory during anonymous page faults. See [1] for background.
> >
> > Thanks for the quick response!
> >
> >> I've significantly reworked and simplified the patch set based on comments from
> >> Yu Zhao (thanks for all your feedback!). I've also renamed the feature to
> >> VARIABLE_THP, on Yu's advice.
> >>
> >> The last patch is for arm64 to explicitly override the default
> >> arch_wants_pte_order() and is intended as an example. If this series is accepted
> >> I suggest taking the first 4 patches through the mm tree and the arm64 change
> >> could be handled through the arm64 tree separately. Neither has any build
> >> dependency on the other.
> >>
> >> The one area where I haven't followed Yu's advice is in the determination of the
> >> size of folio to use. It was suggested that I have a single preferred large
> >> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there
> >> being existing overlapping populated PTEs, etc) then fallback immediately to
> >> order-0. It turned out that this approach caused a performance regression in the
> >> Speedometer benchmark.
> >
> > I suppose it's regression against the v1, not the unpatched kernel.
> From the performance data Ryan shared, it's against unpatched kernel:
>
> Speedometer 2.0:
>
> | kernel                         |   runs_per_min |
> |:-------------------------------|---------------:|
> | baseline-4k                    |           0.0% |
> | anonfolio-lkml-v1              |           0.7% |
> | anonfolio-lkml-v2-simple-order |          -0.9% |
> | anonfolio-lkml-v2              |           0.5% |

I see. Thanks.

A couple of questions:
1. Do we have a stddev?
2. Do we have a theory why it regressed?
Assuming no bugs, I don't see how a real regression could happen --
falling back to order-0 isn't different from the original behavior.
Ryan, could you run `perf record` and `cat /proc/vmstat` and share the results?
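
(Roughly -- the exact invocation depends on how the benchmark is driven, so
treat this as a sketch:

cat /proc/vmstat > vmstat.before
perf record -a -g -- <command driving the Speedometer run>
cat /proc/vmstat > vmstat.after
perf report

i.e. a system-wide profile plus before/after vmstat snapshots.)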

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 1/5] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
  2023-07-04  2:13       ` Yin, Fengwei
@ 2023-07-04 11:19         ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-04 11:19 UTC (permalink / raw)
  To: Yin, Fengwei, Yu Zhao
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On 04/07/2023 03:13, Yin, Fengwei wrote:
> 
> 
> On 7/4/2023 3:05 AM, Yu Zhao wrote:
>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> In preparation for FLEXIBLE_THP support, improve
>>> folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be
>>> passed to it. In this case, all contained pages are accounted using the
>>> "small" pages scheme.
>>
>> Nit: In this case, all *subpages*  are accounted using the *order-0
>> folio* (or base page) scheme.
> Matthew suggested not to use subpage with folio. Using page with folio:
> https://lore.kernel.org/linux-mm/Y9qiS%2FIxZOMx62t6@casper.infradead.org/

OK, I'll change this to "In this case, all contained pages are accounted using
the *order-0 folio* (or base page) scheme."

> 
>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>
>> Reviewed-by: Yu Zhao <yuzhao@google.com>

Thanks!

>>
>>>  mm/rmap.c | 26 +++++++++++++++++++-------
>>>  1 file changed, 19 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 1d8369549424..82ef5ba363d1 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1278,31 +1278,43 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
>>>   * This means the inc-and-test can be bypassed.
>>>   * The folio does not have to be locked.
>>>   *
>>> - * If the folio is large, it is accounted as a THP.  As the folio
>>> + * If the folio is pmd-mappable, it is accounted as a THP.  As the folio
>>>   * is new, it's assumed to be mapped exclusively by a single process.
>>>   */
>>>  void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>>>                 unsigned long address)
>>>  {
>>> -       int nr;
>>> +       int nr = folio_nr_pages(folio);
>>> +       int i;
>>> +       struct page *page;
>>>
>>> -       VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
>>> +       VM_BUG_ON_VMA(address < vma->vm_start ||
>>> +                       address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
>>>         __folio_set_swapbacked(folio);
>>>
>>> -       if (likely(!folio_test_pmd_mappable(folio))) {
>>> +       if (!folio_test_large(folio)) {
>>>                 /* increment count (starts at -1) */
>>>                 atomic_set(&folio->_mapcount, 0);
>>> -               nr = 1;
>>> +               __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>>> +       } else if (!folio_test_pmd_mappable(folio)) {
>>> +               /* increment count (starts at 0) */
>>> +               atomic_set(&folio->_nr_pages_mapped, nr);
>>> +
>>> +               page = &folio->page;
>>> +               for (i = 0; i < nr; i++, page++, address += PAGE_SIZE) {
>>> +                       /* increment count (starts at -1) */
>>> +                       atomic_set(&page->_mapcount, 0);
>>> +                       __page_set_anon_rmap(folio, page, vma, address, 1);
>>> +               }
>>
>> Nit: use folio_page(), e.g.,

Yep, will change for v3.

>>
>>   } else if (!folio_test_pmd_mappable(folio)) {
>>     int i;
>>
>>     for (i = 0; i < nr; i++) {
>>       struct page *page = folio_page(folio, i);
>>
>>       /* increment count (starts at -1) */
>>       atomic_set(&page->_mapcount, 0);
>>       __page_set_anon_rmap(folio, page, vma, address + PAGE_SIZE * i, 1);
>>     }
>>     /* increment count (starts at 0) */
>>     atomic_set(&folio->_nr_pages_mapped, nr);
>>   } else {
>>
>>>         } else {
>>>                 /* increment count (starts at -1) */
>>>                 atomic_set(&folio->_entire_mapcount, 0);
>>>                 atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);
>>> -               nr = folio_nr_pages(folio);
>>>                 __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
>>> +               __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>>>         }
>>>
>>>         __lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
>>> -       __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>>>  }


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
  2023-07-04  3:59         ` Yu Zhao
@ 2023-07-04 12:36           ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-04 12:36 UTC (permalink / raw)
  To: Yu Zhao, Yin, Fengwei
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On 04/07/2023 04:59, Yu Zhao wrote:
> On Mon, Jul 3, 2023 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote:
>>
>> On Mon, Jul 3, 2023 at 8:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>
>>>
>>>
>>> On 7/3/2023 9:53 PM, Ryan Roberts wrote:
>>>> arch_wants_pte_order() can be overridden by the arch to return the
>>>> preferred folio order for pte-mapped memory. This is useful as some
>>>> architectures (e.g. arm64) can coalesce TLB entries when the physical
>>>> memory is suitably contiguous.
>>>>
>>>> The first user for this hint will be FLEXIBLE_THP, which aims to
>>>> allocate large folios for anonymous memory to reduce page faults and
>>>> other per-page operation costs.
>>>>
>>>> Here we add the default implementation of the function, used when the
>>>> architecture does not define it, which returns the order corresponding
>>>> to 64K.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>>  include/linux/pgtable.h | 13 +++++++++++++
>>>>  1 file changed, 13 insertions(+)
>>>>
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index a661a17173fa..f7e38598f20b 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -13,6 +13,7 @@
>>>>  #include <linux/errno.h>
>>>>  #include <asm-generic/pgtable_uffd.h>
>>>>  #include <linux/page_table_check.h>
>>>> +#include <linux/sizes.h>
>>>>
>>>>  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
>>>>       defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
>>>> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void)
>>>>  }
>>>>  #endif
>>>>
>>>> +#ifndef arch_wants_pte_order
>>>> +/*
>>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
>>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
>>>> + * to be at least order-2.
>>>> + */
>>>> +static inline int arch_wants_pte_order(struct vm_area_struct *vma)
>>>> +{
>>>> +     return ilog2(SZ_64K >> PAGE_SHIFT);
>>> Default value which is not related with any silicon may be: PAGE_ALLOC_COSTLY_ORDER?
>>>
>>> Also, current pcp list support cache page with order 0...PAGE_ALLOC_COSTLY_ORDER, 9.
>>> If the pcp could cover the page, the pressure to zone lock will be reduced by pcp.
>>
>> The value of PAGE_ALLOC_COSTLY_ORDER is reasonable but again it's a
>> s/w policy not a h/w preference. Besides, I don't think we can include
>> mmzone.h in pgtable.h.
> 
> I think we can make a compromise:
> 1. change the default implementation of arch_has_hw_pte_young() to return 0, and
> 2. in memory.c, we can try PAGE_ALLOC_COSTLY_ORDER for archs that
> don't override arch_has_hw_pte_young(), or if its return value is too
> large to fit.
> This should also take care of the regression, right?

I think you are suggesting that we use 0 as a sentinel which we then translate
to PAGE_ALLOC_COSTLY_ORDER? I already have a max_anon_folio_order() function in
memory.c (actually it is currently a macro defined as arch_wants_pte_order()).

So it would become (I'll talk about the vma concern separately in the thread
where you raised it):

static inline int max_anon_folio_order(struct vm_area_struct *vma)
{
	int order = arch_wants_pte_order(vma);

	return order ? order : PAGE_ALLOC_COSTLY_ORDER;
}

Correct?

I don't see how it fixes the regression (I assume you're talking about
Speedometer), though? On arm64, arch_wants_pte_order() will still be returning
order-4.


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
  2023-07-03 19:50     ` Yu Zhao
@ 2023-07-04 13:20       ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-04 13:20 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On 03/07/2023 20:50, Yu Zhao wrote:
> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> arch_wants_pte_order() can be overridden by the arch to return the
>> preferred folio order for pte-mapped memory. This is useful as some
>> architectures (e.g. arm64) can coalesce TLB entries when the physical
>> memory is suitably contiguous.
>>
>> The first user for this hint will be FLEXIBLE_THP, which aims to
>> allocate large folios for anonymous memory to reduce page faults and
>> other per-page operation costs.
>>
>> Here we add the default implementation of the function, used when the
>> architecture does not define it, which returns the order corresponding
>> to 64K.
> 
> I don't really mind a non-zero default value. But people would ask why
> non-zero and why 64KB. Probably you could argue this is the large size
> all known archs support if they have TLB coalescing. For x86, AMD CPUs
> would want to override this. I'll leave it to Fengwei to decide
> whether Intel wants a different default value.>
> Also I don't like the vma parameter because it makes
> arch_wants_pte_order() a mix of hw preference and vma policy. From my
> POV, the function should be only about the former; the latter should
> be decided by arch-independent MM code. However, I can live with it if
> ARM MM people think this is really what you want. ATM, I'm skeptical
> they do.

Here's the big picture for what I'm trying to achieve:

 - In the common case, I'd like all programs to get a performance bump by
automatically and transparently using large anon folios - so no explicit
requirement on the process to opt-in.

 - On arm64, in the above case, I'd like the preferred folio size to be 64K;
from the (admittedly limited) testing I've done, that's about where the
performance knee is, and it doesn't appear to increase the memory wastage very
much. It also has the benefit that for 4K base pages this is the contpte size
(order-4) so I can take full benefit of contpte mappings transparently to the
process. And for 16K this is the HPA size (order-2).

 - On arm64 when the process has marked the VMA for THP (or when
transparent_hugepage=always) but the VMA does not meet the requirements for a
PMD-sized mapping (or we failed to allocate, ...) then I'd like to map using
contpte. For 4K base pages this is 64K (order-4), for 16K this is 2M (order-7)
and for 64K this is 2M (order-5). The 64K base page case is very important since
the PMD size for that base page is 512MB which is almost impossible to allocate
in practice.

So one approach would be to define arch_wants_pte_order() as always returning
the contpte size (remove the vma parameter). Then max_anon_folio_order() in
memory.c could do this:


#define MAX_ANON_FOLIO_ORDER_NOTHP	ilog2(SZ_64K >> PAGE_SHIFT)

static inline int max_anon_folio_order(struct vm_area_struct *vma)
{
	int order = arch_wants_pte_order();

	// Fix up default case which returns 0 because PAGE_ALLOC_COSTLY_ORDER
	// can't be used directly in pgtable.h
	order = order ? order : PAGE_ALLOC_COSTLY_ORDER;

	if (hugepage_vma_check(vma, vma->vm_flags, false, true, true))
		return order;
	else
		return min(order, MAX_ANON_FOLIO_ORDER_NOTHP);
}


This moves the SW policy into memory.c and gives you PAGE_ALLOC_COSTLY_ORDER (or
whatever default we decide on) as the default for arches with no override, and
also meets all my goals above.

> 
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> 
> After another CPU vendor, e.g., Fengwei, and an ARM MM person, e.g.,
> Will give the green light:
> Reviewed-by: Yu Zhao <yuzhao@google.com>
> 
>> ---
>>  include/linux/pgtable.h | 13 +++++++++++++
>>  1 file changed, 13 insertions(+)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index a661a17173fa..f7e38598f20b 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -13,6 +13,7 @@
>>  #include <linux/errno.h>
>>  #include <asm-generic/pgtable_uffd.h>
>>  #include <linux/page_table_check.h>
>> +#include <linux/sizes.h>
>>
>>  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
>>         defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
>> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void)
>>  }
>>  #endif
>>
>> +#ifndef arch_wants_pte_order
>> +/*
>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
> 
> The warning is helpful.
> 
>> + * to be at least order-2.
>> + */
>> +static inline int arch_wants_pte_order(struct vm_area_struct *vma)
>> +{
>> +       return ilog2(SZ_64K >> PAGE_SHIFT);
>> +}
>> +#endif
>> +
>>  #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>                                        unsigned long address,


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
  2023-07-04 12:36           ` Ryan Roberts
@ 2023-07-04 13:23             ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-04 13:23 UTC (permalink / raw)
  To: Yu Zhao, Yin, Fengwei
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On 04/07/2023 13:36, Ryan Roberts wrote:
> On 04/07/2023 04:59, Yu Zhao wrote:
>> On Mon, Jul 3, 2023 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote:
>>>
>>> On Mon, Jul 3, 2023 at 8:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>>
>>>>
>>>>
>>>> On 7/3/2023 9:53 PM, Ryan Roberts wrote:
>>>>> arch_wants_pte_order() can be overridden by the arch to return the
>>>>> preferred folio order for pte-mapped memory. This is useful as some
>>>>> architectures (e.g. arm64) can coalesce TLB entries when the physical
>>>>> memory is suitably contiguous.
>>>>>
>>>>> The first user for this hint will be FLEXIBLE_THP, which aims to
>>>>> allocate large folios for anonymous memory to reduce page faults and
>>>>> other per-page operation costs.
>>>>>
>>>>> Here we add the default implementation of the function, used when the
>>>>> architecture does not define it, which returns the order corresponding
>>>>> to 64K.
>>>>>
>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>> ---
>>>>>  include/linux/pgtable.h | 13 +++++++++++++
>>>>>  1 file changed, 13 insertions(+)
>>>>>
>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>> index a661a17173fa..f7e38598f20b 100644
>>>>> --- a/include/linux/pgtable.h
>>>>> +++ b/include/linux/pgtable.h
>>>>> @@ -13,6 +13,7 @@
>>>>>  #include <linux/errno.h>
>>>>>  #include <asm-generic/pgtable_uffd.h>
>>>>>  #include <linux/page_table_check.h>
>>>>> +#include <linux/sizes.h>
>>>>>
>>>>>  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
>>>>>       defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
>>>>> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void)
>>>>>  }
>>>>>  #endif
>>>>>
>>>>> +#ifndef arch_wants_pte_order
>>>>> +/*
>>>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
>>>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
>>>>> + * to be at least order-2.
>>>>> + */
>>>>> +static inline int arch_wants_pte_order(struct vm_area_struct *vma)
>>>>> +{
>>>>> +     return ilog2(SZ_64K >> PAGE_SHIFT);
>>>> Default value which is not related with any silicon may be: PAGE_ALLOC_COSTLY_ORDER?
>>>>
>>>> Also, current pcp list support cache page with order 0...PAGE_ALLOC_COSTLY_ORDER, 9.
>>>> If the pcp could cover the page, the pressure to zone lock will be reduced by pcp.
>>>
>>> The value of PAGE_ALLOC_COSTLY_ORDER is reasonable but again it's a
>>> s/w policy not a h/w preference. Besides, I don't think we can include
>>> mmzone.h in pgtable.h.
>>
>> I think we can make a compromise:
>> 1. change the default implementation of arch_has_hw_pte_young() to return 0, and
>> 2. in memory.c, we can try PAGE_ALLOC_COSTLY_ORDER for archs that
>> don't override arch_has_hw_pte_young(), or if its return value is too
>> large to fit.
>> This should also take care of the regression, right?
> 
> I think you are suggesting that we use 0 as a sentinel which we then translate
> to PAGE_ALLOC_COSTLY_ORDER? I already have a max_anon_folio_order() function in
> memory.c (actually it is currently a macro defined as arch_wants_pte_order()).
> 
> So it would become (I'll talk about the vma concern separately in the thread
> where you raised it):
> 
> static inline int max_anon_folio_order(struct vm_area_struct *vma)
> {
> 	int order = arch_wants_pte_order(vma);
> 
> 	return order ? order : PAGE_ALLOC_COSTLY_ORDER;
> }
> 
> Correct?

Actually, I'm not sure it's a good idea to default to a fixed order. If running
on an arch with big base pages (e.g. powerpc with 64K pages?), that will soon
add up to a big chunk of memory, which could be wasteful?

PAGE_ALLOC_COSTLY_ORDER = 3, so with a 64K base page that's 512K. Is that a concern?
Wouldn't it be better to define this as an absolute size? Or even the min of
PAGE_ALLOC_COSTLY_ORDER and an absolute size?
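
Purely as an illustration of the "min of PAGE_ALLOC_COSTLY_ORDER and an
absolute size" idea (MAX_ANON_FOLIO_SZ is a made-up name here, not something
from the series), the fallback could be capped along these lines:

/* Illustrative sketch only; MAX_ANON_FOLIO_SZ is not an existing constant. */
#define MAX_ANON_FOLIO_SZ	SZ_64K

static inline int max_anon_folio_order(struct vm_area_struct *vma)
{
	int order = arch_wants_pte_order(vma);

	/* No arch preference: cap the software default by an absolute size. */
	if (!order)
		order = min_t(int, PAGE_ALLOC_COSTLY_ORDER,
			      ilog2(MAX_ANON_FOLIO_SZ >> PAGE_SHIFT));

	return order;
}

With a 4K base page that still gives PAGE_ALLOC_COSTLY_ORDER (32K), but with a
64K base page it clamps the default to order-0 rather than 512K.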


> 
> I don't see how it fixes the regression (assume you're talking about
> Speedometer) though? On arm64 arch_wants_pte_order() will still be returning
> order-4.
> 


^ permalink raw reply	[flat|nested] 167+ messages in thread


* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-04  1:35     ` Yu Zhao
@ 2023-07-04 14:08       ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-04 14:08 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On 04/07/2023 02:35, Yu Zhao wrote:
> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
>> allocated in large folios of a specified order. All pages of the large
>> folio are pte-mapped during the same page fault, significantly reducing
>> the number of page faults. The number of per-page operations (e.g. ref
>> counting, rmap management lru list management) are also significantly
>> reduced since those ops now become per-folio.
>>
>> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
>> defaults to disabled for now; there is a long list of todos to make
>> FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some
>> madvise ops, etc). These items will be tackled in subsequent patches.
>>
>> When enabled, the preferred folio order is as returned by
>> arch_wants_pte_order(), which may be overridden by the arch as it sees
>> fit. Some architectures (e.g. arm64) can coalsece TLB entries if a
> 
> coalesce

ACK

> 
>> contiguous set of ptes map physically contigious, naturally aligned
> 
> contiguous

ACK

> 
>> memory, so this mechanism allows the architecture to optimize as
>> required.
>>
>> If the preferred order can't be used (e.g. because the folio would
>> breach the bounds of the vma, or because ptes in the region are already
>> mapped) then we fall back to a suitable lower order.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  mm/Kconfig  |  10 ++++
>>  mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++----
>>  2 files changed, 165 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 7672a22647b4..1c06b2c0a24e 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS
>>           support of file THPs will be developed in the next few release
>>           cycles.
>>
>> +config FLEXIBLE_THP
>> +       bool "Flexible order THP"
>> +       depends on TRANSPARENT_HUGEPAGE
>> +       default n
> 
> The default value is already N.

Is there a coding standard for this? Personally I prefer to make it explicit.

> 
>> +       help
>> +         Use large (bigger than order-0) folios to back anonymous memory where
>> +         possible, even if the order of the folio is smaller than the PMD
>> +         order. This reduces the number of page faults, as well as other
>> +         per-page overheads to improve performance for many workloads.
>> +
>>  endif # TRANSPARENT_HUGEPAGE
>>
>>  #
>> diff --git a/mm/memory.c b/mm/memory.c
>> index fb30f7523550..abe2ea94f3f5 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
>>         return 0;
>>  }
>>
>> +#ifdef CONFIG_FLEXIBLE_THP
>> +/*
>> + * Allocates, zeros and returns a folio of the requested order for use as
>> + * anonymous memory.
>> + */
>> +static struct folio *alloc_anon_folio(struct vm_area_struct *vma,
>> +                                     unsigned long addr, int order)
>> +{
>> +       gfp_t gfp;
>> +       struct folio *folio;
>> +
>> +       if (order == 0)
>> +               return vma_alloc_zeroed_movable_folio(vma, addr);
>> +
>> +       gfp = vma_thp_gfp_mask(vma);
>> +       folio = vma_alloc_folio(gfp, order, vma, addr, true);
>> +       if (folio)
>> +               clear_huge_page(&folio->page, addr, folio_nr_pages(folio));
>> +
>> +       return folio;
>> +}
>> +
>> +/*
>> + * Preferred folio order to allocate for anonymous memory.
>> + */
>> +#define max_anon_folio_order(vma)      arch_wants_pte_order(vma)
>> +#else
>> +#define alloc_anon_folio(vma, addr, order) \
>> +                               vma_alloc_zeroed_movable_folio(vma, addr)
>> +#define max_anon_folio_order(vma)      0
>> +#endif
>> +
>> +/*
>> + * Returns index of first pte that is not none, or nr if all are none.
>> + */
>> +static inline int check_ptes_none(pte_t *pte, int nr)
>> +{
>> +       int i;
>> +
>> +       for (i = 0; i < nr; i++) {
>> +               if (!pte_none(ptep_get(pte++)))
>> +                       return i;
>> +       }
>> +
>> +       return nr;
>> +}
>> +
>> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
>> +{
>> +       /*
>> +        * The aim here is to determine what size of folio we should allocate
>> +        * for this fault. Factors include:
>> +        * - Order must not be higher than `order` upon entry
>> +        * - Folio must be naturally aligned within VA space
>> +        * - Folio must be fully contained inside one pmd entry
>> +        * - Folio must not breach boundaries of vma
>> +        * - Folio must not overlap any non-none ptes
>> +        *
>> +        * Additionally, we do not allow order-1 since this breaks assumptions
>> +        * elsewhere in the mm; THP pages must be at least order-2 (since they
>> +        * store state up to the 3rd struct page subpage), and these pages must
>> +        * be THP in order to correctly use pre-existing THP infrastructure such
>> +        * as folio_split().
>> +        *
>> +        * Note that the caller may or may not choose to lock the pte. If
>> +        * unlocked, the result is racy and the user must re-check any overlap
>> +        * with non-none ptes under the lock.
>> +        */
>> +
>> +       struct vm_area_struct *vma = vmf->vma;
>> +       int nr;
>> +       unsigned long addr;
>> +       pte_t *pte;
>> +       pte_t *first_set = NULL;
>> +       int ret;
>> +
>> +       order = min(order, PMD_SHIFT - PAGE_SHIFT);
>> +
>> +       for (; order > 1; order--) {
> 
> I'm not sure how we can justify this policy. As an initial step, it'd
> be a lot easier to sell if we only considered the order of
> arch_wants_pte_order() and the order 0.

My justification is in the cover letter; I see a performance regression (vs the
unpatched kernel) when using the policy you suggest. This policy performs much
better in my tests. (I'll reply directly to your follow-up questions in the
cover letter shortly).

What are your technical concerns about this approach? It is pretty lightweight
(I only touch each PTE once, regardless of the number of loops). If we have
strong technical reasons for reverting to the less performant approach then fair
enough, but I'd like to hear the rationale first.

> 
>> +               nr = 1 << order;
>> +               addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
>> +               pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
>> +
>> +               /* Check vma bounds. */
>> +               if (addr < vma->vm_start ||
>> +                   addr + (nr << PAGE_SHIFT) > vma->vm_end)
>> +                       continue;
>> +
>> +               /* Ptes covered by order already known to be none. */
>> +               if (pte + nr <= first_set)
>> +                       break;
>> +
>> +               /* Already found set pte in range covered by order. */
>> +               if (pte <= first_set)
>> +                       continue;
>> +
>> +               /* Need to check if all the ptes are none. */
>> +               ret = check_ptes_none(pte, nr);
>> +               if (ret == nr)
>> +                       break;
>> +
>> +               first_set = pte + ret;
>> +       }
>> +
>> +       if (order == 1)
>> +               order = 0;
>> +
>> +       return order;
>> +}
> 
> Everything above can be simplified into two helpers:
> vmf_pte_range_changed() and alloc_anon_folio() (or whatever names you
> prefer). Details below.
> 
>>  /*
>>   * Handle write page faults for pages that can be reused in the current vma
>>   *
>> @@ -3073,7 +3183,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>>                 goto oom;
>>
>>         if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
>> -               new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
>> +               new_folio = alloc_anon_folio(vma, vmf->address, 0);
> 
> This seems unnecessary for now. Later on, we could fill in an aligned
> area with multiple write-protected zero pages during a read fault and
> then replace them with a large folio here.

I don't have a strong opinion. I thought that it would be neater to use the same
API everywhere, but happy to revert.

> 
>>                 if (!new_folio)
>>                         goto oom;
>>         } else {
>> @@ -4040,6 +4150,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>         struct folio *folio;
>>         vm_fault_t ret = 0;
>>         pte_t entry;
>> +       int order;
>> +       int pgcount;
>> +       unsigned long addr;
>>
>>         /* File mapping without ->vm_ops ? */
>>         if (vma->vm_flags & VM_SHARED)
>> @@ -4081,24 +4194,51 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>                         pte_unmap_unlock(vmf->pte, vmf->ptl);
>>                         return handle_userfault(vmf, VM_UFFD_MISSING);
>>                 }
>> -               goto setpte;
>> +               if (uffd_wp)
>> +                       entry = pte_mkuffd_wp(entry);
>> +               set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>> +
>> +               /* No need to invalidate - it was non-present before */
>> +               update_mmu_cache(vma, vmf->address, vmf->pte);
>> +               goto unlock;
>> +       }
> 
> Nor really needed IMO. Details below.
> 
> ===
> 
>> +       /*
>> +        * If allocating a large folio, determine the biggest suitable order for
>> +        * the VMA (e.g. it must not exceed the VMA's bounds, it must not
>> +        * overlap with any populated PTEs, etc). We are not under the ptl here
>> +        * so we will need to re-check that we are not overlapping any populated
>> +        * PTEs once we have the lock.
>> +        */
>> +       order = uffd_wp ? 0 : max_anon_folio_order(vma);
>> +       if (order > 0) {
>> +               vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>> +               order = calc_anon_folio_order_alloc(vmf, order);
>> +               pte_unmap(vmf->pte);
>>         }
> 
> ===
> 
> The section above together with the section below should be wrapped in a helper.
> 
>> -       /* Allocate our own private page. */
>> +       /* Allocate our own private folio. */
>>         if (unlikely(anon_vma_prepare(vma)))
>>                 goto oom;
> 
> ===
> 
>> -       folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
>> +       folio = alloc_anon_folio(vma, vmf->address, order);
>> +       if (!folio && order > 0) {
>> +               order = 0;
>> +               folio = alloc_anon_folio(vma, vmf->address, order);
>> +       }
> 
> ===
> 
> One helper returns a folio of order arch_wants_pte_order(), or order 0
> if it fails to allocate that order, e.g.,
> 
> folio = alloc_anon_folio(vmf);
> 
> And if vmf_orig_pte_uffd_wp(vmf) is true, the helper allocates order 0
> regardless of arch_wants_pte_order(). Upon success, it can update
> vmf->address, since if we run into a race with another PF, we exit the
> fault handler and retry anyway.
> 
>>         if (!folio)
>>                 goto oom;
>>
>> +       pgcount = 1 << order;
>> +       addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
> 
> As shown above, the helper already updates vmf->address. And mm/ never
> used pgcount before -- the convention is nr_pages = folio_nr_pages().

ACK

> 
>>         if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>>                 goto oom_free_page;
>>         folio_throttle_swaprate(folio, GFP_KERNEL);
>>
>>         /*
>>          * The memory barrier inside __folio_mark_uptodate makes sure that
>> -        * preceding stores to the page contents become visible before
>> -        * the set_pte_at() write.
>> +        * preceding stores to the folio contents become visible before
>> +        * the set_ptes() write.
> 
> We don't have set_ptes() yet.

Indeed, that's why I listed the set_ptes() patch set as a hard dependency ;-)

> 
>>          */
>>         __folio_mark_uptodate(folio);
>>
>> @@ -4107,11 +4247,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>         if (vma->vm_flags & VM_WRITE)
>>                 entry = pte_mkwrite(pte_mkdirty(entry));
>>
>> -       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>> -                       &vmf->ptl);
>> +       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>>         if (vmf_pte_changed(vmf)) {
>>                 update_mmu_tlb(vma, vmf->address, vmf->pte);
>>                 goto release;
>> +       } else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) {
>> +               goto release;
>>         }
> 
> Need new helper:
> 
>   if (vmf_pte_range_changed(vmf, nr_pages)) {
>     for (i = 0; i < nr_pages; i++)
>       update_mmu_tlb(vma, vmf->address + PAGE_SIZE * i, vmf->pte + i);
>     goto release;
>   }
> 
> (It should be fine to call update_mmu_tlb() even if it's not really necessary.)
> 
>>         ret = check_stable_address_space(vma->vm_mm);
>> @@ -4125,16 +4266,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>                 return handle_userfault(vmf, VM_UFFD_MISSING);
>>         }
>>
>> -       inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>> -       folio_add_new_anon_rmap(folio, vma, vmf->address);
>> +       folio_ref_add(folio, pgcount - 1);
>> +       add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
>> +       folio_add_new_anon_rmap(folio, vma, addr);
>>         folio_add_lru_vma(folio, vma);
>> -setpte:
>> +
>>         if (uffd_wp)
>>                 entry = pte_mkuffd_wp(entry);
>> -       set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>> +       set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
> 
> We would have to do it one by one for now.
> 
>>         /* No need to invalidate - it was non-present before */
>> -       update_mmu_cache(vma, vmf->address, vmf->pte);
>> +       update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
> 
> Ditto.
> 
> How about this (by moving mk_pte()  and its friends here):
> ...
>         folio_add_lru_vma(folio, vma);
> 
>         for (i = 0; i < nr_pages; i++) {
>                 entry = mk_pte(folio_page(folio, i), vma->vm_page_prot);
>                 entry = pte_sw_mkyoung(entry);
>                 if (vma->vm_flags & VM_WRITE)
>                         entry = pte_mkwrite(pte_mkdirty(entry));
> setpte:
>                 if (uffd_wp)
>                         entry = pte_mkuffd_wp(entry);
>                 set_pte_at(vma->vm_mm, vmf->address + PAGE_SIZE * i,
> vmf->pte + i, entry);
> 
>                 /* No need to invalidate - it was non-present before */
>                 update_mmu_cache(vma, vmf->address + PAGE_SIZE * i,
> vmf->pte + i);
>         }
> 
>>  unlock:
>>         pte_unmap_unlock(vmf->pte, vmf->ptl);
>>         return ret;
> 
> Attaching a small patch in case anything above is not clear. Please
> take a look. Thanks.

OK, I'll take a look and rework for v3.
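
For reference, a minimal sketch of the helper shape you describe (preferred
order from arch_wants_pte_order() or 0, order-0 whenever uffd-wp is in play,
vmf->address realigned on success) might look like the below; the details are
my assumptions, not taken from the attached patch:

static struct folio *alloc_anon_folio(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct folio *folio;
	int order;

	/* uffd-wp works on individual ptes, so stick to order-0. */
	order = vmf_orig_pte_uffd_wp(vmf) ? 0 : arch_wants_pte_order(vma);

	if (order > 0) {
		folio = vma_alloc_folio(vma_thp_gfp_mask(vma), order, vma,
					vmf->address, true);
		if (folio) {
			clear_huge_page(&folio->page, vmf->address,
					folio_nr_pages(folio));
			/* Assumed: realign so the folio is naturally aligned. */
			vmf->address = ALIGN_DOWN(vmf->address,
						  folio_size(folio));
			return folio;
		}
	}

	/* Fall back to a single zeroed page. */
	return vma_alloc_zeroed_movable_folio(vma, vmf->address);
}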



^ permalink raw reply	[flat|nested] 167+ messages in thread


* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-04  3:45     ` Yin, Fengwei
@ 2023-07-04 14:20       ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-04 14:20 UTC (permalink / raw)
  To: Yin, Fengwei, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi
  Cc: linux-arm-kernel, linux-kernel, linux-mm

On 04/07/2023 04:45, Yin, Fengwei wrote:
> 
> On 7/3/2023 9:53 PM, Ryan Roberts wrote:
>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
> THP is for huge page which is 2M size. We are not huge page here. But
> I don't have good name either.

Is that really true? On arm64 with 16K pages, huge pages are 32M, and with a 64K
base page they are 512M. So huge pages already have a variable size. And they
sometimes get PTE-mapped. So can't we just think of this as an extension of the
THP feature?
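
(Quick arithmetic, assuming 8-byte page table entries: a 16K table holds 2048
entries, so a PMD maps 2048 * 16K = 32M; a 64K table holds 8192 entries, so a
PMD maps 8192 * 64K = 512M.)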

> 
>> allocated in large folios of a specified order. All pages of the large
>> folio are pte-mapped during the same page fault, significantly reducing
>> the number of page faults. The number of per-page operations (e.g. ref
>> counting, rmap management lru list management) are also significantly
>> reduced since those ops now become per-folio.
>>
>> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
>> defaults to disabled for now; there is a long list of todos to make
>> FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some
>> madvise ops, etc). These items will be tackled in subsequent patches.
>>
>> When enabled, the preferred folio order is as returned by
>> arch_wants_pte_order(), which may be overridden by the arch as it sees
>> fit. Some architectures (e.g. arm64) can coalsece TLB entries if a
>> contiguous set of ptes map physically contigious, naturally aligned
>> memory, so this mechanism allows the architecture to optimize as
>> required.
>>
>> If the preferred order can't be used (e.g. because the folio would
>> breach the bounds of the vma, or because ptes in the region are already
>> mapped) then we fall back to a suitable lower order.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  mm/Kconfig  |  10 ++++
>>  mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++----
>>  2 files changed, 165 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 7672a22647b4..1c06b2c0a24e 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS
>>  	  support of file THPs will be developed in the next few release
>>  	  cycles.
>>  
>> +config FLEXIBLE_THP
>> +	bool "Flexible order THP"
>> +	depends on TRANSPARENT_HUGEPAGE
>> +	default n
>> +	help
>> +	  Use large (bigger than order-0) folios to back anonymous memory where
>> +	  possible, even if the order of the folio is smaller than the PMD
>> +	  order. This reduces the number of page faults, as well as other
>> +	  per-page overheads to improve performance for many workloads.
>> +
>>  endif # TRANSPARENT_HUGEPAGE
>>  
>>  #
>> diff --git a/mm/memory.c b/mm/memory.c
>> index fb30f7523550..abe2ea94f3f5 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
>>  	return 0;
>>  }
>>  
>> +#ifdef CONFIG_FLEXIBLE_THP
>> +/*
>> + * Allocates, zeros and returns a folio of the requested order for use as
>> + * anonymous memory.
>> + */
>> +static struct folio *alloc_anon_folio(struct vm_area_struct *vma,
>> +				      unsigned long addr, int order)
>> +{
>> +	gfp_t gfp;
>> +	struct folio *folio;
>> +
>> +	if (order == 0)
>> +		return vma_alloc_zeroed_movable_folio(vma, addr);
>> +
>> +	gfp = vma_thp_gfp_mask(vma);
>> +	folio = vma_alloc_folio(gfp, order, vma, addr, true);
>> +	if (folio)
>> +		clear_huge_page(&folio->page, addr, folio_nr_pages(folio));
>> +
>> +	return folio;
>> +}
>> +
>> +/*
>> + * Preferred folio order to allocate for anonymous memory.
>> + */
>> +#define max_anon_folio_order(vma)	arch_wants_pte_order(vma)
>> +#else
>> +#define alloc_anon_folio(vma, addr, order) \
>> +				vma_alloc_zeroed_movable_folio(vma, addr)
>> +#define max_anon_folio_order(vma)	0
>> +#endif
>> +
>> +/*
>> + * Returns index of first pte that is not none, or nr if all are none.
>> + */
>> +static inline int check_ptes_none(pte_t *pte, int nr)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < nr; i++) {
>> +		if (!pte_none(ptep_get(pte++)))
>> +			return i;
>> +	}
>> +
>> +	return nr;
>> +}
>> +
>> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
>> +{
>> +	/*
>> +	 * The aim here is to determine what size of folio we should allocate
>> +	 * for this fault. Factors include:
>> +	 * - Order must not be higher than `order` upon entry
>> +	 * - Folio must be naturally aligned within VA space
>> +	 * - Folio must be fully contained inside one pmd entry
>> +	 * - Folio must not breach boundaries of vma
>> +	 * - Folio must not overlap any non-none ptes
>> +	 *
>> +	 * Additionally, we do not allow order-1 since this breaks assumptions
>> +	 * elsewhere in the mm; THP pages must be at least order-2 (since they
>> +	 * store state up to the 3rd struct page subpage), and these pages must
>> +	 * be THP in order to correctly use pre-existing THP infrastructure such
>> +	 * as folio_split().
>> +	 *
>> +	 * Note that the caller may or may not choose to lock the pte. If
>> +	 * unlocked, the result is racy and the user must re-check any overlap
>> +	 * with non-none ptes under the lock.
>> +	 */
>> +
>> +	struct vm_area_struct *vma = vmf->vma;
>> +	int nr;
>> +	unsigned long addr;
>> +	pte_t *pte;
>> +	pte_t *first_set = NULL;
>> +	int ret;
>> +
>> +	order = min(order, PMD_SHIFT - PAGE_SHIFT);
>> +
>> +	for (; order > 1; order--) {
>> +		nr = 1 << order;
>> +		addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
>> +		pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
>> +
>> +		/* Check vma bounds. */
>> +		if (addr < vma->vm_start ||
>> +		    addr + (nr << PAGE_SHIFT) > vma->vm_end)
>> +			continue;
>> +
>> +		/* Ptes covered by order already known to be none. */
>> +		if (pte + nr <= first_set)
>> +			break;
>> +
>> +		/* Already found set pte in range covered by order. */
>> +		if (pte <= first_set)
>> +			continue;
>> +
>> +		/* Need to check if all the ptes are none. */
>> +		ret = check_ptes_none(pte, nr);
>> +		if (ret == nr)
>> +			break;
>> +
>> +		first_set = pte + ret;
>> +	}
>> +
>> +	if (order == 1)
>> +		order = 0;
>> +
>> +	return order;
>> +}
> The logic in above function should be kept is whether the order fit in vma range.
> 
> check_ptes_none() is not accurate here because no page table lock hold and concurrent
> fault could happen. So may just drop the check here? Check_ptes_none() is done after
> take the page table lock.

I agree it is just an estimate given the lock is not held; the comment at the
top says the same. But I don't think we can wait until after the lock is taken
to measure this. We can't hold the lock while allocating the folio and we need a
guess at what to allocate. If we don't guess here, we will allocate the biggest,
then take the lock, see that it doesn't fit, and exit. Then the system will
re-fault and we will follow the exact same path, ending up in a livelock.

> 
> We pick the arch prefered order or order 0 now.
> 
>> +
>>  /*
>>   * Handle write page faults for pages that can be reused in the current vma
>>   *
>> @@ -3073,7 +3183,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>>  		goto oom;
>>  
>>  	if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
>> -		new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
>> +		new_folio = alloc_anon_folio(vma, vmf->address, 0);
>>  		if (!new_folio)
>>  			goto oom;
>>  	} else {
>> @@ -4040,6 +4150,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  	struct folio *folio;
>>  	vm_fault_t ret = 0;
>>  	pte_t entry;
>> +	int order;
>> +	int pgcount;
>> +	unsigned long addr;
>>  
>>  	/* File mapping without ->vm_ops ? */
>>  	if (vma->vm_flags & VM_SHARED)
>> @@ -4081,24 +4194,51 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  			pte_unmap_unlock(vmf->pte, vmf->ptl);
>>  			return handle_userfault(vmf, VM_UFFD_MISSING);
>>  		}
>> -		goto setpte;
>> +		if (uffd_wp)
>> +			entry = pte_mkuffd_wp(entry);
>> +		set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>> +
>> +		/* No need to invalidate - it was non-present before */
>> +		update_mmu_cache(vma, vmf->address, vmf->pte);
>> +		goto unlock;
>> +	}
>> +
>> +	/*
>> +	 * If allocating a large folio, determine the biggest suitable order for
>> +	 * the VMA (e.g. it must not exceed the VMA's bounds, it must not
>> +	 * overlap with any populated PTEs, etc). We are not under the ptl here
>> +	 * so we will need to re-check that we are not overlapping any populated
>> +	 * PTEs once we have the lock.
>> +	 */
>> +	order = uffd_wp ? 0 : max_anon_folio_order(vma);
>> +	if (order > 0) {
>> +		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>> +		order = calc_anon_folio_order_alloc(vmf, order);
>> +		pte_unmap(vmf->pte);
>>  	}
>>  
>> -	/* Allocate our own private page. */
>> +	/* Allocate our own private folio. */
>>  	if (unlikely(anon_vma_prepare(vma)))
>>  		goto oom;
>> -	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
>> +	folio = alloc_anon_folio(vma, vmf->address, order);
>> +	if (!folio && order > 0) {
>> +		order = 0;
>> +		folio = alloc_anon_folio(vma, vmf->address, order);
>> +	}
>>  	if (!folio)
>>  		goto oom;
>>  
>> +	pgcount = 1 << order;
>> +	addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
>> +
>>  	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>>  		goto oom_free_page;
>>  	folio_throttle_swaprate(folio, GFP_KERNEL);
>>  
>>  	/*
>>  	 * The memory barrier inside __folio_mark_uptodate makes sure that
>> -	 * preceding stores to the page contents become visible before
>> -	 * the set_pte_at() write.
>> +	 * preceding stores to the folio contents become visible before
>> +	 * the set_ptes() write.
>>  	 */
>>  	__folio_mark_uptodate(folio);
>>  
>> @@ -4107,11 +4247,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  	if (vma->vm_flags & VM_WRITE)
>>  		entry = pte_mkwrite(pte_mkdirty(entry));
>>  
>> -	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>> -			&vmf->ptl);
>> +	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>>  	if (vmf_pte_changed(vmf)) {
>>  		update_mmu_tlb(vma, vmf->address, vmf->pte);
>>  		goto release;
>> +	} else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) {
> This could be the case that we allocated order 4 page and find a neighbor PTE is
> filled by concurrent fault. Should we put current folio and fallback to order 0
> and try again immedately (goto order 0 allocation instead of return from this
> function which will go through some page fault path again)?

That's how it worked in v1, but I had review comments from Yang Shi asking me to
re-fault instead. This approach is certainly cleaner from a code point of view.
And I expect races of that nature will be rare.

> 
> 
> Regards
> Yin, Fengwei
> 
>> +		goto release;
>>  	}
>>  
>>  	ret = check_stable_address_space(vma->vm_mm);
>> @@ -4125,16 +4266,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  		return handle_userfault(vmf, VM_UFFD_MISSING);
>>  	}
>>  
>> -	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>> -	folio_add_new_anon_rmap(folio, vma, vmf->address);
>> +	folio_ref_add(folio, pgcount - 1);
>> +	add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
>> +	folio_add_new_anon_rmap(folio, vma, addr);
>>  	folio_add_lru_vma(folio, vma);
>> -setpte:
>> +
>>  	if (uffd_wp)
>>  		entry = pte_mkuffd_wp(entry);
>> -	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>> +	set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
>>  
>>  	/* No need to invalidate - it was non-present before */
>> -	update_mmu_cache(vma, vmf->address, vmf->pte);
>> +	update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
>>  unlock:
>>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
>>  	return ret;


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
@ 2023-07-04 14:20       ` Ryan Roberts
  0 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-04 14:20 UTC (permalink / raw)
  To: Yin, Fengwei, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi
  Cc: linux-arm-kernel, linux-kernel, linux-mm

On 04/07/2023 04:45, Yin, Fengwei wrote:
> 
> On 7/3/2023 9:53 PM, Ryan Roberts wrote:
>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
> THP is for huge page which is 2M size. We are not huge page here. But
> I don't have good name either.

Is that really true? On arm64 with 16K pages, huge pages are 32M and with 64K
base page, they are 512M. So huge pages already have a variable size. And they
sometimes get PTE-mapped. So can't we just think of this as an extension of the
THP feature?

> 
>> allocated in large folios of a specified order. All pages of the large
>> folio are pte-mapped during the same page fault, significantly reducing
>> the number of page faults. The number of per-page operations (e.g. ref
>> counting, rmap management lru list management) are also significantly
>> reduced since those ops now become per-folio.
>>
>> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
>> defaults to disabled for now; there is a long list of todos to make
>> FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some
>> madvise ops, etc). These items will be tackled in subsequent patches.
>>
>> When enabled, the preferred folio order is as returned by
>> arch_wants_pte_order(), which may be overridden by the arch as it sees
>> fit. Some architectures (e.g. arm64) can coalsece TLB entries if a
>> contiguous set of ptes map physically contigious, naturally aligned
>> memory, so this mechanism allows the architecture to optimize as
>> required.
>>
>> If the preferred order can't be used (e.g. because the folio would
>> breach the bounds of the vma, or because ptes in the region are already
>> mapped) then we fall back to a suitable lower order.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  mm/Kconfig  |  10 ++++
>>  mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++----
>>  2 files changed, 165 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 7672a22647b4..1c06b2c0a24e 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS
>>  	  support of file THPs will be developed in the next few release
>>  	  cycles.
>>  
>> +config FLEXIBLE_THP
>> +	bool "Flexible order THP"
>> +	depends on TRANSPARENT_HUGEPAGE
>> +	default n
>> +	help
>> +	  Use large (bigger than order-0) folios to back anonymous memory where
>> +	  possible, even if the order of the folio is smaller than the PMD
>> +	  order. This reduces the number of page faults, as well as other
>> +	  per-page overheads to improve performance for many workloads.
>> +
>>  endif # TRANSPARENT_HUGEPAGE
>>  
>>  #
>> diff --git a/mm/memory.c b/mm/memory.c
>> index fb30f7523550..abe2ea94f3f5 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
>>  	return 0;
>>  }
>>  
>> +#ifdef CONFIG_FLEXIBLE_THP
>> +/*
>> + * Allocates, zeros and returns a folio of the requested order for use as
>> + * anonymous memory.
>> + */
>> +static struct folio *alloc_anon_folio(struct vm_area_struct *vma,
>> +				      unsigned long addr, int order)
>> +{
>> +	gfp_t gfp;
>> +	struct folio *folio;
>> +
>> +	if (order == 0)
>> +		return vma_alloc_zeroed_movable_folio(vma, addr);
>> +
>> +	gfp = vma_thp_gfp_mask(vma);
>> +	folio = vma_alloc_folio(gfp, order, vma, addr, true);
>> +	if (folio)
>> +		clear_huge_page(&folio->page, addr, folio_nr_pages(folio));
>> +
>> +	return folio;
>> +}
>> +
>> +/*
>> + * Preferred folio order to allocate for anonymous memory.
>> + */
>> +#define max_anon_folio_order(vma)	arch_wants_pte_order(vma)
>> +#else
>> +#define alloc_anon_folio(vma, addr, order) \
>> +				vma_alloc_zeroed_movable_folio(vma, addr)
>> +#define max_anon_folio_order(vma)	0
>> +#endif
>> +
>> +/*
>> + * Returns index of first pte that is not none, or nr if all are none.
>> + */
>> +static inline int check_ptes_none(pte_t *pte, int nr)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < nr; i++) {
>> +		if (!pte_none(ptep_get(pte++)))
>> +			return i;
>> +	}
>> +
>> +	return nr;
>> +}
>> +
>> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
>> +{
>> +	/*
>> +	 * The aim here is to determine what size of folio we should allocate
>> +	 * for this fault. Factors include:
>> +	 * - Order must not be higher than `order` upon entry
>> +	 * - Folio must be naturally aligned within VA space
>> +	 * - Folio must be fully contained inside one pmd entry
>> +	 * - Folio must not breach boundaries of vma
>> +	 * - Folio must not overlap any non-none ptes
>> +	 *
>> +	 * Additionally, we do not allow order-1 since this breaks assumptions
>> +	 * elsewhere in the mm; THP pages must be at least order-2 (since they
>> +	 * store state up to the 3rd struct page subpage), and these pages must
>> +	 * be THP in order to correctly use pre-existing THP infrastructure such
>> +	 * as folio_split().
>> +	 *
>> +	 * Note that the caller may or may not choose to lock the pte. If
>> +	 * unlocked, the result is racy and the user must re-check any overlap
>> +	 * with non-none ptes under the lock.
>> +	 */
>> +
>> +	struct vm_area_struct *vma = vmf->vma;
>> +	int nr;
>> +	unsigned long addr;
>> +	pte_t *pte;
>> +	pte_t *first_set = NULL;
>> +	int ret;
>> +
>> +	order = min(order, PMD_SHIFT - PAGE_SHIFT);
>> +
>> +	for (; order > 1; order--) {
>> +		nr = 1 << order;
>> +		addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
>> +		pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
>> +
>> +		/* Check vma bounds. */
>> +		if (addr < vma->vm_start ||
>> +		    addr + (nr << PAGE_SHIFT) > vma->vm_end)
>> +			continue;
>> +
>> +		/* Ptes covered by order already known to be none. */
>> +		if (pte + nr <= first_set)
>> +			break;
>> +
>> +		/* Already found set pte in range covered by order. */
>> +		if (pte <= first_set)
>> +			continue;
>> +
>> +		/* Need to check if all the ptes are none. */
>> +		ret = check_ptes_none(pte, nr);
>> +		if (ret == nr)
>> +			break;
>> +
>> +		first_set = pte + ret;
>> +	}
>> +
>> +	if (order == 1)
>> +		order = 0;
>> +
>> +	return order;
>> +}
> The logic in the above function that should be kept is whether the order fits in the vma range.
> 
> check_ptes_none() is not accurate here because no page table lock is held and a concurrent
> fault could happen. So maybe just drop the check here? check_ptes_none() is done after
> taking the page table lock.

I agree it is just an estimate given the lock is not held; the comment at the
top says the same. But I don't think we can wait until after the lock is taken
to measure this. We can't hold the lock while allocating the folio and we need a
guess at what to allocate. If we don't guess here, we will allocate the biggest,
then take the lock, see that it doesn't fit, and exit. Then the system will
re-fault and we will follow the exact same path - ending up in live lock.
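
To make the unlocked estimate concrete: for each candidate order, the loop quoted
above rounds the fault address down to the folio's natural boundary and steps the
pte pointer back by the same number of pages before checking the range. A
stand-alone sketch of just that arithmetic (assuming 4K base pages; illustrative
only, not the patch code):

#include <stdio.h>

#define PAGE_SHIFT 12
#define ALIGN_DOWN(x, a) ((x) & ~((unsigned long)(a) - 1))

int main(void)
{
	unsigned long fault_addr = 0x7f1234567000UL;

	for (int order = 4; order > 1; order--) {
		unsigned long nr = 1UL << order;
		unsigned long addr = ALIGN_DOWN(fault_addr, nr << PAGE_SHIFT);
		unsigned long pte_back = (fault_addr - addr) >> PAGE_SHIFT;

		/* the folio start and how far back the pte pointer moves */
		printf("order %d: %3lu KB folio, start %#lx, pte index -%lu\n",
		       order, (nr << PAGE_SHIFT) >> 10, addr, pte_back);
	}
	return 0;
}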

> 
> We pick the arch preferred order or order 0 now.
> 
>> +
>>  /*
>>   * Handle write page faults for pages that can be reused in the current vma
>>   *
>> @@ -3073,7 +3183,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>>  		goto oom;
>>  
>>  	if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
>> -		new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
>> +		new_folio = alloc_anon_folio(vma, vmf->address, 0);
>>  		if (!new_folio)
>>  			goto oom;
>>  	} else {
>> @@ -4040,6 +4150,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  	struct folio *folio;
>>  	vm_fault_t ret = 0;
>>  	pte_t entry;
>> +	int order;
>> +	int pgcount;
>> +	unsigned long addr;
>>  
>>  	/* File mapping without ->vm_ops ? */
>>  	if (vma->vm_flags & VM_SHARED)
>> @@ -4081,24 +4194,51 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  			pte_unmap_unlock(vmf->pte, vmf->ptl);
>>  			return handle_userfault(vmf, VM_UFFD_MISSING);
>>  		}
>> -		goto setpte;
>> +		if (uffd_wp)
>> +			entry = pte_mkuffd_wp(entry);
>> +		set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>> +
>> +		/* No need to invalidate - it was non-present before */
>> +		update_mmu_cache(vma, vmf->address, vmf->pte);
>> +		goto unlock;
>> +	}
>> +
>> +	/*
>> +	 * If allocating a large folio, determine the biggest suitable order for
>> +	 * the VMA (e.g. it must not exceed the VMA's bounds, it must not
>> +	 * overlap with any populated PTEs, etc). We are not under the ptl here
>> +	 * so we will need to re-check that we are not overlapping any populated
>> +	 * PTEs once we have the lock.
>> +	 */
>> +	order = uffd_wp ? 0 : max_anon_folio_order(vma);
>> +	if (order > 0) {
>> +		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>> +		order = calc_anon_folio_order_alloc(vmf, order);
>> +		pte_unmap(vmf->pte);
>>  	}
>>  
>> -	/* Allocate our own private page. */
>> +	/* Allocate our own private folio. */
>>  	if (unlikely(anon_vma_prepare(vma)))
>>  		goto oom;
>> -	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
>> +	folio = alloc_anon_folio(vma, vmf->address, order);
>> +	if (!folio && order > 0) {
>> +		order = 0;
>> +		folio = alloc_anon_folio(vma, vmf->address, order);
>> +	}
>>  	if (!folio)
>>  		goto oom;
>>  
>> +	pgcount = 1 << order;
>> +	addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
>> +
>>  	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>>  		goto oom_free_page;
>>  	folio_throttle_swaprate(folio, GFP_KERNEL);
>>  
>>  	/*
>>  	 * The memory barrier inside __folio_mark_uptodate makes sure that
>> -	 * preceding stores to the page contents become visible before
>> -	 * the set_pte_at() write.
>> +	 * preceding stores to the folio contents become visible before
>> +	 * the set_ptes() write.
>>  	 */
>>  	__folio_mark_uptodate(folio);
>>  
>> @@ -4107,11 +4247,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  	if (vma->vm_flags & VM_WRITE)
>>  		entry = pte_mkwrite(pte_mkdirty(entry));
>>  
>> -	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>> -			&vmf->ptl);
>> +	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>>  	if (vmf_pte_changed(vmf)) {
>>  		update_mmu_tlb(vma, vmf->address, vmf->pte);
>>  		goto release;
>> +	} else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) {
> This could be the case where we allocated an order-4 page and find a neighbor PTE is
> filled by a concurrent fault. Should we put the current folio and fall back to order 0
> and try again immediately (goto the order-0 allocation instead of returning from this
> function, which will go through the page fault path again)?

That's how it worked in v1, but I had review comments from Yang Shi asking me to
re-fault instead. This approach is certainly cleaner from a code point of view.
And I expect races of that nature will be rare.

> 
> 
> Regards
> Yin, Fengwei
> 
>> +		goto release;
>>  	}
>>  
>>  	ret = check_stable_address_space(vma->vm_mm);
>> @@ -4125,16 +4266,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  		return handle_userfault(vmf, VM_UFFD_MISSING);
>>  	}
>>  
>> -	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>> -	folio_add_new_anon_rmap(folio, vma, vmf->address);
>> +	folio_ref_add(folio, pgcount - 1);
>> +	add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
>> +	folio_add_new_anon_rmap(folio, vma, addr);
>>  	folio_add_lru_vma(folio, vma);
>> -setpte:
>> +
>>  	if (uffd_wp)
>>  		entry = pte_mkuffd_wp(entry);
>> -	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>> +	set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
>>  
>>  	/* No need to invalidate - it was non-present before */
>> -	update_mmu_cache(vma, vmf->address, vmf->pte);
>> +	update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
>>  unlock:
>>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
>>  	return ret;



^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
  2023-07-04  7:11       ` Yu Zhao
@ 2023-07-04 15:36         ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-04 15:36 UTC (permalink / raw)
  To: Yu Zhao, Yin, Fengwei
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On 04/07/2023 08:11, Yu Zhao wrote:
> On Tue, Jul 4, 2023 at 12:22 AM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>
>> On 7/4/2023 10:18 AM, Yu Zhao wrote:
>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> Hi All,
>>>>
>>>> This is v2 of a series to implement variable order, large folios for anonymous
>>>> memory. The objective of this is to improve performance by allocating larger
>>>> chunks of memory during anonymous page faults. See [1] for background.
>>>
>>> Thanks for the quick response!
>>>
>>>> I've significantly reworked and simplified the patch set based on comments from
>>>> Yu Zhao (thanks for all your feedback!). I've also renamed the feature to
>>>> VARIABLE_THP, on Yu's advice.
>>>>
>>>> The last patch is for arm64 to explicitly override the default
>>>> arch_wants_pte_order() and is intended as an example. If this series is accepted
>>>> I suggest taking the first 4 patches through the mm tree and the arm64 change
>>>> could be handled through the arm64 tree separately. Neither has any build
>>>> dependency on the other.
>>>>
>>>> The one area where I haven't followed Yu's advice is in the determination of the
>>>> size of folio to use. It was suggested that I have a single preferred large
>>>> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there
>>>> being existing overlapping populated PTEs, etc) then fallback immediately to
>>>> order-0. It turned out that this approach caused a performance regression in the
>>>> Speedometer benchmark.
>>>
>>> I suppose it's regression against the v1, not the unpatched kernel.
>> From the performance data Ryan shared, it's against unpatched kernel:
>>
>> Speedometer 2.0:
>>
>> | kernel                         |   runs_per_min |
>> |:-------------------------------|---------------:|
>> | baseline-4k                    |           0.0% |
>> | anonfolio-lkml-v1              |           0.7% |
>> | anonfolio-lkml-v2-simple-order |          -0.9% |
>> | anonfolio-lkml-v2              |           0.5% |
> 
> I see. Thanks.
> 
> A couple of questions:
> 1. Do we have a stddev?

| kernel                    |   mean_abs |   std_abs |   mean_rel |   std_rel |
|:------------------------- |-----------:|----------:|-----------:|----------:|
| baseline-4k               |      117.4 |       0.8 |       0.0% |      0.7% |
| anonfolio-v1              |      118.2 |         1 |       0.7% |      0.9% |
| anonfolio-v2-simple-order |      116.4 |       1.1 |      -0.9% |      0.9% |
| anonfolio-v2              |        118 |       1.2 |       0.5% |      1.0% |

This is with 3 runs per reboot across 5 reboots, with first run after reboot
trimmed (it's always a bit slower, I assume due to cold page cache). So 10 data
points per kernel in total.

I've rerun the test multiple times and see similar results each time.

I've also run anonfolio-v2 with Kconfig FLEXIBLE_THP=disabled and in this case I
see the same performance as baseline-4k.


> 2. Do we have a theory why it regressed?

I have a woolly hypothesis; I think Chromium is doing mmap/munmap in ways that
mean when we fault, order-4 is often too big to fit in the VMA. So we fall back
to order-0. I guess this is happening so often for this workload that the cost
of doing the checks and fallback is outweighing the benefit of the memory that
does end up with order-4 folios.

I've sampled the memory in each bucket (once per second) while running and it's
roughly:

64K: 25%
32K: 15%
16K: 15%
4K: 45%

32K and 16K obviously fold into the 4K bucket with anonfolio-v2-simple-order.
But potentially, I suspect there is a lot of mmap/munmap for the smaller sizes and
the 64K contents are more static - that's just a guess though.
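
As a rough sanity check on that hypothesis (my own back-of-the-envelope
arithmetic, assuming 4K base pages and taking the sampled shares above at face
value), the relative number of anonymous faults under the two policies:

#include <stdio.h>

int main(void)
{
	double share[] = { 0.25, 0.15, 0.15, 0.45 };	/* 64K, 32K, 16K, 4K */
	int pages[]    = { 16, 8, 4, 1 };		/* 4K pages per folio */
	double binpack = 0.0, simple = 0.0;

	for (int i = 0; i < 4; i++) {
		binpack += share[i] / pages[i];
		/* simple-order policy: anything that can't use 64K stays 4K */
		simple  += share[i] / (i == 0 ? 16 : 1);
	}
	printf("relative faults: bin-pack %.3f vs simple-order %.3f (%.2fx)\n",
	       binpack, simple, simple / binpack);
	return 0;
}

On these numbers the simple-order policy takes roughly 1.5x the faults of the
bin-packing one, ignoring any second-order effects, which would be consistent
with the cost showing up in Speedometer.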

> Assuming no bugs, I don't see how a real regression could happen --
> falling back to order-0 isn't different from the original behavior.
> Ryan, could you `perf record` and `cat /proc/vmstat` and share them?

I can, but it will have to be a bit later in the week. I'll do some more test
runs overnight so we have a larger number of runs - hopefully that will tell us
to what extent this is just noise.
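
In the meantime, a quick two-sample comparison on the numbers above gives a
feel for how much could be noise (my own arithmetic, assuming the std columns
are sample standard deviations over the 10 data points per kernel):

#include <math.h>
#include <stdio.h>

int main(void)
{
	double n = 10.0;
	double base_mean = 117.4, base_std = 0.8;	/* baseline-4k */
	double simple_mean = 116.4, simple_std = 1.1;	/* v2-simple-order */

	double diff = simple_mean - base_mean;
	double se = sqrt(base_std * base_std / n + simple_std * simple_std / n);

	printf("difference %.1f runs/min, standard error %.2f, ratio %.1f\n",
	       diff, se, diff / se);
	return 0;
}

On these figures the difference is a little over two standard errors, so the
extra runs do seem worthwhile.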

I'd still like to hear a clear technical argument for why the bin-packing
approach is not the correct one!

Thanks,
Ryan




^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-04 14:20       ` Ryan Roberts
  (?)
@ 2023-07-04 23:35       ` Yin Fengwei
  -1 siblings, 0 replies; 167+ messages in thread
From: Yin Fengwei @ 2023-07-04 23:35 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi
  Cc: linux-arm-kernel, linux-kernel, linux-mm



On 7/4/23 22:20, Ryan Roberts wrote:
> On 04/07/2023 04:45, Yin, Fengwei wrote:
>>
>> On 7/3/2023 9:53 PM, Ryan Roberts wrote:
>>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
>> THP is for huge pages, which are 2M in size. We are not dealing with huge pages here. But
>> I don't have a good name either.
> 
> Is that really true? On arm64 with 16K pages, huge pages are 32M and with 64K
> base page, they are 512M. So huge pages already have a variable size. And they
> sometimes get PTE-mapped. So can't we just think of this as an extension of the
> THP feature?
My understanding is that THP has a small set of fixed sizes, which differ per arch.
The 32K or 16K folios which could be picked here are not THP sizes.
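
For reference, the PMD sizes quoted in this sub-thread follow directly from the
page-table geometry; a quick sketch (assuming arm64-style granules with 8-byte
page-table entries):

#include <stdio.h>

int main(void)
{
	unsigned long page_shift[] = { 12, 14, 16 };	/* 4K, 16K, 64K granules */

	for (int i = 0; i < 3; i++) {
		unsigned long ptes_per_pmd = 1UL << (page_shift[i] - 3);
		unsigned long pmd_size = ptes_per_pmd << page_shift[i];

		printf("%3luK pages: %4lu ptes per pmd, PMD-size THP = %4luM\n",
		       1UL << (page_shift[i] - 10), ptes_per_pmd, pmd_size >> 20);
	}
	return 0;
}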

> 
>>
>>> allocated in large folios of a specified order. All pages of the large
>>> folio are pte-mapped during the same page fault, significantly reducing
>>> the number of page faults. The number of per-page operations (e.g. ref
>>> counting, rmap management lru list management) are also significantly
>>> reduced since those ops now become per-folio.
>>>
>>> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
>>> defaults to disabled for now; there is a long list of todos to make
>>> FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some
>>> madvise ops, etc). These items will be tackled in subsequent patches.
>>>
>>> When enabled, the preferred folio order is as returned by
>>> arch_wants_pte_order(), which may be overridden by the arch as it sees
>>> fit. Some architectures (e.g. arm64) can coalsece TLB entries if a
>>> contiguous set of ptes map physically contigious, naturally aligned
>>> memory, so this mechanism allows the architecture to optimize as
>>> required.
>>>
>>> If the preferred order can't be used (e.g. because the folio would
>>> breach the bounds of the vma, or because ptes in the region are already
>>> mapped) then we fall back to a suitable lower order.
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>  mm/Kconfig  |  10 ++++
>>>  mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++----
>>>  2 files changed, 165 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>> index 7672a22647b4..1c06b2c0a24e 100644
>>> --- a/mm/Kconfig
>>> +++ b/mm/Kconfig
>>> @@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS
>>>  	  support of file THPs will be developed in the next few release
>>>  	  cycles.
>>>  
>>> +config FLEXIBLE_THP
>>> +	bool "Flexible order THP"
>>> +	depends on TRANSPARENT_HUGEPAGE
>>> +	default n
>>> +	help
>>> +	  Use large (bigger than order-0) folios to back anonymous memory where
>>> +	  possible, even if the order of the folio is smaller than the PMD
>>> +	  order. This reduces the number of page faults, as well as other
>>> +	  per-page overheads to improve performance for many workloads.
>>> +
>>>  endif # TRANSPARENT_HUGEPAGE
>>>  
>>>  #
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index fb30f7523550..abe2ea94f3f5 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
>>>  	return 0;
>>>  }
>>>  
>>> +#ifdef CONFIG_FLEXIBLE_THP
>>> +/*
>>> + * Allocates, zeros and returns a folio of the requested order for use as
>>> + * anonymous memory.
>>> + */
>>> +static struct folio *alloc_anon_folio(struct vm_area_struct *vma,
>>> +				      unsigned long addr, int order)
>>> +{
>>> +	gfp_t gfp;
>>> +	struct folio *folio;
>>> +
>>> +	if (order == 0)
>>> +		return vma_alloc_zeroed_movable_folio(vma, addr);
>>> +
>>> +	gfp = vma_thp_gfp_mask(vma);
>>> +	folio = vma_alloc_folio(gfp, order, vma, addr, true);
>>> +	if (folio)
>>> +		clear_huge_page(&folio->page, addr, folio_nr_pages(folio));
>>> +
>>> +	return folio;
>>> +}
>>> +
>>> +/*
>>> + * Preferred folio order to allocate for anonymous memory.
>>> + */
>>> +#define max_anon_folio_order(vma)	arch_wants_pte_order(vma)
>>> +#else
>>> +#define alloc_anon_folio(vma, addr, order) \
>>> +				vma_alloc_zeroed_movable_folio(vma, addr)
>>> +#define max_anon_folio_order(vma)	0
>>> +#endif
>>> +
>>> +/*
>>> + * Returns index of first pte that is not none, or nr if all are none.
>>> + */
>>> +static inline int check_ptes_none(pte_t *pte, int nr)
>>> +{
>>> +	int i;
>>> +
>>> +	for (i = 0; i < nr; i++) {
>>> +		if (!pte_none(ptep_get(pte++)))
>>> +			return i;
>>> +	}
>>> +
>>> +	return nr;
>>> +}
>>> +
>>> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
>>> +{
>>> +	/*
>>> +	 * The aim here is to determine what size of folio we should allocate
>>> +	 * for this fault. Factors include:
>>> +	 * - Order must not be higher than `order` upon entry
>>> +	 * - Folio must be naturally aligned within VA space
>>> +	 * - Folio must be fully contained inside one pmd entry
>>> +	 * - Folio must not breach boundaries of vma
>>> +	 * - Folio must not overlap any non-none ptes
>>> +	 *
>>> +	 * Additionally, we do not allow order-1 since this breaks assumptions
>>> +	 * elsewhere in the mm; THP pages must be at least order-2 (since they
>>> +	 * store state up to the 3rd struct page subpage), and these pages must
>>> +	 * be THP in order to correctly use pre-existing THP infrastructure such
>>> +	 * as folio_split().
>>> +	 *
>>> +	 * Note that the caller may or may not choose to lock the pte. If
>>> +	 * unlocked, the result is racy and the user must re-check any overlap
>>> +	 * with non-none ptes under the lock.
>>> +	 */
>>> +
>>> +	struct vm_area_struct *vma = vmf->vma;
>>> +	int nr;
>>> +	unsigned long addr;
>>> +	pte_t *pte;
>>> +	pte_t *first_set = NULL;
>>> +	int ret;
>>> +
>>> +	order = min(order, PMD_SHIFT - PAGE_SHIFT);
>>> +
>>> +	for (; order > 1; order--) {
>>> +		nr = 1 << order;
>>> +		addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
>>> +		pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
>>> +
>>> +		/* Check vma bounds. */
>>> +		if (addr < vma->vm_start ||
>>> +		    addr + (nr << PAGE_SHIFT) > vma->vm_end)
>>> +			continue;
>>> +
>>> +		/* Ptes covered by order already known to be none. */
>>> +		if (pte + nr <= first_set)
>>> +			break;
>>> +
>>> +		/* Already found set pte in range covered by order. */
>>> +		if (pte <= first_set)
>>> +			continue;
>>> +
>>> +		/* Need to check if all the ptes are none. */
>>> +		ret = check_ptes_none(pte, nr);
>>> +		if (ret == nr)
>>> +			break;
>>> +
>>> +		first_set = pte + ret;
>>> +	}
>>> +
>>> +	if (order == 1)
>>> +		order = 0;
>>> +
>>> +	return order;
>>> +}
>> The logic in the above function that should be kept is whether the order fits in the vma range.
>>
>> check_ptes_none() is not accurate here because no page table lock is held and a concurrent
>> fault could happen. So maybe just drop the check here? check_ptes_none() is done after
>> taking the page table lock.
> 
> I agree it is just an estimate given the lock is not held; the comment at the
> top says the same. But I don't think we can wait until after the lock is taken
> to measure this. We can't hold the lock while allocating the folio and we need a
> guess at what to allocate. If we don't guess here, we will allocate the biggest,
> then take the lock, see that it doesn't fit, and exit. Then the system will
> re-fault and we will follow the exact same path - ending up in live lock.
It will not if we try order-0 immediately. But see my comments on the refault below.

> 
>>
>> We pick the arch preferred order or order 0 now.
>>
>>> +
>>>  /*
>>>   * Handle write page faults for pages that can be reused in the current vma
>>>   *
>>> @@ -3073,7 +3183,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>>>  		goto oom;
>>>  
>>>  	if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
>>> -		new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
>>> +		new_folio = alloc_anon_folio(vma, vmf->address, 0);
>>>  		if (!new_folio)
>>>  			goto oom;
>>>  	} else {
>>> @@ -4040,6 +4150,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>>  	struct folio *folio;
>>>  	vm_fault_t ret = 0;
>>>  	pte_t entry;
>>> +	int order;
>>> +	int pgcount;
>>> +	unsigned long addr;
>>>  
>>>  	/* File mapping without ->vm_ops ? */
>>>  	if (vma->vm_flags & VM_SHARED)
>>> @@ -4081,24 +4194,51 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>>  			pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>  			return handle_userfault(vmf, VM_UFFD_MISSING);
>>>  		}
>>> -		goto setpte;
>>> +		if (uffd_wp)
>>> +			entry = pte_mkuffd_wp(entry);
>>> +		set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>>> +
>>> +		/* No need to invalidate - it was non-present before */
>>> +		update_mmu_cache(vma, vmf->address, vmf->pte);
>>> +		goto unlock;
>>> +	}
>>> +
>>> +	/*
>>> +	 * If allocating a large folio, determine the biggest suitable order for
>>> +	 * the VMA (e.g. it must not exceed the VMA's bounds, it must not
>>> +	 * overlap with any populated PTEs, etc). We are not under the ptl here
>>> +	 * so we will need to re-check that we are not overlapping any populated
>>> +	 * PTEs once we have the lock.
>>> +	 */
>>> +	order = uffd_wp ? 0 : max_anon_folio_order(vma);
>>> +	if (order > 0) {
>>> +		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>>> +		order = calc_anon_folio_order_alloc(vmf, order);
>>> +		pte_unmap(vmf->pte);
>>>  	}
>>>  
>>> -	/* Allocate our own private page. */
>>> +	/* Allocate our own private folio. */
>>>  	if (unlikely(anon_vma_prepare(vma)))
>>>  		goto oom;
>>> -	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
>>> +	folio = alloc_anon_folio(vma, vmf->address, order);
>>> +	if (!folio && order > 0) {
>>> +		order = 0;
>>> +		folio = alloc_anon_folio(vma, vmf->address, order);
>>> +	}
>>>  	if (!folio)
>>>  		goto oom;
>>>  
>>> +	pgcount = 1 << order;
>>> +	addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
>>> +
>>>  	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>>>  		goto oom_free_page;
>>>  	folio_throttle_swaprate(folio, GFP_KERNEL);
>>>  
>>>  	/*
>>>  	 * The memory barrier inside __folio_mark_uptodate makes sure that
>>> -	 * preceding stores to the page contents become visible before
>>> -	 * the set_pte_at() write.
>>> +	 * preceding stores to the folio contents become visible before
>>> +	 * the set_ptes() write.
>>>  	 */
>>>  	__folio_mark_uptodate(folio);
>>>  
>>> @@ -4107,11 +4247,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>>  	if (vma->vm_flags & VM_WRITE)
>>>  		entry = pte_mkwrite(pte_mkdirty(entry));
>>>  
>>> -	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>>> -			&vmf->ptl);
>>> +	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>>>  	if (vmf_pte_changed(vmf)) {
>>>  		update_mmu_tlb(vma, vmf->address, vmf->pte);
>>>  		goto release;
>>> +	} else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) {
>> This could be the case where we allocated an order-4 page and find a neighbor PTE is
>> filled by a concurrent fault. Should we put the current folio and fall back to order 0
>> and try again immediately (goto the order-0 allocation instead of returning from this
>> function, which will go through the page fault path again)?
> 
> That's how it worked in v1, but I had review comments from Yang Shi asking me to
> re-fault instead. This approach is certainly cleaner from a code point of view.
> And I expect races of that nature will be rare.
I must have missed that discussion in v1. My bad. I should have jumped into that discussion.
So I will drop my comment here even though I still think we should avoid the refault.
I don't want the comments going back and forth.


Regards
Yin, Fengwei

> 
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>> +		goto release;
>>>  	}
>>>  
>>>  	ret = check_stable_address_space(vma->vm_mm);
>>> @@ -4125,16 +4266,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>>  		return handle_userfault(vmf, VM_UFFD_MISSING);
>>>  	}
>>>  
>>> -	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>> -	folio_add_new_anon_rmap(folio, vma, vmf->address);
>>> +	folio_ref_add(folio, pgcount - 1);
>>> +	add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
>>> +	folio_add_new_anon_rmap(folio, vma, addr);
>>>  	folio_add_lru_vma(folio, vma);
>>> -setpte:
>>> +
>>>  	if (uffd_wp)
>>>  		entry = pte_mkuffd_wp(entry);
>>> -	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>>> +	set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
>>>  
>>>  	/* No need to invalidate - it was non-present before */
>>> -	update_mmu_cache(vma, vmf->address, vmf->pte);
>>> +	update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
>>>  unlock:
>>>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>  	return ret;
> 

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-04 14:08       ` Ryan Roberts
@ 2023-07-04 23:47         ` Yu Zhao
  -1 siblings, 0 replies; 167+ messages in thread
From: Yu Zhao @ 2023-07-04 23:47 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On Tue, Jul 4, 2023 at 8:08 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 04/07/2023 02:35, Yu Zhao wrote:
> > On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
> >> allocated in large folios of a specified order. All pages of the large
> >> folio are pte-mapped during the same page fault, significantly reducing
> >> the number of page faults. The number of per-page operations (e.g. ref
> >> counting, rmap management lru list management) are also significantly
> >> reduced since those ops now become per-folio.
> >>
> >> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
> >> defaults to disabled for now; there is a long list of todos to make
> >> FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some
> >> madvise ops, etc). These items will be tackled in subsequent patches.
> >>
> >> When enabled, the preferred folio order is as returned by
> >> arch_wants_pte_order(), which may be overridden by the arch as it sees
> >> fit. Some architectures (e.g. arm64) can coalsece TLB entries if a
> >
> > coalesce
>
> ACK
>
> >
> >> contiguous set of ptes map physically contigious, naturally aligned
> >
> > contiguous
>
> ACK
>
> >
> >> memory, so this mechanism allows the architecture to optimize as
> >> required.
> >>
> >> If the preferred order can't be used (e.g. because the folio would
> >> breach the bounds of the vma, or because ptes in the region are already
> >> mapped) then we fall back to a suitable lower order.
> >>
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >> ---
> >>  mm/Kconfig  |  10 ++++
> >>  mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++----
> >>  2 files changed, 165 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/mm/Kconfig b/mm/Kconfig
> >> index 7672a22647b4..1c06b2c0a24e 100644
> >> --- a/mm/Kconfig
> >> +++ b/mm/Kconfig
> >> @@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS
> >>           support of file THPs will be developed in the next few release
> >>           cycles.
> >>
> >> +config FLEXIBLE_THP
> >> +       bool "Flexible order THP"
> >> +       depends on TRANSPARENT_HUGEPAGE
> >> +       default n
> >
> > The default value is already N.
>
> Is there a coding standard for this? Personally I prefer to make it explicit.
>
> >
> >> +       help
> >> +         Use large (bigger than order-0) folios to back anonymous memory where
> >> +         possible, even if the order of the folio is smaller than the PMD
> >> +         order. This reduces the number of page faults, as well as other
> >> +         per-page overheads to improve performance for many workloads.
> >> +
> >>  endif # TRANSPARENT_HUGEPAGE
> >>
> >>  #
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index fb30f7523550..abe2ea94f3f5 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
> >>         return 0;
> >>  }
> >>
> >> +#ifdef CONFIG_FLEXIBLE_THP
> >> +/*
> >> + * Allocates, zeros and returns a folio of the requested order for use as
> >> + * anonymous memory.
> >> + */
> >> +static struct folio *alloc_anon_folio(struct vm_area_struct *vma,
> >> +                                     unsigned long addr, int order)
> >> +{
> >> +       gfp_t gfp;
> >> +       struct folio *folio;
> >> +
> >> +       if (order == 0)
> >> +               return vma_alloc_zeroed_movable_folio(vma, addr);
> >> +
> >> +       gfp = vma_thp_gfp_mask(vma);
> >> +       folio = vma_alloc_folio(gfp, order, vma, addr, true);
> >> +       if (folio)
> >> +               clear_huge_page(&folio->page, addr, folio_nr_pages(folio));
> >> +
> >> +       return folio;
> >> +}
> >> +
> >> +/*
> >> + * Preferred folio order to allocate for anonymous memory.
> >> + */
> >> +#define max_anon_folio_order(vma)      arch_wants_pte_order(vma)
> >> +#else
> >> +#define alloc_anon_folio(vma, addr, order) \
> >> +                               vma_alloc_zeroed_movable_folio(vma, addr)
> >> +#define max_anon_folio_order(vma)      0
> >> +#endif
> >> +
> >> +/*
> >> + * Returns index of first pte that is not none, or nr if all are none.
> >> + */
> >> +static inline int check_ptes_none(pte_t *pte, int nr)
> >> +{
> >> +       int i;
> >> +
> >> +       for (i = 0; i < nr; i++) {
> >> +               if (!pte_none(ptep_get(pte++)))
> >> +                       return i;
> >> +       }
> >> +
> >> +       return nr;
> >> +}
> >> +
> >> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
> >> +{
> >> +       /*
> >> +        * The aim here is to determine what size of folio we should allocate
> >> +        * for this fault. Factors include:
> >> +        * - Order must not be higher than `order` upon entry
> >> +        * - Folio must be naturally aligned within VA space
> >> +        * - Folio must be fully contained inside one pmd entry
> >> +        * - Folio must not breach boundaries of vma
> >> +        * - Folio must not overlap any non-none ptes
> >> +        *
> >> +        * Additionally, we do not allow order-1 since this breaks assumptions
> >> +        * elsewhere in the mm; THP pages must be at least order-2 (since they
> >> +        * store state up to the 3rd struct page subpage), and these pages must
> >> +        * be THP in order to correctly use pre-existing THP infrastructure such
> >> +        * as folio_split().
> >> +        *
> >> +        * Note that the caller may or may not choose to lock the pte. If
> >> +        * unlocked, the result is racy and the user must re-check any overlap
> >> +        * with non-none ptes under the lock.
> >> +        */
> >> +
> >> +       struct vm_area_struct *vma = vmf->vma;
> >> +       int nr;
> >> +       unsigned long addr;
> >> +       pte_t *pte;
> >> +       pte_t *first_set = NULL;
> >> +       int ret;
> >> +
> >> +       order = min(order, PMD_SHIFT - PAGE_SHIFT);
> >> +
> >> +       for (; order > 1; order--) {
> >
> > I'm not sure how we can justify this policy. As an initial step, it'd
> > be a lot easier to sell if we only considered the order of
> > arch_wants_pte_order() and the order 0.
>
> My justification is in the cover letter; I see performance regression (vs the
> unpatched kernel) when using the policy you suggest. This policy performs much
> better in my tests. (I'll reply directly to your follow up questions in the
> cover letter shortly).
>
> What are your technical concerns about this approach? It is pretty lightweight
> (I only touch each PTE once, regardless of the number of loops). If we have
> strong technical reasons for reverting to the less performant approach then fair
> enough, but I'd like to hear the rationale first.

Yes, mainly from three different angles:
1. The engineering principle: we'd want to separate the mechanical
part and the policy part when attacking something large. This way it'd
be easier to root cause any regressions if they happen. In our case,
assuming the regression is real, it might actually prove my point
here: I really don't think the two checks (if a vma range fits and if
it does, which is unlikely according to your description, if all 64
PTEs are none) caused the regression. My theory is that 64KB itself
caused the regression, but smaller sizes made an improvement. If this
is really the case, I'd say the fallback policy masked the real
problem, which is that 64KB is too large to begin with.
2. The benchmark methodology: I appreciate your effort in doing it,
but we also need to consider that the setup is an uncommon scenario.
The common scenarios are devices that have been running for weeks
without reboots, generally having higher external fragmentation. In
addition, for client devices, they are often under memory pressure,
which makes fragmentation worse. So we should take the result with a
grain of salt, and for that matter, the same goes for results taken right
after fresh reboots.
3. The technical concern: an ideal policy would consider all three
major factors: the h/w features, userspace behaviors and the page
allocator behavior. So far we only have the first one handy. The
second one is too challenging, so let's forget about it for now. The
third one is why I really don't like this best-fit policy. By falling
back to smaller orders, we can waste a limited number of physically
contiguous pages on wrong vmas (small vmas only), leading to failures
to serve large vmas which otherwise would have a higher overall ROI.
This can only be addressed within the page allocator: we need to
enlighten it to return the highest order available, i.e., not breaking
up any higher orders.

I'm not really saying we should never try this fallback policy. I'm
just thinking we can leave it for later, probably after we've
addressed all the concerns with basic functionality.

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
  2023-07-04 15:36         ` Ryan Roberts
  (?)
@ 2023-07-04 23:52         ` Yin Fengwei
  2023-07-05  0:21             ` Yu Zhao
  -1 siblings, 1 reply; 167+ messages in thread
From: Yin Fengwei @ 2023-07-04 23:52 UTC (permalink / raw)
  To: Ryan Roberts, Yu Zhao
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm



On 7/4/23 23:36, Ryan Roberts wrote:
> On 04/07/2023 08:11, Yu Zhao wrote:
>> On Tue, Jul 4, 2023 at 12:22 AM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>
>>> On 7/4/2023 10:18 AM, Yu Zhao wrote:
>>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>
>>>>> Hi All,
>>>>>
>>>>> This is v2 of a series to implement variable order, large folios for anonymous
>>>>> memory. The objective of this is to improve performance by allocating larger
>>>>> chunks of memory during anonymous page faults. See [1] for background.
>>>>
>>>> Thanks for the quick response!
>>>>
>>>>> I've significantly reworked and simplified the patch set based on comments from
>>>>> Yu Zhao (thanks for all your feedback!). I've also renamed the feature to
>>>>> VARIABLE_THP, on Yu's advice.
>>>>>
>>>>> The last patch is for arm64 to explicitly override the default
>>>>> arch_wants_pte_order() and is intended as an example. If this series is accepted
>>>>> I suggest taking the first 4 patches through the mm tree and the arm64 change
>>>>> could be handled through the arm64 tree separately. Neither has any build
>>>>> dependency on the other.
>>>>>
>>>>> The one area where I haven't followed Yu's advice is in the determination of the
>>>>> size of folio to use. It was suggested that I have a single preferred large
>>>>> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there
>>>>> being existing overlapping populated PTEs, etc) then fallback immediately to
>>>>> order-0. It turned out that this approach caused a performance regression in the
>>>>> Speedometer benchmark.
>>>>
>>>> I suppose it's regression against the v1, not the unpatched kernel.
>>> From the performance data Ryan shared, it's against unpatched kernel:
>>>
>>> Speedometer 2.0:
>>>
>>> | kernel                         |   runs_per_min |
>>> |:-------------------------------|---------------:|
>>> | baseline-4k                    |           0.0% |
>>> | anonfolio-lkml-v1              |           0.7% |
>>> | anonfolio-lkml-v2-simple-order |          -0.9% |
>>> | anonfolio-lkml-v2              |           0.5% |
>>
>> I see. Thanks.
>>
>> A couple of questions:
>> 1. Do we have a stddev?
> 
> | kernel                    |   mean_abs |   std_abs |   mean_rel |   std_rel |
> |:------------------------- |-----------:|----------:|-----------:|----------:|
> | baseline-4k               |      117.4 |       0.8 |       0.0% |      0.7% |
> | anonfolio-v1              |      118.2 |         1 |       0.7% |      0.9% |
> | anonfolio-v2-simple-order |      116.4 |       1.1 |      -0.9% |      0.9% |
> | anonfolio-v2              |        118 |       1.2 |       0.5% |      1.0% |
> 
> This is with 3 runs per reboot across 5 reboots, with first run after reboot
> trimmed (it's always a bit slower, I assume due to cold page cache). So 10 data
> points per kernel in total.
> 
> I've rerun the test multiple times and see similar results each time.
> 
> I've also run anonfolio-v2 with Kconfig FLEXIBLE_THP=disabled and in this case I
> see the same performance as baseline-4k.
> 
> 
>> 2. Do we have a theory why it regressed?
> 
> I have a woolly hypothesis; I think Chromium is doing mmap/munmap in ways that
> mean when we fault, order-4 is often too big to fit in the VMA. So we fallback
> to order-0. I guess this is happening so often for this workload that the cost
> of doing the checks and fallback is outweighing the benefit of the memory that
> does end up with order-4 folios.
> 
> I've sampled the memory in each bucket (once per second) while running and its
> roughly:
> 
> 64K: 25%
> 32K: 15%
> 16K: 15%
> 4K: 45%
> 
> 32K and 16K obviously fold into the 4K bucket with anonfolio-v2-simple-order.
> But potentially, I suspect there is lots of mmap/unmap for the smaller sizes and
> the 64K contents is more static - that's just a guess though.
So this sounds like the out-of-vma-range case.

> 
>> Assuming no bugs, I don't see how a real regression could happen --
>> falling back to order-0 isn't different from the original behavior.
>> Ryan, could you `perf record` and `cat /proc/vmstat` and share them?
> 
> I can, but it will have to be a bit later in the week. I'll do some more test
> runs overnight so we have a larger number of runs - hopefully that might tell us
> that this is noise to a certain extent.
> 
> I'd still like to hear a clear technical argument for why the bin-packing
> approach is not the correct one!
My understanding of Yu's comments (Yu, correct me if I am wrong) is that we
postpone this part of the change and get basic anon large folio support in first. Then
we can discuss which approach we should take. Maybe people will agree the retry is the
right choice, maybe another approach will be taken...

For example, for this out-of-VMA-range case, a per-VMA order should be considered.
We don't need to decide now that the retry is the approach to take.


Regards
Yin, Fengwei

> 
> Thanks,
> Ryan
> 
> 
> 

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-04 14:20       ` Ryan Roberts
@ 2023-07-04 23:57         ` Matthew Wilcox
  -1 siblings, 0 replies; 167+ messages in thread
From: Matthew Wilcox @ 2023-07-04 23:57 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Yin, Fengwei, Andrew Morton, Kirill A. Shutemov,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On Tue, Jul 04, 2023 at 03:20:35PM +0100, Ryan Roberts wrote:
> On 04/07/2023 04:45, Yin, Fengwei wrote:
> > 
> > On 7/3/2023 9:53 PM, Ryan Roberts wrote:
> >> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
> > THP is for huge pages, which are 2M in size. We are not dealing with huge pages here. But
> > I don't have a good name either.
> 
> Is that really true? On arm64 with 16K pages, huge pages are 32M and with 64K
> base page, they are 512M. So huge pages already have a variable size. And they
> sometimes get PTE-mapped. So can't we just think of this as an extension of the
> THP feature?

The confusing thing is that we have counters for the number of THP
allocated (and number of THP mapped), and for those we always use
PMD-size folios.

If we must have a config option, then this is ANON_LARGE_FOLIOS.

But why do we need a config option?  We don't have one for the
page cache, and we're better off for it.  Yes, it depends on
CONFIG_TRANSPARENT_HUGEPAGE today, but that's more of an accidental
heritage, and it'd be great to do away with that dependency eventually.

Hardware support isn't needed.  Large folios benefit us from a software
point of view.  If we need a chicken bit, we can edit the source code
to not create anon folios larger than order 0.
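
Purely as an illustration of that last point (not something in this series): such
a chicken bit could be as small as clamping the order at the one place it is
chosen, e.g. behind a hypothetical knob:

/* Hypothetical sketch, not part of the patch set. */
static bool anon_large_folios_enabled __read_mostly = true;	/* made-up knob */

static inline int anon_folio_order(struct vm_area_struct *vma)
{
	/* Chicken bit: force order-0 everywhere when the knob is cleared. */
	return anon_large_folios_enabled ? arch_wants_pte_order(vma) : 0;
}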

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
  2023-07-04 23:52         ` Yin Fengwei
@ 2023-07-05  0:21             ` Yu Zhao
  0 siblings, 0 replies; 167+ messages in thread
From: Yu Zhao @ 2023-07-05  0:21 UTC (permalink / raw)
  To: Yin Fengwei, Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On Tue, Jul 4, 2023 at 5:53 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>
>
>
> On 7/4/23 23:36, Ryan Roberts wrote:
> > On 04/07/2023 08:11, Yu Zhao wrote:
> >> On Tue, Jul 4, 2023 at 12:22 AM Yin, Fengwei <fengwei.yin@intel.com> wrote:
> >>>
> >>> On 7/4/2023 10:18 AM, Yu Zhao wrote:
> >>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>
> >>>>> Hi All,
> >>>>>
> >>>>> This is v2 of a series to implement variable order, large folios for anonymous
> >>>>> memory. The objective of this is to improve performance by allocating larger
> >>>>> chunks of memory during anonymous page faults. See [1] for background.
> >>>>
> >>>> Thanks for the quick response!
> >>>>
> >>>>> I've significantly reworked and simplified the patch set based on comments from
> >>>>> Yu Zhao (thanks for all your feedback!). I've also renamed the feature to
> >>>>> VARIABLE_THP, on Yu's advice.
> >>>>>
> >>>>> The last patch is for arm64 to explicitly override the default
> >>>>> arch_wants_pte_order() and is intended as an example. If this series is accepted
> >>>>> I suggest taking the first 4 patches through the mm tree and the arm64 change
> >>>>> could be handled through the arm64 tree separately. Neither has any build
> >>>>> dependency on the other.
> >>>>>
> >>>>> The one area where I haven't followed Yu's advice is in the determination of the
> >>>>> size of folio to use. It was suggested that I have a single preferred large
> >>>>> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there
> >>>>> being existing overlapping populated PTEs, etc) then fallback immediately to
> >>>>> order-0. It turned out that this approach caused a performance regression in the
> >>>>> Speedometer benchmark.
> >>>>
> >>>> I suppose it's regression against the v1, not the unpatched kernel.
> >>> From the performance data Ryan shared, it's against unpatched kernel:
> >>>
> >>> Speedometer 2.0:
> >>>
> >>> | kernel                         |   runs_per_min |
> >>> |:-------------------------------|---------------:|
> >>> | baseline-4k                    |           0.0% |
> >>> | anonfolio-lkml-v1              |           0.7% |
> >>> | anonfolio-lkml-v2-simple-order |          -0.9% |
> >>> | anonfolio-lkml-v2              |           0.5% |
> >>
> >> I see. Thanks.
> >>
> >> A couple of questions:
> >> 1. Do we have a stddev?
> >
> > | kernel                    |   mean_abs |   std_abs |   mean_rel |   std_rel |
> > |:------------------------- |-----------:|----------:|-----------:|----------:|
> > | baseline-4k               |      117.4 |       0.8 |       0.0% |      0.7% |
> > | anonfolio-v1              |      118.2 |         1 |       0.7% |      0.9% |
> > | anonfolio-v2-simple-order |      116.4 |       1.1 |      -0.9% |      0.9% |
> > | anonfolio-v2              |        118 |       1.2 |       0.5% |      1.0% |
> >
> > This is with 3 runs per reboot across 5 reboots, with first run after reboot
> > trimmed (it's always a bit slower, I assume due to cold page cache). So 10 data
> > points per kernel in total.
> >
> > I've rerun the test multiple times and see similar results each time.
> >
> > I've also run anonfolio-v2 with Kconfig FLEXIBLE_THP=disabled and in this case I
> > see the same performance as baseline-4k.
> >
> >
> >> 2. Do we have a theory why it regressed?
> >
> > I have a woolly hypothesis; I think Chromium is doing mmap/munmap in ways that
> > mean when we fault, order-4 is often too big to fit in the VMA. So we fallback
> > to order-0. I guess this is happening so often for this workload that the cost
> > of doing the checks and fallback is outweighing the benefit of the memory that
> > does end up with order-4 folios.
> >
> > I've sampled the memory in each bucket (once per second) while running and its
> > roughly:
> >
> > 64K: 25%
> > 32K: 15%
> > 16K: 15%
> > 4K: 45%
> >
> > 32K and 16K obviously fold into the 4K bucket with anonfolio-v2-simple-order.
> > But potentially, I suspect there is lots of mmap/unmap for the smaller sizes and
> > the 64K contents is more static - that's just a guess though.
> So this is like out of vma range thing.
>
> >
> >> Assuming no bugs, I don't see how a real regression could happen --
> >> falling back to order-0 isn't different from the original behavior.
> >> Ryan, could you `perf record` and `cat /proc/vmstat` and share them?
> >
> > I can, but it will have to be a bit later in the week. I'll do some more test
> > runs overnight so we have a larger number of runs - hopefully that might tell us
> > that this is noise to a certain extent.
> >
> > I'd still like to hear a clear technical argument for why the bin-packing
> > approach is not the correct one!
> My understanding to Yu's (Yu, correct me if I am wrong) comments is that we
> postpone this part of change and make basic anon large folio support in. Then
> discuss which approach we should take. Maybe people will agree retry is the
> choice, maybe other approach will be taken...
>
> For example, for this out of VMA range case, per VMA order should be considered.
> We don't need make decision that the retry should be taken now.

I've articulated the reasons in another email. To summarize the most
important point here: using more fallback orders makes a system reach
equilibrium faster, at which point it can no longer allocate folios of
the order returned by arch_wants_pte_order(). IOW, this best-fit policy
can reduce the number of folios of the h/w preferred order on a system
that has been running long enough.

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
  2023-07-04 12:36           ` Ryan Roberts
@ 2023-07-05  1:23             ` Yu Zhao
  -1 siblings, 0 replies; 167+ messages in thread
From: Yu Zhao @ 2023-07-05  1:23 UTC (permalink / raw)
  To: Ryan Roberts, Yin, Fengwei
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

[-- Attachment #1: Type: text/plain, Size: 4742 bytes --]

On Tue, Jul 4, 2023 at 6:36 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 04/07/2023 04:59, Yu Zhao wrote:
> > On Mon, Jul 3, 2023 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote:
> >>
> >> On Mon, Jul 3, 2023 at 8:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
> >>>
> >>>
> >>>
> >>> On 7/3/2023 9:53 PM, Ryan Roberts wrote:
> >>>> arch_wants_pte_order() can be overridden by the arch to return the
> >>>> preferred folio order for pte-mapped memory. This is useful as some
> >>>> architectures (e.g. arm64) can coalesce TLB entries when the physical
> >>>> memory is suitably contiguous.
> >>>>
> >>>> The first user for this hint will be FLEXIBLE_THP, which aims to
> >>>> allocate large folios for anonymous memory to reduce page faults and
> >>>> other per-page operation costs.
> >>>>
> >>>> Here we add the default implementation of the function, used when the
> >>>> architecture does not define it, which returns the order corresponding
> >>>> to 64K.
> >>>>
> >>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >>>> ---
> >>>>  include/linux/pgtable.h | 13 +++++++++++++
> >>>>  1 file changed, 13 insertions(+)
> >>>>
> >>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >>>> index a661a17173fa..f7e38598f20b 100644
> >>>> --- a/include/linux/pgtable.h
> >>>> +++ b/include/linux/pgtable.h
> >>>> @@ -13,6 +13,7 @@
> >>>>  #include <linux/errno.h>
> >>>>  #include <asm-generic/pgtable_uffd.h>
> >>>>  #include <linux/page_table_check.h>
> >>>> +#include <linux/sizes.h>
> >>>>
> >>>>  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
> >>>>       defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
> >>>> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void)
> >>>>  }
> >>>>  #endif
> >>>>
> >>>> +#ifndef arch_wants_pte_order
> >>>> +/*
> >>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
> >>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
> >>>> + * to be at least order-2.
> >>>> + */
> >>>> +static inline int arch_wants_pte_order(struct vm_area_struct *vma)
> >>>> +{
> >>>> +     return ilog2(SZ_64K >> PAGE_SHIFT);
> >>> Default value which is not related with any silicon may be: PAGE_ALLOC_COSTLY_ORDER?
> >>>
> >>> Also, current pcp list support cache page with order 0...PAGE_ALLOC_COSTLY_ORDER, 9.
> >>> If the pcp could cover the page, the pressure to zone lock will be reduced by pcp.
> >>
> >> The value of PAGE_ALLOC_COSTLY_ORDER is reasonable but again it's a
> >> s/w policy not a h/w preference. Besides, I don't think we can include
> >> mmzone.h in pgtable.h.
> >
> > I think we can make a compromise:
> > 1. change the default implementation of arch_has_hw_pte_young() to return 0, and
> > 2. in memory.c, we can try PAGE_ALLOC_COSTLY_ORDER for archs that
> > don't override arch_has_hw_pte_young(), or if its return value is too
> > large to fit.
> > This should also take care of the regression, right?
>
> I think you are suggesting that we use 0 as a sentinel which we then translate
> to PAGE_ALLOC_COSTLY_ORDER? I already have a max_anon_folio_order() function in
> memory.c (actually it is currently a macro defined as arch_wants_pte_order()).
>
> So it would become (I'll talk about the vma concern separately in the thread
> where you raised it):
>
> static inline int max_anon_folio_order(struct vm_area_struct *vma)
> {
>         int order = arch_wants_pte_order(vma);
>
>         return order ? order : PAGE_ALLOC_COSTLY_ORDER;
> }
>
> Correct?
>
> I don't see how it fixes the regression (assume you're talking about
> Speedometer) though? On arm64 arch_wants_pte_order() will still be returning
> order-4.

Here is what I was actually suggesting -- I think the problem is that
the contpte size is a bit too large for that benchmark, and for the
page allocator too, unfortunately. The following allows one retry
(32KB) before falling back to order 0 when using contpte (64KB). There
is no retry for HPA (16KB) or other archs.

+       int preferred = arch_wants_pte_order(vma) ? : PAGE_ALLOC_COSTLY_ORDER;
+       int orders[] = {
+               preferred,
+               preferred > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
+               0,
+       };
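
To spell out how orders[] resolves (my reading of the snippet above, with
PAGE_ALLOC_COSTLY_ORDER == 3):

  contpte (arch_wants_pte_order() == 4, 64KB): { 4, 3, 0 } -> one retry at 32KB
  HPA     (arch_wants_pte_order() == 2, 16KB): { 2, 0, 0 } -> no retry
  default (arch_wants_pte_order() == 0):       { 3, 0, 0 } -> 32KB, no retry

The loops in the attached patch stop at the first zero entry, so a zero simply
means "fall back to order-0".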

I'm attaching a patch which fills in the two helpers I left empty here [1].

Would the above work for Intel, Fengwei?

(AMD wouldn't need to override arch_wants_pte_order() since PTE
coalescing on Zen is also PAGE_ALLOC_COSTLY_ORDER.)

[1] https://lore.kernel.org/linux-mm/CAOUHufaK82K8Sa35T7z3=gkm4GB0cWD3aqeZF6mYx82v7cOTeA@mail.gmail.com/2-anon_folios.patch

[-- Attachment #2: fallback.patch --]
[-- Type: application/octet-stream, Size: 1950 bytes --]

diff --git a/mm/memory.c b/mm/memory.c
index f69fbc251198..c19cbba60d04 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4023,6 +4023,75 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	return ret;
 }
 
+static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
+{
+	int i;
+
+	if (nr_pages == 1)
+		return vmf_pte_changed(vmf);
+
+	for (i = 0; i < nr_pages; i++) {
+		if (!pte_none(ptep_get_lockless(vmf->pte + i)))
+			return true;
+	}
+
+	return false;
+}
+
+#ifdef CONFIG_FLEXIBLE_THP
+static struct folio *alloc_anon_folio(struct vm_fault *vmf)
+{
+	int i;
+	unsigned long addr;
+	struct vm_area_struct *vma = vmf->vma;
+	int preferred = arch_wants_pte_order(vma) ? : PAGE_ALLOC_COSTLY_ORDER;
+	int orders[] = {
+		preferred,
+		preferred > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
+		0,
+	};
+
+	if (vmf_orig_pte_uffd_wp(vmf))
+		goto fallback;
+
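+	/* Find the largest order whose aligned range fits within the VMA. */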
+	for (i = 0; orders[i]; i++) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
+		if (addr >= vma->vm_start && addr + (PAGE_SIZE << orders[i]) <= vma->vm_end)
+			break;
+	}
+
+	if (!orders[i])
+		goto fallback;
+
+	vmf->pte = pte_offset_map(vmf->pmd, addr);
+
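+	/* Of those, pick the largest order whose PTE range is still empty. */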
+	for (; orders[i]; i++) {
+		if (!vmf_pte_range_changed(vmf, 1 << orders[i]))
+			break;
+	}
+
+	pte_unmap(vmf->pte);
+	vmf->pte = NULL;
+
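+	/* Try to allocate, falling back to smaller orders on failure. */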
+	for (; orders[i]; i++) {
+		struct folio *folio;
+		gfp_t gfp = vma_thp_gfp_mask(vma);
+
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
+		folio = vma_alloc_folio(gfp, orders[i], vma, addr, true);
+		if (folio) {
+			clear_huge_page(&folio->page, addr, 1 << orders[i]);
+			vmf->address = addr;
+			return folio;
+		}
+	}
+fallback:
+	return vma_alloc_zeroed_movable_folio(vma, vmf->address);
+}
+#else
+#define alloc_anon_folio(vmf) vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address)
+#endif
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.

^ permalink raw reply related	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
  2023-07-04 13:23             ` Ryan Roberts
@ 2023-07-05  1:40               ` Yu Zhao
  -1 siblings, 0 replies; 167+ messages in thread
From: Yu Zhao @ 2023-07-05  1:40 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Yin, Fengwei, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On Tue, Jul 4, 2023 at 7:23 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 04/07/2023 13:36, Ryan Roberts wrote:
> > On 04/07/2023 04:59, Yu Zhao wrote:
> >> On Mon, Jul 3, 2023 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote:
> >>>
> >>> On Mon, Jul 3, 2023 at 8:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 7/3/2023 9:53 PM, Ryan Roberts wrote:
> >>>>> arch_wants_pte_order() can be overridden by the arch to return the
> >>>>> preferred folio order for pte-mapped memory. This is useful as some
> >>>>> architectures (e.g. arm64) can coalesce TLB entries when the physical
> >>>>> memory is suitably contiguous.
> >>>>>
> >>>>> The first user for this hint will be FLEXIBLE_THP, which aims to
> >>>>> allocate large folios for anonymous memory to reduce page faults and
> >>>>> other per-page operation costs.
> >>>>>
> >>>>> Here we add the default implementation of the function, used when the
> >>>>> architecture does not define it, which returns the order corresponding
> >>>>> to 64K.
> >>>>>
> >>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >>>>> ---
> >>>>>  include/linux/pgtable.h | 13 +++++++++++++
> >>>>>  1 file changed, 13 insertions(+)
> >>>>>
> >>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >>>>> index a661a17173fa..f7e38598f20b 100644
> >>>>> --- a/include/linux/pgtable.h
> >>>>> +++ b/include/linux/pgtable.h
> >>>>> @@ -13,6 +13,7 @@
> >>>>>  #include <linux/errno.h>
> >>>>>  #include <asm-generic/pgtable_uffd.h>
> >>>>>  #include <linux/page_table_check.h>
> >>>>> +#include <linux/sizes.h>
> >>>>>
> >>>>>  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
> >>>>>       defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
> >>>>> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void)
> >>>>>  }
> >>>>>  #endif
> >>>>>
> >>>>> +#ifndef arch_wants_pte_order
> >>>>> +/*
> >>>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
> >>>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
> >>>>> + * to be at least order-2.
> >>>>> + */
> >>>>> +static inline int arch_wants_pte_order(struct vm_area_struct *vma)
> >>>>> +{
> >>>>> +     return ilog2(SZ_64K >> PAGE_SHIFT);
> >>>> Default value which is not related with any silicon may be: PAGE_ALLOC_COSTLY_ORDER?
> >>>>
> >>>> Also, current pcp list support cache page with order 0...PAGE_ALLOC_COSTLY_ORDER, 9.
> >>>> If the pcp could cover the page, the pressure to zone lock will be reduced by pcp.
> >>>
> >>> The value of PAGE_ALLOC_COSTLY_ORDER is reasonable but again it's a
> >>> s/w policy not a h/w preference. Besides, I don't think we can include
> >>> mmzone.h in pgtable.h.
> >>
> >> I think we can make a compromise:
> >> 1. change the default implementation of arch_has_hw_pte_young() to return 0, and
> >> 2. in memory.c, we can try PAGE_ALLOC_COSTLY_ORDER for archs that
> >> don't override arch_has_hw_pte_young(), or if its return value is too
> >> large to fit.
> >> This should also take care of the regression, right?
> >
> > I think you are suggesting that we use 0 as a sentinel which we then translate
> > to PAGE_ALLOC_COSTLY_ORDER? I already have a max_anon_folio_order() function in
> > memory.c (actually it is currently a macro defined as arch_wants_pte_order()).
> >
> > So it would become (I'll talk about the vma concern separately in the thread
> > where you raised it):
> >
> > static inline int max_anon_folio_order(struct vm_area_struct *vma)
> > {
> >       int order = arch_wants_pte_order(vma);
> >
> >       return order ? order : PAGE_ALLOC_COSTLY_ORDER;
> > }
> >
> > Correct?
>
> Actually, I'm not sure its a good idea to default to a fixed order. If running
> on an arch with big base pages (e.g. powerpc with 64K pages?), that will soon
> add up to a big chunk of memory, which could be wasteful?
>
> PAGE_ALLOC_COSTLY_ORDER = 3, so with a 64K base page that's 512K. Is that a concern?
> Wouldn't it be better to define this as an absolute size? Or even the min of
> PAGE_ALLOC_COSTLY_ORDER and an absolute size?

From my POV, not at all. POWER could use smaller page sizes if they
wanted to -- I don't think they do: at least the distros I use on my
POWER9 all have THP=always by default (2MB).
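
For illustration, the "min of PAGE_ALLOC_COSTLY_ORDER and an absolute size"
idea Ryan floats above might look something like the sketch below (purely
hypothetical, not part of the posted series; the 256K cap is an arbitrary
example value):

static inline int max_anon_folio_order(struct vm_area_struct *vma)
{
	/* Hypothetical absolute cap so large base pages don't inflate the folio size. */
	int cap = ilog2(SZ_256K >> PAGE_SHIFT);
	int order = arch_wants_pte_order(vma);

	if (!order)
		order = PAGE_ALLOC_COSTLY_ORDER;

	return min(order, cap);
}

With 4K base pages this still allows order-3 (32K); with 64K base pages it
clamps the default from order-3 (512K) down to order-2 (256K).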

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
  2023-07-04 13:20       ` Ryan Roberts
@ 2023-07-05  2:07         ` Yu Zhao
  -1 siblings, 0 replies; 167+ messages in thread
From: Yu Zhao @ 2023-07-05  2:07 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On Tue, Jul 4, 2023 at 7:20 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 03/07/2023 20:50, Yu Zhao wrote:
> > On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> arch_wants_pte_order() can be overridden by the arch to return the
> >> preferred folio order for pte-mapped memory. This is useful as some
> >> architectures (e.g. arm64) can coalesce TLB entries when the physical
> >> memory is suitably contiguous.
> >>
> >> The first user for this hint will be FLEXIBLE_THP, which aims to
> >> allocate large folios for anonymous memory to reduce page faults and
> >> other per-page operation costs.
> >>
> >> Here we add the default implementation of the function, used when the
> >> architecture does not define it, which returns the order corresponding
> >> to 64K.
> >
> > I don't really mind a non-zero default value. But people would ask why
> > non-zero and why 64KB. Probably you could argue this is the large size
> > all known archs support if they have TLB coalescing. For x86, AMD CPUs
> > would want to override this. I'll leave it to Fengwei to decide
> > whether Intel wants a different default value.>
> > Also I don't like the vma parameter because it makes
> > arch_wants_pte_order() a mix of hw preference and vma policy. From my
> > POV, the function should be only about the former; the latter should
> > be decided by arch-independent MM code. However, I can live with it if
> > ARM MM people think this is really what you want. ATM, I'm skeptical
> > they do.
>
> Here's the big picture for what I'm tryng to achieve:
>
>  - In the common case, I'd like all programs to get a performance bump by
> automatically and transparently using large anon folios - so no explicit
> requirement on the process to opt-in.

We all agree on this :)

>  - On arm64, in the above case, I'd like the preferred folio size to be 64K;
> from the (admittedly limitted) testing I've done that's about where the
> performance knee is and it doesn't appear to increase the memory wastage very
> much. It also has the benefits that for 4K base pages this is the contpte size
> (order-4) so I can take full benefit of contpte mappings transparently to the
> process. And for 16K this is the HPA size (order-2).

My highest priority is to get 16KB proven first because it would
benefit both client and server devices. So it may be different from
yours but I don't see any conflict.

>  - On arm64 when the process has marked the VMA for THP (or when
> transparent_hugepage=always) but the VMA does not meet the requirements for a
> PMD-sized mapping (or we failed to allocate, ...) then I'd like to map using
> contpte. For 4K base pages this is 64K (order-4), for 16K this is 2M (order-7)
> and for 64K this is 2M (order-5). The 64K base page case is very important since
> the PMD size for that base page is 512MB which is almost impossible to allocate
> in practice.

Which case (server or client) are you focusing on here? For our client
devices, I can confidently say that 64KB has to be after 16KB, if it
happens at all. For servers in general, I don't know of any major
memory-intensive workloads that are not THP-aware, i.e., I don't think
"VMA does not meet the requirements" is a concern.

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
  2023-07-05  1:23             ` Yu Zhao
@ 2023-07-05  2:18               ` Yin Fengwei
  -1 siblings, 0 replies; 167+ messages in thread
From: Yin Fengwei @ 2023-07-05  2:18 UTC (permalink / raw)
  To: Yu Zhao, Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm



On 7/5/23 09:23, Yu Zhao wrote:
> On Tue, Jul 4, 2023 at 6:36 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 04/07/2023 04:59, Yu Zhao wrote:
>>> On Mon, Jul 3, 2023 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote:
>>>>
>>>> On Mon, Jul 3, 2023 at 8:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 7/3/2023 9:53 PM, Ryan Roberts wrote:
>>>>>> arch_wants_pte_order() can be overridden by the arch to return the
>>>>>> preferred folio order for pte-mapped memory. This is useful as some
>>>>>> architectures (e.g. arm64) can coalesce TLB entries when the physical
>>>>>> memory is suitably contiguous.
>>>>>>
>>>>>> The first user for this hint will be FLEXIBLE_THP, which aims to
>>>>>> allocate large folios for anonymous memory to reduce page faults and
>>>>>> other per-page operation costs.
>>>>>>
>>>>>> Here we add the default implementation of the function, used when the
>>>>>> architecture does not define it, which returns the order corresponding
>>>>>> to 64K.
>>>>>>
>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>> ---
>>>>>>  include/linux/pgtable.h | 13 +++++++++++++
>>>>>>  1 file changed, 13 insertions(+)
>>>>>>
>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>> index a661a17173fa..f7e38598f20b 100644
>>>>>> --- a/include/linux/pgtable.h
>>>>>> +++ b/include/linux/pgtable.h
>>>>>> @@ -13,6 +13,7 @@
>>>>>>  #include <linux/errno.h>
>>>>>>  #include <asm-generic/pgtable_uffd.h>
>>>>>>  #include <linux/page_table_check.h>
>>>>>> +#include <linux/sizes.h>
>>>>>>
>>>>>>  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
>>>>>>       defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
>>>>>> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void)
>>>>>>  }
>>>>>>  #endif
>>>>>>
>>>>>> +#ifndef arch_wants_pte_order
>>>>>> +/*
>>>>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
>>>>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
>>>>>> + * to be at least order-2.
>>>>>> + */
>>>>>> +static inline int arch_wants_pte_order(struct vm_area_struct *vma)
>>>>>> +{
>>>>>> +     return ilog2(SZ_64K >> PAGE_SHIFT);
>>>>> Default value which is not related with any silicon may be: PAGE_ALLOC_COSTLY_ORDER?
>>>>>
>>>>> Also, current pcp list support cache page with order 0...PAGE_ALLOC_COSTLY_ORDER, 9.
>>>>> If the pcp could cover the page, the pressure to zone lock will be reduced by pcp.
>>>>
>>>> The value of PAGE_ALLOC_COSTLY_ORDER is reasonable but again it's a
>>>> s/w policy not a h/w preference. Besides, I don't think we can include
>>>> mmzone.h in pgtable.h.
>>>
>>> I think we can make a compromise:
>>> 1. change the default implementation of arch_has_hw_pte_young() to return 0, and
>>> 2. in memory.c, we can try PAGE_ALLOC_COSTLY_ORDER for archs that
>>> don't override arch_has_hw_pte_young(), or if its return value is too
>>> large to fit.
>>> This should also take care of the regression, right?
>>
>> I think you are suggesting that we use 0 as a sentinel which we then translate
>> to PAGE_ALLOC_COSTLY_ORDER? I already have a max_anon_folio_order() function in
>> memory.c (actually it is currently a macro defined as arch_wants_pte_order()).
>>
>> So it would become (I'll talk about the vma concern separately in the thread
>> where you raised it):
>>
>> static inline int max_anon_folio_order(struct vm_area_struct *vma)
>> {
>>         int order = arch_wants_pte_order(vma);
>>
>>         return order ? order : PAGE_ALLOC_COSTLY_ORDER;
>> }
>>
>> Correct?
>>
>> I don't see how it fixes the regression (assume you're talking about
>> Speedometer) though? On arm64 arch_wants_pte_order() will still be returning
>> order-4.
> 
> Here is what I was actually suggesting -- I think the problem was
> because contpte is a bit too large for that benchmark and for the page
> allocator too, unfortunately. The following allows one retry (32KB)
> before fallback to order 0 when using contpte (64KB). There is no
> retry for HPA (16KB) and other archs.
> 
> +       int preferred = arch_wants_pte_order(vma) ? : PAGE_ALLOC_COSTLY_ORDER;
> +       int orders[] = {
> +               preferred,
> +               preferred > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
> +               0,
> +       };
> 
> I'm attaching a patch which fills in the two helpers I left empty here [1].
> 
> Would the above work for Intel, Fengwei?
PAGE_ALLOC_COSTLY_ORDER is what Intel prefers because it fits the most common
Intel systems. So yes, this works for Intel.


Regards
Yin, Fengwei

> 
> (AMD wouldn't need to override arch_wants_pte_order() since PTE
> coalescing on Zen is also PAGE_ALLOC_COSTLY_ORDER.)
> 
> [1] https://lore.kernel.org/linux-mm/CAOUHufaK82K8Sa35T7z3=gkm4GB0cWD3aqeZF6mYx82v7cOTeA@mail.gmail.com/2-anon_folios.patch

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
  2023-07-05  2:07         ` Yu Zhao
@ 2023-07-05  9:11           ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-05  9:11 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On 05/07/2023 03:07, Yu Zhao wrote:
> On Tue, Jul 4, 2023 at 7:20 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 03/07/2023 20:50, Yu Zhao wrote:
>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> arch_wants_pte_order() can be overridden by the arch to return the
>>>> preferred folio order for pte-mapped memory. This is useful as some
>>>> architectures (e.g. arm64) can coalesce TLB entries when the physical
>>>> memory is suitably contiguous.
>>>>
>>>> The first user for this hint will be FLEXIBLE_THP, which aims to
>>>> allocate large folios for anonymous memory to reduce page faults and
>>>> other per-page operation costs.
>>>>
>>>> Here we add the default implementation of the function, used when the
>>>> architecture does not define it, which returns the order corresponding
>>>> to 64K.
>>>
>>> I don't really mind a non-zero default value. But people would ask why
>>> non-zero and why 64KB. Probably you could argue this is the largest size
>>> all known archs support if they have TLB coalescing. For x86, AMD CPUs
>>> would want to override this. I'll leave it to Fengwei to decide
>>> whether Intel wants a different default value.
>>>
>>> Also I don't like the vma parameter because it makes
>>> arch_wants_pte_order() a mix of hw preference and vma policy. From my
>>> POV, the function should be only about the former; the latter should
>>> be decided by arch-independent MM code. However, I can live with it if
>>> ARM MM people think this is really what you want. ATM, I'm skeptical
>>> they do.
>>
>> Here's the big picture for what I'm trying to achieve:
>>
>>  - In the common case, I'd like all programs to get a performance bump by
>> automatically and transparently using large anon folios - so no explicit
>> requirement on the process to opt-in.
> 
> We all agree on this :)
> 
>>  - On arm64, in the above case, I'd like the preferred folio size to be 64K;
>> from the (admittedly limited) testing I've done that's about where the
>> performance knee is and it doesn't appear to increase the memory wastage very
>> much. It also has the benefits that for 4K base pages this is the contpte size
>> (order-4) so I can take full benefit of contpte mappings transparently to the
>> process. And for 16K this is the HPA size (order-2).
> 
> My highest priority is to get 16KB proven first because it would
> benefit both client and server devices. So it may be different from
> yours but I don't see any conflict.

Do you mean 16K folios on a 4K base page system, or large folios on a 16K base
page system? I thought your focus was on speeding up 4K base page client systems
but this statement has got me wondering?

> 
>>  - On arm64 when the process has marked the VMA for THP (or when
>> transparent_hugepage=always) but the VMA does not meet the requirements for a
>> PMD-sized mapping (or we failed to allocate, ...) then I'd like to map using
>> contpte. For 4K base pages this is 64K (order-4), for 16K this is 2M (order-7)
>> and for 64K this is 2M (order-5). The 64K base page case is very important since
>> the PMD size for that base page is 512MB which is almost impossible to allocate
>> in practice.
> 
> Which case (server or client) are you focusing on here? For our client
> devices, I can confidently say that 64KB has to be after 16KB, if it
> happens at all. For servers in general, I don't know of any major
> memory-intensive workloads that are not THP-aware, i.e., I don't think
> "VMA does not meet the requirements" is a concern.

For the 64K base page case, the focus is server. The problem reported by our
partner is that the 512M huge page size is too big to reliably allocate and so
the faults always fall back to 64K base pages in practice. I would also speculate
(happy to be proved wrong) that there are many THP-aware workloads that assume
the THP size is 2M. In this case, their VMAs may well be too small to fit a 512M
huge page when running on 64K base page system.

But the TL;DR is that Arm has a partner for which enabling 2M THP on a 64K base
page system is a very real requirement. Our intent is that this will be the
mechanism we use to enable it.
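
For reference, the arithmetic behind the sizes quoted above (assuming the
standard arm64 contiguous-bit spans of 16, 128 and 32 PTEs for the 4K, 16K
and 64K granules):

   4K base pages: PMD =  512 *  4K =   2M; contpte =  16 *  4K =  64K (order-4)
  16K base pages: PMD = 2048 * 16K =  32M; contpte = 128 * 16K =   2M (order-7)
  64K base pages: PMD = 8192 * 64K = 512M; contpte =  32 * 64K =   2M (order-5)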





* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-04 23:57         ` Matthew Wilcox
@ 2023-07-05  9:54           ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-05  9:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Yin, Fengwei, Andrew Morton, Kirill A. Shutemov,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On 05/07/2023 00:57, Matthew Wilcox wrote:
> On Tue, Jul 04, 2023 at 03:20:35PM +0100, Ryan Roberts wrote:
>> On 04/07/2023 04:45, Yin, Fengwei wrote:
>>>
>>> On 7/3/2023 9:53 PM, Ryan Roberts wrote:
>>>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
>>> THP is for huge page which is 2M size. We are not huge page here. But
>>> I don't have good name either.
>>
>> Is that really true? On arm64 with 16K pages, huge pages are 32M and with 64K
>> base page, they are 512M. So huge pages already have a variable size. And they
>> sometimes get PTE-mapped. So can't we just think of this as an extension of the
>> THP feature?
> 
> The confusing thing is that we have counters for the number of THP
> allocated (and number of THP mapped), and for those we always use
> PMD-size folios.

OK fair point. I really don't have a strong opinion on the name - I changed it
from LARGE_ANON_FOLIO because Yu was suggesting it should be tied to THP. So I'm
happy to change it back to LARGE_ANON_FOLIO (or something else) if that's the
consensus. But I expect I'll end up in a game of ping-pong. So I'm going to keep
this name for now and focus on converging the actual implementation to something
that is agreeable. Once we are there, we can argue about the name.

> 
> If we must have a config option, then this is ANON_LARGE_FOLIOS.
> 
> But why do we need a config option?  We don't have one for the
> page cache, and we're better off for it.  Yes, it depends on
> CONFIG_TRANSPARENT_HUGEPAGE today, but that's more of an accidental
> heritage, and it'd be great to do away with that dependency eventually.
> 
> Hardware support isn't needed.  Large folios benefit us from a software
> point of view.  if we need a chicken bit, we can edit the source code
> to not create anon folios larger than order 0.

From my PoV it's about managing risk; there are currently parts of the mm that
will interact poorly with large pte-mapped folios (madvise, compaction, ...). We
want to incrementally fix that stuff, but until it's all fixed, we can't deploy
this as always-on. Further down the line when things are more complete and there
is more test coverage, we could remove the Kconfig or default it to enabled.



* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
  2023-07-05  0:21             ` Yu Zhao
@ 2023-07-05 10:16               ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-05 10:16 UTC (permalink / raw)
  To: Yu Zhao, Yin Fengwei
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On 05/07/2023 01:21, Yu Zhao wrote:
> On Tue, Jul 4, 2023 at 5:53 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>>
>>
>>
>> On 7/4/23 23:36, Ryan Roberts wrote:
>>> On 04/07/2023 08:11, Yu Zhao wrote:
>>>> On Tue, Jul 4, 2023 at 12:22 AM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>>>
>>>>> On 7/4/2023 10:18 AM, Yu Zhao wrote:
>>>>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> This is v2 of a series to implement variable order, large folios for anonymous
>>>>>>> memory. The objective of this is to improve performance by allocating larger
>>>>>>> chunks of memory during anonymous page faults. See [1] for background.
>>>>>>
>>>>>> Thanks for the quick response!
>>>>>>
>>>>>>> I've significantly reworked and simplified the patch set based on comments from
>>>>>>> Yu Zhao (thanks for all your feedback!). I've also renamed the feature to
>>>>>>> VARIABLE_THP, on Yu's advice.
>>>>>>>
>>>>>>> The last patch is for arm64 to explicitly override the default
>>>>>>> arch_wants_pte_order() and is intended as an example. If this series is accepted
>>>>>>> I suggest taking the first 4 patches through the mm tree and the arm64 change
>>>>>>> could be handled through the arm64 tree separately. Neither has any build
>>>>>>> dependency on the other.
>>>>>>>
>>>>>>> The one area where I haven't followed Yu's advice is in the determination of the
>>>>>>> size of folio to use. It was suggested that I have a single preferred large
>>>>>>> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there
>>>>>>> being existing overlapping populated PTEs, etc) then fallback immediately to
>>>>>>> order-0. It turned out that this approach caused a performance regression in the
>>>>>>> Speedometer benchmark.
>>>>>>
>>>>>> I suppose it's regression against the v1, not the unpatched kernel.
>>>>> From the performance data Ryan shared, it's against unpatched kernel:
>>>>>
>>>>> Speedometer 2.0:
>>>>>
>>>>> | kernel                         |   runs_per_min |
>>>>> |:-------------------------------|---------------:|
>>>>> | baseline-4k                    |           0.0% |
>>>>> | anonfolio-lkml-v1              |           0.7% |
>>>>> | anonfolio-lkml-v2-simple-order |          -0.9% |
>>>>> | anonfolio-lkml-v2              |           0.5% |
>>>>
>>>> I see. Thanks.
>>>>
>>>> A couple of questions:
>>>> 1. Do we have a stddev?
>>>
>>> | kernel                    |   mean_abs |   std_abs |   mean_rel |   std_rel |
>>> |:------------------------- |-----------:|----------:|-----------:|----------:|
>>> | baseline-4k               |      117.4 |       0.8 |       0.0% |      0.7% |
>>> | anonfolio-v1              |      118.2 |         1 |       0.7% |      0.9% |
>>> | anonfolio-v2-simple-order |      116.4 |       1.1 |      -0.9% |      0.9% |
>>> | anonfolio-v2              |        118 |       1.2 |       0.5% |      1.0% |
>>>
>>> This is with 3 runs per reboot across 5 reboots, with first run after reboot
>>> trimmed (it's always a bit slower, I assume due to cold page cache). So 10 data
>>> points per kernel in total.
>>>
>>> I've rerun the test multiple times and see similar results each time.
>>>
>>> I've also run anonfolio-v2 with Kconfig FLEXIBLE_THP=disabled and in this case I
>>> see the same performance as baseline-4k.
>>>
>>>
>>>> 2. Do we have a theory why it regressed?
>>>
>>> I have a woolly hypothesis; I think Chromium is doing mmap/munmap in ways that
>>> mean when we fault, order-4 is often too big to fit in the VMA. So we fallback
>>> to order-0. I guess this is happening so often for this workload that the cost
>>> of doing the checks and fallback is outweighing the benefit of the memory that
>>> does end up with order-4 folios.
>>>
>>> I've sampled the memory in each bucket (once per second) while running and its
>>> roughly:
>>>
>>> 64K: 25%
>>> 32K: 15%
>>> 16K: 15%
>>> 4K: 45%
>>>
>>> 32K and 16K obviously fold into the 4K bucket with anonfolio-v2-simple-order.
>>> But potentially, I suspect there is lots of mmap/munmap for the smaller sizes and
>>> the 64K contents is more static - that's just a guess though.
>> So this is like an out-of-VMA-range thing.
>>
>>>
>>>> Assuming no bugs, I don't see how a real regression could happen --
>>>> falling back to order-0 isn't different from the original behavior.
>>>> Ryan, could you `perf record` and `cat /proc/vmstat` and share them?
>>>
>>> I can, but it will have to be a bit later in the week. I'll do some more test
>>> runs overnight so we have a larger number of runs - hopefully that might tell us
>>> that this is noise to a certain extent.
>>>
>>> I'd still like to hear a clear technical argument for why the bin-packing
>>> approach is not the correct one!
>> My understanding of Yu's comments (Yu, correct me if I am wrong) is that we
>> postpone this part of the change and get basic anon large folio support in first.
>> Then we can discuss which approach to take. Maybe people will agree retry is the
>> choice, maybe another approach will be taken...
>>
>> For example, for this out-of-VMA-range case, a per-VMA order should be considered.
>> We don't need to decide now that the retry approach should be taken.
> 
> I've articulated the reasons in another email. To summarize the most
> important point here:
> using more fallback orders makes a system reach equilibrium faster, at
> which point it can't allocate the order of arch_wants_pte_order()
> anymore. IOW, this best-fit policy can reduce the number of folios of
> the h/w preferred order for a system running long enough.

Thanks for taking the time to write all the arguments down. I understand what
you are saying. If we are considering the whole system, then we also need to
think about the page cache though, and that will allocate multiple orders, so
you are still going to suffer fragmentation from that user.

That said, I like the proposal patch you posted where we have up to 3 orders that
we try in order of preference: hw-preferred, PAGE_ALLOC_COSTLY_ORDER and 0. That
feels like a good compromise that allows me to fulfil my objectives. I'm going
to pull this together into a v3 patch set and aim to post towards the end of the
week.
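
For concreteness, the rough shape I have in mind for the fault path is the
sketch below. Function and helper names are illustrative only;
anon_folio_fits() is a placeholder for the VMA-bounds / overlapping-PTE
checks, and using vma_alloc_folio() with GFP_HIGHUSER_MOVABLE is just one way
the allocation could be done:

static struct folio *alloc_anon_folio(struct vm_area_struct *vma,
				      unsigned long addr)
{
	int preferred = arch_wants_pte_order(vma) ? : PAGE_ALLOC_COSTLY_ORDER;
	int orders[] = {
		preferred,
		preferred > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
		0,
	};
	int i;

	for (i = 0; i < ARRAY_SIZE(orders); i++) {
		int order = orders[i];
		struct folio *folio;

		/*
		 * Placeholder check: skip orders that would exceed the VMA
		 * bounds or overlap already-populated PTEs.
		 */
		if (order && !anon_folio_fits(vma, addr, order))
			continue;

		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, order, vma,
					addr & ~((PAGE_SIZE << order) - 1), true);
		if (folio)
			return folio;
	}

	return NULL;
}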

Are you ok for me to add a Suggested-by: for you? (submitting-patches.rst says I
need your explicit permission).

On the regression front, I've done a much bigger test run and see the regression
is still present (although the mean has shifted a little bit). I've also built a
kernel based on anonfolio-lkml-v2 but where arch_wants_pte_order() returns
order-3. The aim was to test your hypothesis that 64K allocation is slow. This
kernel is performing even better, so I think that confirms your hypothesis:

| kernel                         |   runs_per_min |   runs |   sessions |
|:-------------------------------|---------------:|-------:|-----------:|
| baseline-4k                    |           0.0% |     75 |         15 |
| anonfolio-lkml-v1              |           1.0% |     75 |         15 |
| anonfolio-lkml-v2-simple-order |          -0.4% |     75 |         15 |
| anonfolio-lkml-v2              |           0.9% |     75 |         15 |
| anonfolio-lkml-v2-32k          |           1.4% |     10 |          5 |

Thanks,
Ryan



* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-05  9:54           ` Ryan Roberts
@ 2023-07-05 12:08             ` Matthew Wilcox
  -1 siblings, 0 replies; 167+ messages in thread
From: Matthew Wilcox @ 2023-07-05 12:08 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Yin, Fengwei, Andrew Morton, Kirill A. Shutemov,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On Wed, Jul 05, 2023 at 10:54:30AM +0100, Ryan Roberts wrote:
> On 05/07/2023 00:57, Matthew Wilcox wrote:
> > The confusing thing is that we have counters for the number of THP
> > allocated (and number of THP mapped), and for those we always use
> > PMD-size folios.
> 
> OK fair point. I really don't have a strong opinion on the name - I changed it
> from LARGE_ANON_FOLIO because Yu was suggesting it should be tied to THP. So I'm
> happy to change it back to LARGE_ANON_FOLIO (or something else) if that's the
> consensus. But I expect I'll end up in a game of ping-pong. So I'm going to keep
> this name for now and focus on converging the actual implementation to something
> that is agreeable. Once we are there, we can argue about the name.

I didn't see Yu arguing for changing the name of the config options,
just having far fewer of them.

> > If we must have a config option, then this is ANON_LARGE_FOLIOS.
> > 
> > But why do we need a config option?  We don't have one for the
> > page cache, and we're better off for it.  Yes, it depends on
> > CONFIG_TRANSPARENT_HUGEPAGE today, but that's more of an accidental
> > heritage, and it'd be great to do away with that dependency eventually.
> > 
> > Hardware support isn't needed.  Large folios benefit us from a software
> > point of view.  if we need a chicken bit, we can edit the source code
> > to not create anon folios larger than order 0.
> 
> From my PoV it's about managing risk; there are currently parts of the mm that
> will interact poorly with large pte-mapped folios (madvise, compaction, ...). We
> want to incrementally fix that stuff, but until it's all fixed, we can't deploy
> this as always-on. Further down the line when things are more complete and there
> is more test coverage, we could remove the Kconfig or default it to enabled.

We have to fix those places with the bad interactions, not merge a
Kconfig option that lets you turn it on to experiment.  That's how you
get a bad reputation and advice to disable a config option.  We had that
for years with CONFIG_TRANSPARENT_HUGEPAGE; people tried it out early on,
found the performance problems, and all these years later we still have
articles being published that say to turn it off.

By all means, we can have a golden patchset that we all agree is the
one to use for finding problems, and we can merge the pre-enabling work
"We don't have large anonymous folios yet, but when we do, this will
need to iterate over each page in the folio".


* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
  2023-07-05  9:11           ` Ryan Roberts
@ 2023-07-05 17:24             ` Yu Zhao
  -1 siblings, 0 replies; 167+ messages in thread
From: Yu Zhao @ 2023-07-05 17:24 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On Wed, Jul 5, 2023 at 3:11 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 05/07/2023 03:07, Yu Zhao wrote:
> > On Tue, Jul 4, 2023 at 7:20 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 03/07/2023 20:50, Yu Zhao wrote:
> >>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> arch_wants_pte_order() can be overridden by the arch to return the
> >>>> preferred folio order for pte-mapped memory. This is useful as some
> >>>> architectures (e.g. arm64) can coalesce TLB entries when the physical
> >>>> memory is suitably contiguous.
> >>>>
> >>>> The first user for this hint will be FLEXIBLE_THP, which aims to
> >>>> allocate large folios for anonymous memory to reduce page faults and
> >>>> other per-page operation costs.
> >>>>
> >>>> Here we add the default implementation of the function, used when the
> >>>> architecture does not define it, which returns the order corresponding
> >>>> to 64K.
> >>>
> >>> I don't really mind a non-zero default value. But people would ask why
> >>> non-zero and why 64KB. Probably you could argue this is the largest size
> >>> all known archs support if they have TLB coalescing. For x86, AMD CPUs
> >>> would want to override this. I'll leave it to Fengwei to decide
> >>> whether Intel wants a different default value.
> >>>
> >>> Also I don't like the vma parameter because it makes
> >>> arch_wants_pte_order() a mix of hw preference and vma policy. From my
> >>> POV, the function should be only about the former; the latter should
> >>> be decided by arch-independent MM code. However, I can live with it if
> >>> ARM MM people think this is really what you want. ATM, I'm skeptical
> >>> they do.
> >>
> >> Here's the big picture for what I'm trying to achieve:
> >>
> >>  - In the common case, I'd like all programs to get a performance bump by
> >> automatically and transparently using large anon folios - so no explicit
> >> requirement on the process to opt-in.
> >
> > We all agree on this :)
> >
> >>  - On arm64, in the above case, I'd like the preferred folio size to be 64K;
> >> from the (admittedly limited) testing I've done that's about where the
> >> performance knee is and it doesn't appear to increase the memory wastage very
> >> much. It also has the benefits that for 4K base pages this is the contpte size
> >> (order-4) so I can take full benefit of contpte mappings transparently to the
> >> process. And for 16K this is the HPA size (order-2).
> >
> > My highest priority is to get 16KB proven first because it would
> > benefit both client and server devices. So it may be different from
> > yours but I don't see any conflict.
>
> Do you mean 16K folios on a 4K base page system

Yes.

> or large folios on a 16K base
> page system? I thought your focus was on speeding up 4K base page client systems
> but this statement has got me wondering?

Sorry, I should have said 4x4KB.

> >>  - On arm64 when the process has marked the VMA for THP (or when
> >> transparent_hugepage=always) but the VMA does not meet the requirements for a
> >> PMD-sized mapping (or we failed to allocate, ...) then I'd like to map using
> >> contpte. For 4K base pages this is 64K (order-4), for 16K this is 2M (order-7)
> >> and for 64K this is 2M (order-5). The 64K base page case is very important since
> >> the PMD size for that base page is 512MB which is almost impossible to allocate
> >> in practice.
> >
> > Which case (server or client) are you focusing on here? For our client
> > devices, I can confidently say that 64KB has to be after 16KB, if it
> > happens at all. For servers in general, I don't know of any major
> > memory-intensive workloads that are not THP-aware, i.e., I don't think
> > "VMA does not meet the requirements" is a concern.
>
> For the 64K base page case, the focus is server. The problem reported by our
> partner is that the 512M huge page size is too big to reliably allocate and so
> the faults always fall back to 64K base pages in practice. I would also speculate
> (happy to be proved wrong) that there are many THP-aware workloads that assume
> the THP size is 2M. In this case, their VMAs may well be too small to fit a 512M
> huge page when running on 64K base page system.

Interesting. When you have something ready to share, I might be able
to try it on our ARM servers as well.

> But the TL;DR is that Arm has a partner for which enabling 2M THP on a 64K base
> page system is a very real requirement. Our intent is that this will be the
> mechanism we use to enable it.

Yes, contpte makes more sense for what you described. It'd fit in a
lot better in the hugetlb case, but I guess your partner uses anon.


* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
  2023-07-05 17:24             ` Yu Zhao
@ 2023-07-05 18:01               ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-05 18:01 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On 05/07/2023 18:24, Yu Zhao wrote:
> On Wed, Jul 5, 2023 at 3:11 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 05/07/2023 03:07, Yu Zhao wrote:
>>> On Tue, Jul 4, 2023 at 7:20 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 03/07/2023 20:50, Yu Zhao wrote:
>>>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> arch_wants_pte_order() can be overridden by the arch to return the
>>>>>> preferred folio order for pte-mapped memory. This is useful as some
>>>>>> architectures (e.g. arm64) can coalesce TLB entries when the physical
>>>>>> memory is suitably contiguous.
>>>>>>
>>>>>> The first user for this hint will be FLEXIBLE_THP, which aims to
>>>>>> allocate large folios for anonymous memory to reduce page faults and
>>>>>> other per-page operation costs.
>>>>>>
>>>>>> Here we add the default implementation of the function, used when the
>>>>>> architecture does not define it, which returns the order corresponding
>>>>>> to 64K.
>>>>>
>>>>> I don't really mind a non-zero default value. But people would ask why
>>>>> non-zero and why 64KB. Probably you could argue this is the largest size
>>>>> all known archs support if they have TLB coalescing. For x86, AMD CPUs
>>>>> would want to override this. I'll leave it to Fengwei to decide
>>>>> whether Intel wants a different default value.
>>>>>
>>>>> Also I don't like the vma parameter because it makes
>>>>> arch_wants_pte_order() a mix of hw preference and vma policy. From my
>>>>> POV, the function should be only about the former; the latter should
>>>>> be decided by arch-independent MM code. However, I can live with it if
>>>>> ARM MM people think this is really what you want. ATM, I'm skeptical
>>>>> they do.
>>>>
>>>> Here's the big picture for what I'm trying to achieve:
>>>>
>>>>  - In the common case, I'd like all programs to get a performance bump by
>>>> automatically and transparently using large anon folios - so no explicit
>>>> requirement on the process to opt-in.
>>>
>>> We all agree on this :)
>>>
>>>>  - On arm64, in the above case, I'd like the preferred folio size to be 64K;
>>>> from the (admittedly limited) testing I've done that's about where the
>>>> performance knee is and it doesn't appear to increase the memory wastage very
>>>> much. It also has the benefits that for 4K base pages this is the contpte size
>>>> (order-4) so I can take full benefit of contpte mappings transparently to the
>>>> process. And for 16K this is the HPA size (order-2).
>>>
>>> My highest priority is to get 16KB proven first because it would
>>> benefit both client and server devices. So it may be different from
>>> yours but I don't see any conflict.
>>
>> Do you mean 16K folios on a 4K base page system
> 
> Yes.
> 
>> or large folios on a 16K base
>> page system? I thought your focus was on speeding up 4K base page client systems
>> but this statement has got me wondering?
> 
> Sorry, I should have said 4x4KB.

OK. Be aware that a number of Arm CPUs that support HPA don't have it enabled by
default (or at least don't have it enabled in the mode you would want for best
performance with large anon folios). You would need EL3 access to reconfigure it.

> 
>>>>  - On arm64 when the process has marked the VMA for THP (or when
>>>> transparent_hugepage=always) but the VMA does not meet the requirements for a
>>>> PMD-sized mapping (or we failed to allocate, ...) then I'd like to map using
>>>> contpte. For 4K base pages this is 64K (order-4), for 16K this is 2M (order-7)
>>>> and for 64K this is 2M (order-5). The 64K base page case is very important since
>>>> the PMD size for that base page is 512MB which is almost impossible to allocate
>>>> in practice.
>>>
>>> Which case (server or client) are you focusing on here? For our client
>>> devices, I can confidently say that 64KB has to be after 16KB, if it
>>> happens at all. For servers in general, I don't know of any major
>>> memory-intensive workloads that are not THP-aware, i.e., I don't think
>>> "VMA does not meet the requirements" is a concern.
>>
>> For the 64K base page case, the focus is server. The problem reported by our
>> partner is that the 512M huge page size is too big to reliably allocate and so
>> the faults always fall back to 64K base pages in practice. I would also speculate
>> (happy to be proved wrong) that there are many THP-aware workloads that assume
>> the THP size is 2M. In this case, their VMAs may well be too small to fit a 512M
>> huge page when running on 64K base page system.
> 
> Interesting. When you have something ready to share, I might be able
> to try it on our ARM servers as well.

That would be really helpful. I'm currently updating my branch that collates
everything to reflect the review comments in this patch set and the contpte
patch set. I'll share it in a couple of weeks.

> 
>> But the TL;DR is that Arm has a partner for which enabling 2M THP on a 64K base
>> page system is a very real requirement. Our intent is that this will be the
>> mechanism we use to enable it.
> 
> Yes, contpte makes more sense for what you described. It'd fit in a
> lot better in the hugetlb case, but I guess your partner uses anon.

arm64 already supports contpte for hugetlb, but they need it to work with anon
memory using THP.



^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
@ 2023-07-05 18:01               ` Ryan Roberts
  0 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-05 18:01 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On 05/07/2023 18:24, Yu Zhao wrote:
> On Wed, Jul 5, 2023 at 3:11 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 05/07/2023 03:07, Yu Zhao wrote:
>>> On Tue, Jul 4, 2023 at 7:20 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 03/07/2023 20:50, Yu Zhao wrote:
>>>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> arch_wants_pte_order() can be overridden by the arch to return the
>>>>>> preferred folio order for pte-mapped memory. This is useful as some
>>>>>> architectures (e.g. arm64) can coalesce TLB entries when the physical
>>>>>> memory is suitably contiguous.
>>>>>>
>>>>>> The first user for this hint will be FLEXIBLE_THP, which aims to
>>>>>> allocate large folios for anonymous memory to reduce page faults and
>>>>>> other per-page operation costs.
>>>>>>
>>>>>> Here we add the default implementation of the function, used when the
>>>>>> architecture does not define it, which returns the order corresponding
>>>>>> to 64K.
>>>>>
>>>>> I don't really mind a non-zero default value. But people would ask why
>>>>> non-zero and why 64KB. Probably you could argue this is the large size
>>>>> all known archs support if they have TLB coalescing. For x86, AMD CPUs
>>>>> would want to override this. I'll leave it to Fengwei to decide
>>>>> whether Intel wants a different default value.>
>>>>> Also I don't like the vma parameter because it makes
>>>>> arch_wants_pte_order() a mix of hw preference and vma policy. From my
>>>>> POV, the function should be only about the former; the latter should
>>>>> be decided by arch-independent MM code. However, I can live with it if
>>>>> ARM MM people think this is really what you want. ATM, I'm skeptical
>>>>> they do.
>>>>
>>>> Here's the big picture for what I'm tryng to achieve:
>>>>
>>>>  - In the common case, I'd like all programs to get a performance bump by
>>>> automatically and transparently using large anon folios - so no explicit
>>>> requirement on the process to opt-in.
>>>
>>> We all agree on this :)
>>>
>>>>  - On arm64, in the above case, I'd like the preferred folio size to be 64K;
>>>> from the (admittedly limitted) testing I've done that's about where the
>>>> performance knee is and it doesn't appear to increase the memory wastage very
>>>> much. It also has the benefits that for 4K base pages this is the contpte size
>>>> (order-4) so I can take full benefit of contpte mappings transparently to the
>>>> process. And for 16K this is the HPA size (order-2).
>>>
>>> My highest priority is to get 16KB proven first because it would
>>> benefit both client and server devices. So it may be different from
>>> yours but I don't see any conflict.
>>
>> Do you mean 16K folios on a 4K base page system
> 
> Yes.
> 
>> or large folios on a 16K base
>> page system? I thought your focus was on speeding up 4K base page client systems
>> but this statement has got me wondering?
> 
> Sorry, I should have said 4x4KB.

OK. Be aware that a number of Arm CPUs that support HPA don't have it enabled by
default (or at least don't have it enabled in the mode you would want in order to
see the best performance with large anon folios). You would need EL3 access to
reconfigure it.

> 
>>>>  - On arm64 when the process has marked the VMA for THP (or when
>>>> transparent_hugepage=always) but the VMA does not meet the requirements for a
>>>> PMD-sized mapping (or we failed to allocate, ...) then I'd like to map using
>>>> contpte. For 4K base pages this is 64K (order-4), for 16K this is 2M (order-7)
>>>> and for 64K this is 2M (order-5). The 64K base page case is very important since
>>>> the PMD size for that base page is 512MB which is almost impossible to allocate
>>>> in practice.
>>>
>>> Which case (server or client) are you focusing on here? For our client
>>> devices, I can confidently say that 64KB has to be after 16KB, if it
>>> happens at all. For servers in general, I don't know of any major
>>> memory-intensive workloads that are not THP-aware, i.e., I don't think
>>> "VMA does not meet the requirements" is a concern.
>>
>> For the 64K base page case, the focus is server. The problem reported by our
>> partner is that the 512M huge page size is too big to reliably allocate and so
>> the faults always fall back to 64K base pages in practice. I would also speculate
>> (happy to be proved wrong) that there are many THP-aware workloads that assume
>> the THP size is 2M. In this case, their VMAs may well be too small to fit a 512M
>> huge page when running on 64K base page system.
> 
> Interesting. When you have something ready to share, I might be able
> to try it on our ARM servers as well.

That would be really helpful. I'm currently updating my branch that collates
everything to reflect the review comments in this patch set and the contpte
patch set. I'll share it in a couple of weeks.

> 
>> But the TL;DR is that Arm has a partner for which enabling 2M THP on a 64K base
>> page system is a very real requirement. Our intent is that this will be the
>> mechanism we use to enable it.
> 
> Yes, contpte makes more sense for what you described. It'd fit in a
> lot better in the hugetlb case, but I guess your partner uses anon.

arm64 already supports contpte for hugetlb, but they need it to work with anon
memory using THP.
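
(Editorial aside: the orders quoted in this sub-thread follow directly from the
mapping span divided by the base page size. A minimal sketch, assuming only the
standard ilog2() helper and SZ_* size constants; span_order() is just a name for
this illustration, not an existing kernel function.)

    #include <linux/log2.h>
    #include <linux/sizes.h>

    /* Order of a mapping span expressed in base pages: ilog2(span / page size). */
    static inline int span_order(unsigned long span, unsigned long page_size)
    {
            return ilog2(span / page_size);
    }

    /*
     * span_order(SZ_64K, SZ_4K)  == 4   64K contpte span on 4K base pages
     * span_order(SZ_64K, SZ_16K) == 2   64K HPA span on 16K base pages
     * span_order(SZ_2M,  SZ_16K) == 7   2M contpte span on 16K base pages
     * span_order(SZ_2M,  SZ_64K) == 5   2M contpte span on 64K base pages
     */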




^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
  2023-07-05 10:16               ` Ryan Roberts
@ 2023-07-05 19:00                 ` Yu Zhao
  -1 siblings, 0 replies; 167+ messages in thread
From: Yu Zhao @ 2023-07-05 19:00 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Yin Fengwei, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

[-- Attachment #1: Type: text/plain, Size: 8903 bytes --]

On Wed, Jul 5, 2023 at 4:16 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 05/07/2023 01:21, Yu Zhao wrote:
> > On Tue, Jul 4, 2023 at 5:53 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
> >>
> >>
> >>
> >> On 7/4/23 23:36, Ryan Roberts wrote:
> >>> On 04/07/2023 08:11, Yu Zhao wrote:
> >>>> On Tue, Jul 4, 2023 at 12:22 AM Yin, Fengwei <fengwei.yin@intel.com> wrote:
> >>>>>
> >>>>> On 7/4/2023 10:18 AM, Yu Zhao wrote:
> >>>>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>>
> >>>>>>> Hi All,
> >>>>>>>
> >>>>>>> This is v2 of a series to implement variable order, large folios for anonymous
> >>>>>>> memory. The objective of this is to improve performance by allocating larger
> >>>>>>> chunks of memory during anonymous page faults. See [1] for background.
> >>>>>>
> >>>>>> Thanks for the quick response!
> >>>>>>
> >>>>>>> I've significantly reworked and simplified the patch set based on comments from
> >>>>>>> Yu Zhao (thanks for all your feedback!). I've also renamed the feature to
> >>>>>>> VARIABLE_THP, on Yu's advice.
> >>>>>>>
> >>>>>>> The last patch is for arm64 to explicitly override the default
> >>>>>>> arch_wants_pte_order() and is intended as an example. If this series is accepted
> >>>>>>> I suggest taking the first 4 patches through the mm tree and the arm64 change
> >>>>>>> could be handled through the arm64 tree separately. Neither has any build
> >>>>>>> dependency on the other.
> >>>>>>>
> >>>>>>> The one area where I haven't followed Yu's advice is in the determination of the
> >>>>>>> size of folio to use. It was suggested that I have a single preferred large
> >>>>>>> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there
> >>>>>>> being existing overlapping populated PTEs, etc) then fallback immediately to
> >>>>>>> order-0. It turned out that this approach caused a performance regression in the
> >>>>>>> Speedometer benchmark.
> >>>>>>
> >>>>>> I suppose it's a regression against v1, not the unpatched kernel.
> >>>>> From the performance data Ryan shared, it's against unpatched kernel:
> >>>>>
> >>>>> Speedometer 2.0:
> >>>>>
> >>>>> | kernel                         |   runs_per_min |
> >>>>> |:-------------------------------|---------------:|
> >>>>> | baseline-4k                    |           0.0% |
> >>>>> | anonfolio-lkml-v1              |           0.7% |
> >>>>> | anonfolio-lkml-v2-simple-order |          -0.9% |
> >>>>> | anonfolio-lkml-v2              |           0.5% |
> >>>>
> >>>> I see. Thanks.
> >>>>
> >>>> A couple of questions:
> >>>> 1. Do we have a stddev?
> >>>
> >>> | kernel                    |   mean_abs |   std_abs |   mean_rel |   std_rel |
> >>> |:------------------------- |-----------:|----------:|-----------:|----------:|
> >>> | baseline-4k               |      117.4 |       0.8 |       0.0% |      0.7% |
> >>> | anonfolio-v1              |      118.2 |         1 |       0.7% |      0.9% |
> >>> | anonfolio-v2-simple-order |      116.4 |       1.1 |      -0.9% |      0.9% |
> >>> | anonfolio-v2              |        118 |       1.2 |       0.5% |      1.0% |
> >>>
> >>> This is with 3 runs per reboot across 5 reboots, with first run after reboot
> >>> trimmed (it's always a bit slower, I assume due to cold page cache). So 10 data
> >>> points per kernel in total.
> >>>
> >>> I've rerun the test multiple times and see similar results each time.
> >>>
> >>> I've also run anonfolio-v2 with Kconfig FLEXIBLE_THP=disabled and in this case I
> >>> see the same performance as baseline-4k.
> >>>
> >>>
> >>>> 2. Do we have a theory why it regressed?
> >>>
> >>> I have a woolly hypothesis; I think Chromium is doing mmap/munmap in ways that
> >>> mean when we fault, order-4 is often too big to fit in the VMA. So we fallback
> >>> to order-0. I guess this is happening so often for this workload that the cost
> >>> of doing the checks and fallback is outweighing the benefit of the memory that
> >>> does end up with order-4 folios.
> >>>
> >>> I've sampled the memory in each bucket (once per second) while running and its
> >>> roughly:
> >>>
> >>> 64K: 25%
> >>> 32K: 15%
> >>> 16K: 15%
> >>> 4K: 45%
> >>>
> >>> 32K and 16K obviously fold into the 4K bucket with anonfolio-v2-simple-order.
> >>> But potentially, I suspect there is lots of mmap/munmap for the smaller sizes and
> >>> the 64K content is more static - that's just a guess though.
> >> So this sounds like an out-of-VMA-range thing.
> >>
> >>>
> >>>> Assuming no bugs, I don't see how a real regression could happen --
> >>>> falling back to order-0 isn't different from the original behavior.
> >>>> Ryan, could you `perf record` and `cat /proc/vmstat` and share them?
> >>>
> >>> I can, but it will have to be a bit later in the week. I'll do some more test
> >>> runs overnight so we have a larger number of runs - hopefully that might tell us
> >>> that this is noise to a certain extent.
> >>>
> >>> I'd still like to hear a clear technical argument for why the bin-packing
> >>> approach is not the correct one!
> >> My understanding of Yu's comments (Yu, correct me if I am wrong) is that we
> >> postpone this part of the change and get basic anon large folio support in first.
> >> Then discuss which approach we should take. Maybe people will agree retry is the
> >> choice, maybe another approach will be taken...
> >>
> >> For example, for this out-of-VMA-range case, a per-VMA order should be considered.
> >> We don't need to decide now that the retry approach should be taken.
> >
> > I've articulated the reasons in another email. Just summarize the most
> > important point here:
> > using more fallback orders makes a system reach equilibrium faster, at
> > which point it can't allocate the order of arch_wants_pte_order()
> > anymore. IOW, this best-fit policy can reduce the number of folios of
> > the h/w preferred order for a system running long enough.
>
> Thanks for taking the time to write all the arguments down. I understand what
> you are saying. If we are considering the whole system, then we also need to
> think about the page cache though, and that will allocate multiple orders, so
> you are still going to suffer fragmentation from that user.

1. page cache doesn't use the best-fit policy -- it has the advantage
of having RA hit/miss numbers -- IOW, it doesn't try all orders
without an estimated ROI.
2. page cache causes far less fragmentation in my experience: clean
page cache gets reclaimed first under memory pressure; unmapped page cache is
less costly to migrate. Neither is true for anon, and what makes it
worse is that heavy anon users usually enable zram/zswap: allocating
memory (to store compressed data) under memory pressure makes
reclaim/compaction even harder.

> That said, I like the proposal patch posted where we have up to 3 orders that we
> try in order of preference; hw-preferred, PAGE_ALLOC_COSTLY_ORDER and 0. That
> feels like a good compromise that allows me to fulfil my objectives. I'm going
> to pull this together into a v3 patch set and aim to post towards the end of the
> week.
>
> Are you ok for me to add a Suggested-by: for you? (submitting-patches.rst says I
> need your explicit permission).

Thanks for asking. No need to worry about it -- it's been a great team
work with you, Fengwei, Yang et al.

I'm attaching a single patch containing all pieces I spelled
out/implied/forgot to mention. It doesn't depend on other series -- I
just stress-tested it on top of the latest mm-unstable. Please feel
free to reuse any bits you see fit. Again no need to worry about
Suggested-by.

> On the regression front, I've done a much bigger test run and see the regression
> is still present (although the mean has shifted a little bit). I've also built a
> kernel based on anonfolio-lkml-v2 but where arch_wants_pte_order() returns
> order-3. The aim was to test your hypothesis that 64K allocation is slow. This
> kernel is performing even better, so I think that confirms your hypothesis:

Great, thanks for confirming.

> | kernel                         |   runs_per_min |   runs |   sessions |
> |:-------------------------------|---------------:|-------:|-----------:|
> | baseline-4k                    |           0.0% |     75 |         15 |
> | anonfolio-lkml-v1              |           1.0% |     75 |         15 |
> | anonfolio-lkml-v2-simple-order |          -0.4% |     75 |         15 |
> | anonfolio-lkml-v2              |           0.9% |     75 |         15 |
> | anonfolio-lkml-v2-32k          |           1.4% |     10 |          5 |

Since we are all committed to the effort long term, the last number is
good enough for the initial step to conclude. Hopefully v3 can address
all pending comments and get into mm-unstable.

[-- Attachment #2: large_anon.patch --]
[-- Type: application/octet-stream, Size: 7081 bytes --]

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5063b482e34f..113d35d993ce 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -313,6 +313,13 @@ static inline bool arch_has_hw_pte_young(void)
 }
 #endif
 
+#ifndef arch_wants_pte_order
+static inline int arch_wants_pte_order(void)
+{
+	return 0;
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				       unsigned long address,
diff --git a/mm/memory.c b/mm/memory.c
index cf4ae87b1563..238f1a2ffbff 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4059,6 +4059,81 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	return ret;
 }
 
+static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
+{
+	int i;
+
+	if (nr_pages == 1)
+		return vmf_pte_changed(vmf);
+
+	for (i = 0; i < nr_pages; i++) {
+		if (!pte_none(ptep_get_lockless(vmf->pte + i)))
+			return true;
+	}
+
+	return false;
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static struct folio *alloc_anon_folio(struct vm_fault *vmf)
+{
+	int i;
+	gfp_t gfp;
+	pte_t *pte;
+	unsigned long addr;
+	struct vm_area_struct *vma = vmf->vma;
+	int preferred = arch_wants_pte_order() ? : PAGE_ALLOC_COSTLY_ORDER;
+	int orders[] = {
+		preferred,
+		preferred > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
+		0,
+	};
+
+	if (vmf_orig_pte_uffd_wp(vmf))
+		goto fallback;
+
+	for (i = 0; orders[i]; i++) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
+		if (addr >= vma->vm_start && addr + (PAGE_SIZE << orders[i]) <= vma->vm_end)
+			break;
+	}
+
+	if (!orders[i])
+		goto fallback;
+
+	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
+	VM_WARN_ON_ONCE(vmf->pte);
+
+	for (; orders[i]; i++) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
+		vmf->pte = pte + pte_index(addr);
+		if (!vmf_pte_range_changed(vmf, 1 << orders[i]))
+			break;
+	}
+
+	vmf->pte = NULL;
+	pte_unmap(pte);
+
+	gfp = vma_thp_gfp_mask(vma);
+
+	for (; orders[i]; i++) {
+		struct folio *folio;
+
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
+		folio = vma_alloc_folio(gfp, orders[i], vma, addr, true);
+		if (folio) {
+			clear_huge_page(&folio->page, addr, 1 << orders[i]);
+			vmf->address = addr;
+			return folio;
+		}
+	}
+fallback:
+	return vma_alloc_zeroed_movable_folio(vma, vmf->address);
+}
+#else
+#define alloc_anon_folio(vmf) vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address)
+#endif
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -4066,6 +4141,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
  */
 static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 {
+	int i = 0;
+	int nr_pages = 1;
 	bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio;
@@ -4110,10 +4187,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	/* Allocate our own private page. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+	folio = alloc_anon_folio(vmf);
 	if (!folio)
 		goto oom;
 
+	nr_pages = folio_nr_pages(folio);
+
 	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
 		goto oom_free_page;
 	folio_throttle_swaprate(folio, GFP_KERNEL);
@@ -4125,17 +4204,13 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	 */
 	__folio_mark_uptodate(folio);
 
-	entry = mk_pte(&folio->page, vma->vm_page_prot);
-	entry = pte_sw_mkyoung(entry);
-	if (vma->vm_flags & VM_WRITE)
-		entry = pte_mkwrite(pte_mkdirty(entry));
-
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
 			&vmf->ptl);
 	if (!vmf->pte)
 		goto release;
-	if (vmf_pte_changed(vmf)) {
-		update_mmu_tlb(vma, vmf->address, vmf->pte);
+	if (vmf_pte_range_changed(vmf, nr_pages)) {
+		for (i = 0; i < nr_pages; i++)
+			update_mmu_tlb(vma, vmf->address + PAGE_SIZE * i, vmf->pte + i);
 		goto release;
 	}
 
@@ -4150,16 +4225,24 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		return handle_userfault(vmf, VM_UFFD_MISSING);
 	}
 
-	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
+	folio_ref_add(folio, nr_pages - 1);
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
 	folio_add_new_anon_rmap(folio, vma, vmf->address);
 	folio_add_lru_vma(folio, vma);
+
+	for (i = 0; i < nr_pages; i++) {
+		entry = mk_pte(folio_page(folio, i), vma->vm_page_prot);
+		entry = pte_sw_mkyoung(entry);
+		if (vma->vm_flags & VM_WRITE)
+			entry = pte_mkwrite(pte_mkdirty(entry));
 setpte:
-	if (uffd_wp)
-		entry = pte_mkuffd_wp(entry);
-	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+		if (uffd_wp)
+			entry = pte_mkuffd_wp(entry);
+		set_pte_at(vma->vm_mm, vmf->address + PAGE_SIZE * i, vmf->pte + i, entry);
 
-	/* No need to invalidate - it was non-present before */
-	update_mmu_cache(vma, vmf->address, vmf->pte);
+		/* No need to invalidate - it was non-present before */
+		update_mmu_cache(vma, vmf->address + PAGE_SIZE * i, vmf->pte + i);
+	}
 unlock:
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
diff --git a/mm/rmap.c b/mm/rmap.c
index 0c0d8857dfce..fb120c8717ec 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1284,25 +1284,36 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
 void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 		unsigned long address)
 {
-	int nr;
+	int nr = folio_nr_pages(folio);
 
-	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+	VM_BUG_ON_VMA(address < vma->vm_start || address + PAGE_SIZE * nr > vma->vm_end, vma);
 	__folio_set_swapbacked(folio);
 
-	if (likely(!folio_test_pmd_mappable(folio))) {
+	if (!folio_test_large(folio)) {
 		/* increment count (starts at -1) */
 		atomic_set(&folio->_mapcount, 0);
-		nr = 1;
+		__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
+	} else if (!folio_test_pmd_mappable(folio)) {
+		int i;
+
+		for (i = 0; i < nr; i++) {
+			struct page *page = folio_page(folio, i);
+
+			/* increment count (starts at -1) */
+			atomic_set(&page->_mapcount, 0);
+			__page_set_anon_rmap(folio, page, vma, address + PAGE_SIZE * i, 1);
+		}
+		/* increment count (starts at 0) */
+		atomic_set(&folio->_nr_pages_mapped, nr);
 	} else {
 		/* increment count (starts at -1) */
 		atomic_set(&folio->_entire_mapcount, 0);
 		atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);
-		nr = folio_nr_pages(folio);
+		__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
 		__lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
 	}
 
 	__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
-	__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
 }
 
 /**
@@ -1430,7 +1441,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
 		 * page of the folio is unmapped and at least one page
 		 * is still mapped.
 		 */
-		if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
+		if (folio_test_large(folio) && folio_test_anon(folio))
 			if (!compound || nr < nr_pmdmapped)
 				deferred_split_folio(folio);
 	}
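
(Editorial aside for readers skimming the attachment: the order-selection policy
discussed earlier in this message reduces to the three-entry fallback array in
alloc_anon_folio() above. The annotated excerpt below simply restates that hunk
for readability; it is not an additional change.)

    int preferred = arch_wants_pte_order() ? : PAGE_ALLOC_COSTLY_ORDER;
    int orders[] = {
            preferred,                           /* h/w preferred order, e.g. contpte */
            preferred > PAGE_ALLOC_COSTLY_ORDER ?
                    PAGE_ALLOC_COSTLY_ORDER : 0, /* intermediate step, only if distinct */
            0,                                   /* terminator: the loops stop here and
                                                    order-0 is handled by the fallback */
    };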

^ permalink raw reply related	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
  2023-07-03 13:53 ` Ryan Roberts
@ 2023-07-05 19:38   ` David Hildenbrand
  -1 siblings, 0 replies; 167+ messages in thread
From: David Hildenbrand @ 2023-07-05 19:38 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov,
	Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi
  Cc: linux-arm-kernel, linux-kernel, linux-mm

On 03.07.23 15:53, Ryan Roberts wrote:
> Hi All,
> 
> This is v2 of a series to implement variable order, large folios for anonymous
> memory. The objective of this is to improve performance by allocating larger
> chunks of memory during anonymous page faults. See [1] for background.
> 
> I've significantly reworked and simplified the patch set based on comments from
> Yu Zhao (thanks for all your feedback!). I've also renamed the feature to
> VARIABLE_THP, on Yu's advice.
> 
> The last patch is for arm64 to explicitly override the default
> arch_wants_pte_order() and is intended as an example. If this series is accepted
> I suggest taking the first 4 patches through the mm tree and the arm64 change
> could be handled through the arm64 tree separately. Neither has any build
> dependency on the other.
> 
> The one area where I haven't followed Yu's advice is in the determination of the
> size of folio to use. It was suggested that I have a single preferred large
> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there
> being existing overlapping populated PTEs, etc) then fallback immediately to
> order-0. It turned out that this approach caused a performance regression in the
> Speedometer benchmark. With my v1 patch, there were significant quantities of
> memory which could not be placed in the 64K bucket and were instead being
> allocated for the 32K and 16K buckets. With the proposed simplification, that
> memory ended up using the 4K bucket, so page faults increased by 2.75x compared
> to the v1 patch (although due to the 64K bucket, this number is still a bit
> lower than the baseline). So instead, I continue to calculate a folio order that
> is somewhere between the preferred order and 0. (See below for more details).
> 
> The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
> [2], which is a hard dependency. I have a branch at [3].
> 
> 
> Changes since v1 [1]
> --------------------
> 
>    - removed changes to arch-dependent vma_alloc_zeroed_movable_folio()
>    - replaced with arch-independent alloc_anon_folio()
>        - follows THP allocation approach
>    - no longer retry with intermediate orders if allocation fails
>        - fallback directly to order-0
>    - remove folio_add_new_anon_rmap_range() patch
>        - instead add its new functionality to folio_add_new_anon_rmap()
>    - remove batch-zap pte mappings optimization patch
>        - remove enabler folio_remove_rmap_range() patch too
>        - These offer real perf improvement so will submit separately
>    - simplify Kconfig
>        - single FLEXIBLE_THP option, which is independent of arch
>        - depends on TRANSPARENT_HUGEPAGE
>        - when enabled default to max anon folio size of 64K unless arch
>          explicitly overrides
>    - simplify changes to do_anonymous_page():
>        - no more retry loop
> 
> 
> Performance
> -----------
> 
> Below results show 3 benchmarks; kernel compilation with 8 jobs, kernel
> compilation with 80 jobs, and speedometer 2.0 (a javascript benchmark running in
> Chromium). All cases are running on Ampere Altra with 1 NUMA node enabled,
> Ubuntu 22.04 and XFS filesystem. Each benchmark is repeated 15 times over 5
> reboots and averaged.
> 
> 'anonfolio-lkml-v1' is the v1 patchset at [1]. 'anonfolio-lkml-v2' is this v2
> patchset. 'anonfolio-lkml-v2-simple-order' is anonfolio-lkml-v2 but with the
> order selection simplification that Yu Zhao suggested - I'm trying to justify
> here why I did not follow the advice.
> 
> 
> Kernel compilation with 8 jobs:
> 
> | kernel                         |   real-time |   kern-time |   user-time |
> |:-------------------------------|------------:|------------:|------------:|
> | baseline-4k                    |        0.0% |        0.0% |        0.0% |
> | anonfolio-lkml-v1              |       -5.3% |      -42.9% |       -0.6% |
> | anonfolio-lkml-v2-simple-order |       -4.4% |      -36.5% |       -0.4% |
> | anonfolio-lkml-v2              |       -4.8% |      -38.6% |       -0.6% |
> 
> We can see that the simple-order approach is responsible for a regression of
> 0.4%.
> 
> 
> Kernel compilation with 80 jobs:
> 
> | kernel                         |   real-time |   kern-time |   user-time |
> |:-------------------------------|------------:|------------:|------------:|
> | baseline-4k                    |        0.0% |        0.0% |        0.0% |
> | anonfolio-lkml-v1              |       -4.6% |      -45.7% |        1.4% |
> | anonfolio-lkml-v2-simple-order |       -4.7% |      -40.2% |       -0.1% |
> | anonfolio-lkml-v2              |       -5.0% |      -42.6% |       -0.3% |
> 
> simple-order costs 0.3 % here. v2 is actually performing higher than v1 due to
> fixing the v1 regression on user-time.
> 
> 
> Speedometer 2.0:
> 
> | kernel                         |   runs_per_min |
> |:-------------------------------|---------------:|
> | baseline-4k                    |           0.0% |
> | anonfolio-lkml-v1              |           0.7% |
> | anonfolio-lkml-v2-simple-order |          -0.9% |
> | anonfolio-lkml-v2              |           0.5% |
> 
> simple-order regresses performance by 0.9% vs the baseline, for a total negative
> swing of 1.6% vs v1. This is fixed by keeping the more complex order selection
> mechanism from v1.
> 
> 
> The remaining (kernel time) performance gap between v1 and v2 for the above
> benchmarks is due to the removal of the "batch zap" patch in v2. Adding that
> back in gives us the performance back. I intend to submit that as a separate
> series once this series is accepted.
> 
> 
> [1] https://lore.kernel.org/linux-mm/20230626171430.3167004-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
> [3] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anonfolio-lkml_v2
> 
> Thanks,
> Ryan

Hi Ryan,

is page migration already working as expected (what about page 
compaction?), and do we handle migration -ENOMEM when allocating a 
target page: do we split and fall back to 4k page migration?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
  2023-07-05 19:38   ` David Hildenbrand
@ 2023-07-06  8:02     ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-06  8:02 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Matthew Wilcox,
	Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas,
	Will Deacon, Anshuman Khandual, Yang Shi
  Cc: linux-arm-kernel, linux-kernel, linux-mm

On 05/07/2023 20:38, David Hildenbrand wrote:
> On 03.07.23 15:53, Ryan Roberts wrote:
>> Hi All,
>>
>> This is v2 of a series to implement variable order, large folios for anonymous
>> memory. The objective of this is to improve performance by allocating larger
>> chunks of memory during anonymous page faults. See [1] for background.
>>

[...]

>> Thanks,
>> Ryan
> 
> Hi Ryan,
> 
> is page migration already working as expected (what about page compaction?), and
> do we handle migration -ENOMEM when allocating a target page: do we split and
> fall back to 4k page migration?
> 

Hi David, All,

This series aims to be the bare minimum to demonstrate allocation of large anon
folios. As such, there is a laundry list of things that need to be done for this
feature to play nicely with other features. My preferred route is to merge this
with its Kconfig defaulted to disabled, and its Kconfig description clearly
shouting that it's EXPERIMENTAL with an explanation of why (similar to
READ_ONLY_THP_FOR_FS).

That said, I've put together a table of the items that I'm aware of that need
attention. It would be great if people can review and add any missing items.
Then we can hopefully parallelize the implementation work. David, I don't think
the items you raised are covered - would you mind providing a bit more detail so
I can add them to the list? (or just add them to the list yourself, if you prefer).

---

- item:
    mlock

  description: >-
    Large, pte-mapped folios are ignored when mlock is requested. Code comment
    for mlock_vma_folio() says "...filter out pte mappings of THPs, which
    cannot be consistently counted: a pte mapping of the THP head cannot be
    distinguished by the page alone."

  location:
    - mlock_pte_range()
    - mlock_vma_folio()

  assignee:
    Yin, Fengwei


- item:
    numa balancing

  description: >-
    Large, pte-mapped folios are ignored by numa-balancing code. Commit
    comment (e81c480): "We're going to have THP mapped with PTEs. It will
    confuse numabalancing. Let's skip them for now."

  location:
    - do_numa_page()

  assignee:
    <none>


- item:
    madvise

  description: >-
    MADV_COLD, MADV_PAGEOUT, MADV_FREE: For large folios, the code assumes
    exclusive only if mapcount==1, else skips the remainder of the operation.
    For large, pte-mapped folios, exclusive folios can have mapcount up to
    nr_pages and still be exclusive (see the illustrative sketch after this
    list). Even better: don't split the folio if it fits entirely within the
    range? Discussion at

https://lore.kernel.org/linux-mm/6cec6f68-248e-63b4-5615-9e0f3f819a0a@redhat.com/
    talks about changing folio mapcounting - may help determine if exclusive
    without a pgtable scan?

  location:
    - madvise_cold_or_pageout_pte_range()
    - madvise_free_pte_range()

  assignee:
    <none>


- item:
    shrink_folio_list

  description: >-
    Raised by Yu Zhao; I can't see the problem in the code - need
    clarification

  location:
    - shrink_folio_list()

  assignee:
    <none>


- item:
    compaction

  description: >-
    Raised at LSFMM: Compaction skips non-order-0 pages. Already a problem for
    page-cache pages today. Is my understanding correct?

  location:
    - <where?>

  assignee:
    <none>
---
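
As referenced in the madvise item above, here is a minimal, illustrative sketch
(not the actual mm/madvise.c logic) of why a folio-level mapcount test behaves
differently for large, pte-mapped folios:

    #include <linux/mm.h>

    /*
     * Illustrative only: a large folio that a single process maps with N PTEs
     * has folio_mapcount() == folio_nr_pages(), so a "mapcount == 1" exclusivity
     * test skips it even though it is exclusive.
     */
    static bool naive_exclusive(struct folio *folio)
    {
            return folio_mapcount(folio) == 1;  /* false for pte-mapped large folios */
    }

    /*
     * Necessary but not sufficient: subpages could in principle be mapped by
     * different processes, which is why the mapcount-rework discussion linked
     * above is relevant.
     */
    static bool maybe_exclusive_large(struct folio *folio)
    {
            return folio_mapcount(folio) == folio_nr_pages(folio);
    }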

Thanks,
Ryan



^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
  2023-07-05  2:07         ` Yu Zhao
@ 2023-07-06 19:33           ` Matthew Wilcox
  -1 siblings, 0 replies; 167+ messages in thread
From: Matthew Wilcox @ 2023-07-06 19:33 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Ryan Roberts, Andrew Morton, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On Tue, Jul 04, 2023 at 08:07:19PM -0600, Yu Zhao wrote:
> >  - On arm64 when the process has marked the VMA for THP (or when
> > transparent_hugepage=always) but the VMA does not meet the requirements for a
> > PMD-sized mapping (or we failed to allocate, ...) then I'd like to map using
> > contpte. For 4K base pages this is 64K (order-4), for 16K this is 2M (order-7)
> > and for 64K this is 2M (order-5). The 64K base page case is very important since
> > the PMD size for that base page is 512MB which is almost impossible to allocate
> > in practice.
> 
> Which case (server or client) are you focusing on here? For our client
> devices, I can confidently say that 64KB has to be after 16KB, if it
> happens at all. For servers in general, I don't know of any major
> memory-intensive workloads that are not THP-aware, i.e., I don't think
> "VMA does not meet the requirements" is a concern.

It sounds like you've done some measurements, and I'd like to understand
those a bit better.  There are a number of factors involved:

 - A larger page size shrinks the length of the LRU list, so systems
   which see heavy LRU lock contention benefit more
 - A larger page size has more internal fragmentation, so we run out of
   memory and have to do reclaim more often (and maybe workloads which
   used to fit in DRAM now do not)
(probably others; I'm not at 100% right now)

I think concerns about "allocating lots of order-2 folios makes it harder
to allocate order-4 folios" are _probably_ not warranted (without data
to prove otherwise).  All anonymous memory is movable, so our compaction
code should be able to create larger order folios.


* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-03 13:53   ` Ryan Roberts
@ 2023-07-07  8:01     ` Huang, Ying
  -1 siblings, 0 replies; 167+ messages in thread
From: Huang, Ying @ 2023-07-07  8:01 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

Ryan Roberts <ryan.roberts@arm.com> writes:

> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
> allocated in large folios of a specified order. All pages of the large
> folio are pte-mapped during the same page fault, significantly reducing
> the number of page faults. The number of per-page operations (e.g. ref
> counting, rmap management lru list management) are also significantly
> reduced since those ops now become per-folio.

I like the idea of sharing as much code as possible between large
(anonymous) folios and THP.  Finally, THP becomes just a special kind of
large folio.

Although we can use smaller page order for FLEXIBLE_THP, it's hard to
avoid internal fragmentation completely.  So, I think that finally we
will need to provide a mechanism for the users to opt out, e.g.,
something like "always madvise never" via
/sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
a good idea to reuse the existing interface of THP.

Best Regards,
Huang, Ying

* Re: [PATCH v2 2/5] mm: Allow deferred splitting of arbitrary large anon folios
  2023-07-03 13:53   ` Ryan Roberts
@ 2023-07-07  8:21     ` Huang, Ying
  -1 siblings, 0 replies; 167+ messages in thread
From: Huang, Ying @ 2023-07-07  8:21 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

Ryan Roberts <ryan.roberts@arm.com> writes:

> With the introduction of large folios for anonymous memory, we would
> like to be able to split them when they have unmapped subpages, in order
> to free those unused pages under memory pressure. So remove the
> artificial requirement that the large folio needed to be at least
> PMD-sized.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> Reviewed-by: Yu Zhao <yuzhao@google.com>
> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
> ---
>  mm/rmap.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 82ef5ba363d1..bbcb2308a1c5 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1474,7 +1474,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
>  		 * page of the folio is unmapped and at least one page
>  		 * is still mapped.
>  		 */
> -		if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
> +		if (folio_test_large(folio) && folio_test_anon(folio))
>  			if (!compound || nr < nr_pmdmapped)
>  				deferred_split_folio(folio);
>  	}

One possible issue is that even for large folios mapped only in one
process, in zap_pte_range(), we will always call deferred_split_folio()
unnecessarily before freeing a large folio.

Best Regards,
Huang, Ying

* Re: [PATCH v2 2/5] mm: Allow deferred splitting of arbitrary large anon folios
  2023-07-07  8:21     ` Huang, Ying
@ 2023-07-07  9:42       ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-07  9:42 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

Somehow I managed to reply only to the linux-arm-kernel list on first attempt so
resending:

On 07/07/2023 09:21, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> With the introduction of large folios for anonymous memory, we would
>> like to be able to split them when they have unmapped subpages, in order
>> to free those unused pages under memory pressure. So remove the
>> artificial requirement that the large folio needed to be at least
>> PMD-sized.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> Reviewed-by: Yu Zhao <yuzhao@google.com>
>> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
>> ---
>>  mm/rmap.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 82ef5ba363d1..bbcb2308a1c5 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1474,7 +1474,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
>>  		 * page of the folio is unmapped and at least one page
>>  		 * is still mapped.
>>  		 */
>> -		if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
>> +		if (folio_test_large(folio) && folio_test_anon(folio))
>>  			if (!compound || nr < nr_pmdmapped)
>>  				deferred_split_folio(folio);
>>  	}
> 
> One possible issue is that even for large folios mapped only in one
> process, in zap_pte_range(), we will always call deferred_split_folio()
> unnecessarily before freeing a large folio.

Hi Huang, thanks for reviewing!

I have a patch that solves this problem by determining a range of ptes covered
by a single folio and doing a "batch zap". This prevents the need to add the
folio to the deferred split queue, only to remove it again shortly afterwards.
This reduces lock contention and I can measure a performance improvement for the
kernel compilation benchmark. See [1].

However, I decided to remove it from this patch set on Yu Zhao's advice. We are
aiming for the minimal patch set to start with and wanted to focus people on
that. I intend to submit it separately later on.

[1] https://lore.kernel.org/linux-mm/20230626171430.3167004-8-ryan.roberts@arm.com/
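
For context, the shape of that helper is roughly the following (an
illustrative sketch only; the function name is invented here and the real
patch at [1] differs in detail):

  #include <linux/mm.h>
  #include <linux/pgtable.h>

  /*
   * Count how many consecutive present ptes, starting at @pte, map the
   * pages that follow @page within its folio (@max_nr is capped by the
   * caller to both the folio size and the end of the pte range).
   * zap_pte_range() can then tear down the whole batch in one go and
   * drop the folio's references together, instead of unmapping
   * page-by-page and bouncing the folio through the deferred split
   * queue.
   */
  static int nr_ptes_cont_mapped(struct page *page, pte_t *pte, int max_nr)
  {
          unsigned long pfn = page_to_pfn(page);
          int nr = 1;

          while (nr < max_nr) {
                  pte_t ptent = ptep_get(pte + nr);

                  if (!pte_present(ptent) || pte_pfn(ptent) != pfn + nr)
                          break;
                  nr++;
          }

          return nr;
  }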

Thanks,
Ryan

> 
> Best Regards,
> Huang, Ying
> 

* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-07  8:01     ` Huang, Ying
@ 2023-07-07  9:52       ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-07  9:52 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On 07/07/2023 09:01, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
>> allocated in large folios of a specified order. All pages of the large
>> folio are pte-mapped during the same page fault, significantly reducing
>> the number of page faults. The number of per-page operations (e.g. ref
>> counting, rmap management lru list management) are also significantly
>> reduced since those ops now become per-folio.
> 
> I likes the idea to share as much code as possible between large
> (anonymous) folio and THP.  Finally, THP becomes just a special kind of
> large folio.
> 
> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
> avoid internal fragmentation completely.  So, I think that finally we
> will need to provide a mechanism for the users to opt out, e.g.,
> something like "always madvise never" via
> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
> a good idea to reuse the existing interface of THP.

I wouldn't want to tie this to the existing interface, simply because that
implies that we would want to follow the "always" and "madvise" advice too; that
means that on a thp=madvise system (which is certainly the case for android and
other client systems) we would have to disable large anon folios for VMAs that
haven't explicitly opted in. That breaks the intention that this should be an
invisible performance boost. I think it's important to set the policy for use of
THP separately from the use of large anon folios.

I could be persuaded on the merits of a new runtime enable/disable interface if
there is consensus.

> 
> Best Regards,
> Huang, Ying



* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
  2023-07-06 19:33           ` Matthew Wilcox
@ 2023-07-07 10:00             ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-07 10:00 UTC (permalink / raw)
  To: Matthew Wilcox, Yu Zhao
  Cc: Andrew Morton, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On 06/07/2023 20:33, Matthew Wilcox wrote:
> On Tue, Jul 04, 2023 at 08:07:19PM -0600, Yu Zhao wrote:
>>>  - On arm64 when the process has marked the VMA for THP (or when
>>> transparent_hugepage=always) but the VMA does not meet the requirements for a
>>> PMD-sized mapping (or we failed to allocate, ...) then I'd like to map using
>>> contpte. For 4K base pages this is 64K (order-4), for 16K this is 2M (order-7)
>>> and for 64K this is 2M (order-5). The 64K base page case is very important since
>>> the PMD size for that base page is 512MB which is almost impossible to allocate
>>> in practice.
>>
>> Which case (server or client) are you focusing on here? For our client
>> devices, I can confidently say that 64KB has to be after 16KB, if it
>> happens at all. For servers in general, I don't know of any major
>> memory-intensive workloads that are not THP-aware, i.e., I don't think
>> "VMA does not meet the requirements" is a concern.
> 
> It sounds like you've done some measurements, and I'd like to understand
> those a bit better.  There are a number of factors involved:

I'm not sure if that's a question for me or Yu. I haven't personally done any
measurements for the 64K base page case. But Arm has a partner that is pushing
for this. I'm hoping to see some test results from them posted publicly in the
coming weeks. See [1] for more explanation on the rationale.

[1]
https://lore.kernel.org/linux-mm/4d4c45a2-0037-71de-b182-f516fee07e67@arm.com/T/#m8a7c4b71f94224ec3fe6d0a407f48d74c789ba4f
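
For reference, the arm64 override in the last patch amounts to returning
the contpte order; roughly (a sketch, not the literal patch):

  /* arch/arm64/include/asm/pgtable.h */
  #define arch_wants_pte_order arch_wants_pte_order
  static inline int arch_wants_pte_order(void)
  {
          /*
           * One contpte span: order-4 (64K) with 4K pages, order-7 (2M)
           * with 16K pages, order-5 (2M) with 64K pages.
           */
          return CONT_PTE_SHIFT - PAGE_SHIFT;
  }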

> 
>  - A larger page size shrinks the length of the LRU list, so systems
>    which see heavy LRU lock contention benefit more
>  - A larger page size has more internal fragmentation, so we run out of
>    memory and have to do reclaim more often (and maybe workload which
>    used to fit in DRAM now do not)
> (probably others; i'm not at 100% right now)
> 
> I think concerns about "allocating lots of order-2 folios makes it harder
> to allocate order-4 folios" are _probably_ not warranted (without data
> to prove otherwise).  All anonymous memory is movable, so our compaction
> code should be able to create larger order folios.
> 



* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-07  9:52       ` Ryan Roberts
@ 2023-07-07 11:29         ` David Hildenbrand
  -1 siblings, 0 replies; 167+ messages in thread
From: David Hildenbrand @ 2023-07-07 11:29 UTC (permalink / raw)
  To: Ryan Roberts, Huang, Ying
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual,
	Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On 07.07.23 11:52, Ryan Roberts wrote:
> On 07/07/2023 09:01, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>
>>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
>>> allocated in large folios of a specified order. All pages of the large
>>> folio are pte-mapped during the same page fault, significantly reducing
>>> the number of page faults. The number of per-page operations (e.g. ref
>>> counting, rmap management lru list management) are also significantly
>>> reduced since those ops now become per-folio.
>>
>> I likes the idea to share as much code as possible between large
>> (anonymous) folio and THP.  Finally, THP becomes just a special kind of
>> large folio.
>>
>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>> avoid internal fragmentation completely.  So, I think that finally we
>> will need to provide a mechanism for the users to opt out, e.g.,
>> something like "always madvise never" via
>> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>> a good idea to reuse the existing interface of THP.
> 
> I wouldn't want to tie this to the existing interface, simply because that
> implies that we would want to follow the "always" and "madvise" advice too; That
> means that on a thp=madvise system (which is certainly the case for android and
> other client systems) we would have to disable large anon folios for VMAs that
> haven't explicitly opted in. That breaks the intention that this should be an
> invisible performance boost. I think it's important to set the policy for use of

It will never ever be a completely invisible performance boost, just 
like ordinary THP.

Using the exact same existing toggle is the right thing to do. If 
someone specifies "never" or "madvise", then do exactly that.

It might make sense to have more modes or additional toggles, but 
"madvise=never" means no memory waste.


I remember I raised it already in the past, but you *absolutely* have to 
respect the MADV_NOHUGEPAGE flag. There is user space out there (for 
example, userfaultfd) that doesn't want the kernel to populate any 
additional page tables. So if you have to respect that already, then 
also respect MADV_HUGEPAGE, simple.
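
Concretely, the gate in the fault path would look something like the
following (rough, untested sketch; the helper name is invented and I'm
assuming the existing hugepage_flags_enabled()/hugepage_flags_always()
sysfs helpers):

  static bool vma_wants_large_anon_folio(struct vm_area_struct *vma)
  {
          /* MADV_NOHUGEPAGE (e.g. userfaultfd users) always wins */
          if (vma->vm_flags & VM_NOHUGEPAGE)
                  return false;
          /* MADV_HUGEPAGE opts the VMA in under "madvise" */
          if (vma->vm_flags & VM_HUGEPAGE)
                  return hugepage_flags_enabled();
          /* otherwise only when the global policy is "always" */
          return hugepage_flags_always();
  }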

> THP separately to use of large anon folios.
> 
> I could be persuaded on the merrits of a new runtime enable/disable interface if
> there is concensus.

There would have to be a very good reason for a completely separate
control. Bypassing MADV_NOHUGEPAGE or "madvise=never" simply because we 
add a "flexible" before the THP sounds broken.

-- 
Cheers,

David / dhildenb



* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
  2023-07-06  8:02     ` Ryan Roberts
@ 2023-07-07 11:40       ` David Hildenbrand
  -1 siblings, 0 replies; 167+ messages in thread
From: David Hildenbrand @ 2023-07-07 11:40 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov,
	Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi
  Cc: linux-arm-kernel, linux-kernel, linux-mm

On 06.07.23 10:02, Ryan Roberts wrote:
> On 05/07/2023 20:38, David Hildenbrand wrote:
>> On 03.07.23 15:53, Ryan Roberts wrote:
>>> Hi All,
>>>
>>> This is v2 of a series to implement variable order, large folios for anonymous
>>> memory. The objective of this is to improve performance by allocating larger
>>> chunks of memory during anonymous page faults. See [1] for background.
>>>
> 
> [...]
> 
>>> Thanks,
>>> Ryan
>>
>> Hi Ryan,
>>
>> is page migration already working as expected (what about page compaction?), and
>> do we handle migration -ENOMEM when allocating a target page: do we split an
>> fallback to 4k page migration?
>>
> 
> Hi David, All,

Hi Ryan,

thanks a lot for the list.

But can you comment on the page migration part (IOW did you try it already)?

For example, memory hotunplug, CMA, MCE handling, compaction all rely on 
page migration of something that was allocated using GFP_MOVABLE to 
actually work.

Compaction seems to skip any higher-order folios, but the question is if 
the underlying migration itself works.

If it already works: great! If not, this really has to be tackled early, 
because otherwise we'll be breaking the GFP_MOVABLE semantics.

> 
> This series aims to be the bare minimum to demonstrate allocation of large anon
> folios. As such, there is a laundry list of things that need to be done for this
> feature to play nicely with other features. My preferred route is to merge this
> with it's Kconfig defaulted to disabled, and its Kconfig description clearly
> shouting that it's EXPERIMENTAL with an explanation of why (similar to
> READ_ONLY_THP_FOR_FS).

As long as we are not sure about the user space control and as long as
basic functionality is not working (for example, page migration), I would
tend not to merge this early just for the sake of it.

But yes, something like mlock can eventually be tackled later: as long 
as there is a runtime interface to disable it ;)

> 
> That said, I've put together a table of the items that I'm aware of that need
> attention. It would be great if people can review and add any missing items.
> Then we can hopefully parallelize the implementation work. David, I don't think
> the items you raised are covered - would you mind providing a bit more detail so
> I can add them to the list? (or just add them to the list yourself, if you prefer).
> 
> ---
> 
> - item:
>      mlock
> 
>    description: >-
>      Large, pte-mapped folios are ignored when mlock is requested. Code comment
>      for mlock_vma_folio() says "...filter out pte mappings of THPs, which
>      cannot be consistently counted: a pte mapping of the THP head cannot be
>      distinguished by the page alone."
> 
>    location:
>      - mlock_pte_range()
>      - mlock_vma_folio()
> 
>    assignee:
>      Yin, Fengwei
> 
> 
> - item:
>      numa balancing
> 
>    description: >-
>      Large, pte-mapped folios are ignored by numa-balancing code. Commit
>      comment (e81c480): "We're going to have THP mapped with PTEs. It will
>      confuse numabalancing. Let's skip them for now."
> 
>    location:
>      - do_numa_page()
> 
>    assignee:
>      <none>
> 
> 
> - item:
>      madvise
> 
>    description: >-
>      MADV_COLD, MADV_PAGEOUT, MADV_FREE: For large folios, code assumes
>      exclusive only if mapcount==1, else skips remainder of operation. For
>      large, pte-mapped folios, exclusive folios can have mapcount upto nr_pages
>      and still be exclusive. Even better; don't split the folio if it fits
>      entirely within the range? Discussion at
> 
> https://lore.kernel.org/linux-mm/6cec6f68-248e-63b4-5615-9e0f3f819a0a@redhat.com/
>      talks about changing folio mapcounting - may help determine if exclusive
>      without pgtable scan?
> 
>    location:
>      - madvise_cold_or_pageout_pte_range()
>      - madvise_free_pte_range()
> 
>    assignee:
>      <none>
> 
> 
> - item:
>      shrink_folio_list
> 
>    description: >-
>      Raised by Yu Zhao; I can't see the problem in the code - need
>      clarification
> 
>    location:
>      - shrink_folio_list()
> 
>    assignee:
>      <none>
> 
> 
> - item:
>      compaction
> 
>    description: >-
>      Raised at LSFMM: Compaction skips non-order-0 pages. Already problem for
>      page-cache pages today. Is my understand correct?
> 
>    location:
>      - <where?>
> 
>    assignee:
>      <none>

I'm still thinking about the whole mapcount thingy (and I burned way too 
much time on that yesterday), which is a big item for such a list and 
affects some of these items.

A pagetable scan is pretty much irrelevant for order-2 pages. But once 
we're talking about higher orders we really don't want to do that.

I'm preparing a writeup with users and challenges.


Is swapping working as expected? zswap?

-- 
Cheers,

David / dhildenb



* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
  2023-07-07 11:40       ` David Hildenbrand
@ 2023-07-07 13:12         ` Matthew Wilcox
  -1 siblings, 0 replies; 167+ messages in thread
From: Matthew Wilcox @ 2023-07-07 13:12 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Ryan Roberts, Andrew Morton, Kirill A. Shutemov, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual,
	Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote:
> On 06.07.23 10:02, Ryan Roberts wrote:
> But can you comment on the page migration part (IOW did you try it already)?
> 
> For example, memory hotunplug, CMA, MCE handling, compaction all rely on
> page migration of something that was allocated using GFP_MOVABLE to actually
> work.
> 
> Compaction seems to skip any higher-order folios, but the question is if the
> udnerlying migration itself works.
> 
> If it already works: great! If not, this really has to be tackled early,
> because otherwise we'll be breaking the GFP_MOVABLE semantics.

I have looked at this a bit.  _Migration_ should be fine.  _Compaction_
is not.

If you look at a function like folio_migrate_mapping(), it all seems
appropriately folio-ised.  There might be something in there that is
slightly wrong, but that would just be a bug to fix, not a huge
architectural problem.

The problem comes in the callers of migrate_pages().  They pass a
new_folio_t callback.  alloc_migration_target() is the usual one passed
and as far as I can tell is fine.  I've seen no problems reported with it.

compaction_alloc() is a disaster, and I don't know how to fix it.
The compaction code has its own allocator which is populated with order-0
folios.  How it populates that freelist is awful ... see split_map_pages()
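
To spell out the contrast (paraphrased from memory, so details may be
off): alloc_migration_target() already allocates the destination at the
source folio's order, whereas compaction_alloc() can only hand out the
order-0 folios that split_map_pages() put on its freelist.

  /*
   * Roughly what alloc_migration_target() does for a large folio
   * (gfp and nodemask details elided): the destination is allocated
   * at the same order as the source, so nothing needs splitting.
   */
  static struct folio *migration_dst_sketch(struct folio *src, int nid)
  {
          return __folio_alloc(GFP_HIGHUSER_MOVABLE, folio_order(src),
                               nid, NULL);
  }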

> Is swapping working as expected? zswap?

Suboptimally.  Swap will split folios in order to swap them.  Somebody
needs to fix that, but it should work.


* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
  2023-07-07 13:12         ` Matthew Wilcox
@ 2023-07-07 13:24           ` David Hildenbrand
  -1 siblings, 0 replies; 167+ messages in thread
From: David Hildenbrand @ 2023-07-07 13:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ryan Roberts, Andrew Morton, Kirill A. Shutemov, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual,
	Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On 07.07.23 15:12, Matthew Wilcox wrote:
> On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote:
>> On 06.07.23 10:02, Ryan Roberts wrote:
>> But can you comment on the page migration part (IOW did you try it already)?
>>
>> For example, memory hotunplug, CMA, MCE handling, compaction all rely on
>> page migration of something that was allocated using GFP_MOVABLE to actually
>> work.
>>
>> Compaction seems to skip any higher-order folios, but the question is if the
>> udnerlying migration itself works.
>>
>> If it already works: great! If not, this really has to be tackled early,
>> because otherwise we'll be breaking the GFP_MOVABLE semantics.
> 
> I have looked at this a bit.  _Migration_ should be fine.  _Compaction_
> is not.

Thanks! Very nice if at least ordinary migration works.

> 
> If you look at a function like folio_migrate_mapping(), it all seems
> appropriately folio-ised.  There might be something in there that is
> slightly wrong, but that would just be a bug to fix, not a huge
> architectural problem.
> 
> The problem comes in the callers of migrate_pages().  They pass a
> new_folio_t callback.  alloc_migration_target() is the usual one passed
> and as far as I can tell is fine.  I've seen no problems reported with it.
> 
> compaction_alloc() is a disaster, and I don't know how to fix it.
> The compaction code has its own allocator which is populated with order-0
> folios.  How it populates that freelist is awful ... see split_map_pages()

Yeah, all that code was written under the assumption that we're moving 
order-0 pages (which is what the anon+pagecache pages are).

From what I recall, we're allocating order-0 pages from the high memory
addresses, so we can migrate from low memory addresses, effectively 
freeing up low memory addresses and filling high memory addresses.

Adjusting that will be ... interesting. Instead of allocating order-0 
pages from high addresses, we might want to allocate "as large as 
possible" ("grab what we can") from high addresses and then have our own 
kind of buddy for allocating a compaction destination page from that
pool, depending on our source page. Nasty.

What should always work is the split->migrate. But that's definitely not 
what we want in many cases.

> 
>> Is swapping working as expected? zswap?
> 
> Suboptimally.  Swap will split folios in order to swap them.  Somebody
> needs to fix that, but it should work.

Good!

It would be great to have some kind of a feature matrix that tells us 
what works perfectly, sub-optimally, barely, not at all (and what has 
not been tested). Maybe (likely!) we'll also find things that are 
sub-optimal for ordinary THP (like swapping; I'm not even sure about that).

I suspect that KSM should work mostly fine with flexible-thp. When 
deduplicating, we'll simply split the compound page and proceed as
expected. But might be worth testing as well.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
@ 2023-07-07 13:24           ` David Hildenbrand
  0 siblings, 0 replies; 167+ messages in thread
From: David Hildenbrand @ 2023-07-07 13:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ryan Roberts, Andrew Morton, Kirill A. Shutemov, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual,
	Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On 07.07.23 15:12, Matthew Wilcox wrote:
> On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote:
>> On 06.07.23 10:02, Ryan Roberts wrote:
>> But can you comment on the page migration part (IOW did you try it already)?
>>
>> For example, memory hotunplug, CMA, MCE handling, compaction all rely on
>> page migration of something that was allocated using GFP_MOVABLE to actually
>> work.
>>
>> Compaction seems to skip any higher-order folios, but the question is if the
>> udnerlying migration itself works.
>>
>> If it already works: great! If not, this really has to be tackled early,
>> because otherwise we'll be breaking the GFP_MOVABLE semantics.
> 
> I have looked at this a bit.  _Migration_ should be fine.  _Compaction_
> is not.

Thanks! Very nice if at least ordinary migration works.

> 
> If you look at a function like folio_migrate_mapping(), it all seems
> appropriately folio-ised.  There might be something in there that is
> slightly wrong, but that would just be a bug to fix, not a huge
> architectural problem.
> 
> The problem comes in the callers of migrate_pages().  They pass a
> new_folio_t callback.  alloc_migration_target() is the usual one passed
> and as far as I can tell is fine.  I've seen no problems reported with it.
> 
> compaction_alloc() is a disaster, and I don't know how to fix it.
> The compaction code has its own allocator which is populated with order-0
> folios.  How it populates that freelist is awful ... see split_map_pages()

Yeah, all that code was written under the assumption that we're moving 
order-0 pages (which is what the anon+pagecache pages are, for the most part).

From what I recall, we're allocating order-0 pages from the high memory 
addresses, so we can migrate from low memory addresses, effectively 
freeing up low memory addresses and filling high memory addresses.

Adjusting that will be ... interesting. Instead of allocating order-0 
pages from high addresses, we might want to allocate "as large as 
possible" ("grab what we can") from high addresses and then have our own 
kind of buddy for allocating from that pool a compaction destination 
page, depending on our source page. Nasty.

What should always work is the split->migrate. But that's definitely not 
what we want in many cases.

> 
>> Is swapping working as expected? zswap?
> 
> Suboptimally.  Swap will split folios in order to swap them.  Somebody
> needs to fix that, but it should work.

Good!

It would be great to have some kind of feature matrix that tells us 
what works perfectly, sub-optimally, barely, or not at all (and what has 
not been tested). Maybe (likely!) we'll also find things that are 
sub-optimal for ordinary THP (like swapping; I'm not even sure about that one).

I suspect that KSM should work mostly fine with flexible-thp. When 
deduplicating, we'll simply split the compound page and proceed as 
expected. But it might be worth testing as well.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-07 11:29         ` David Hildenbrand
@ 2023-07-07 13:57           ` Matthew Wilcox
  -1 siblings, 0 replies; 167+ messages in thread
From: Matthew Wilcox @ 2023-07-07 13:57 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Ryan Roberts, Huang, Ying, Andrew Morton, Kirill A. Shutemov,
	Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
> On 07.07.23 11:52, Ryan Roberts wrote:
> > On 07/07/2023 09:01, Huang, Ying wrote:
> > > Although we can use smaller page order for FLEXIBLE_THP, it's hard to
> > > avoid internal fragmentation completely.  So, I think that finally we
> > > will need to provide a mechanism for the users to opt out, e.g.,
> > > something like "always madvise never" via
> > > /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
> > > a good idea to reuse the existing interface of THP.
> > 
> > I wouldn't want to tie this to the existing interface, simply because that
> > implies that we would want to follow the "always" and "madvise" advice too; That
> > means that on a thp=madvise system (which is certainly the case for android and
> > other client systems) we would have to disable large anon folios for VMAs that
> > haven't explicitly opted in. That breaks the intention that this should be an
> > invisible performance boost. I think it's important to set the policy for use of
> 
> It will never ever be a completely invisible performance boost, just like
> ordinary THP.
> 
> Using the exact same existing toggle is the right thing to do. If someone
> specify "never" or "madvise", then do exactly that.
> 
> It might make sense to have more modes or additional toggles, but
> "madvise=never" means no memory waste.

I hate the existing mechanisms.  They are an abdication of our
responsibility, and an attempt to blame the user (be it the sysadmin
or the programmer) of our code for using it wrongly.  We should not
replicate this mistake.

Our code should be auto-tuning.  I posted a long, detailed outline here:
https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/

> I remember I raised it already in the past, but you *absolutely* have to
> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
> example, userfaultfd) that doesn't want the kernel to populate any
> additional page tables. So if you have to respect that already, then also
> respect MADV_HUGEPAGE, simple.

Possibly having uffd enabled on a VMA should disable using large folios,
I can get behind that.  But the notion that userspace knows what it's
doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
know what it's doing.


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
@ 2023-07-07 13:57           ` Matthew Wilcox
  0 siblings, 0 replies; 167+ messages in thread
From: Matthew Wilcox @ 2023-07-07 13:57 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Ryan Roberts, Huang, Ying, Andrew Morton, Kirill A. Shutemov,
	Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
> On 07.07.23 11:52, Ryan Roberts wrote:
> > On 07/07/2023 09:01, Huang, Ying wrote:
> > > Although we can use smaller page order for FLEXIBLE_THP, it's hard to
> > > avoid internal fragmentation completely.  So, I think that finally we
> > > will need to provide a mechanism for the users to opt out, e.g.,
> > > something like "always madvise never" via
> > > /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
> > > a good idea to reuse the existing interface of THP.
> > 
> > I wouldn't want to tie this to the existing interface, simply because that
> > implies that we would want to follow the "always" and "madvise" advice too; That
> > means that on a thp=madvise system (which is certainly the case for android and
> > other client systems) we would have to disable large anon folios for VMAs that
> > haven't explicitly opted in. That breaks the intention that this should be an
> > invisible performance boost. I think it's important to set the policy for use of
> 
> It will never ever be a completely invisible performance boost, just like
> ordinary THP.
> 
> Using the exact same existing toggle is the right thing to do. If someone
> specify "never" or "madvise", then do exactly that.
> 
> It might make sense to have more modes or additional toggles, but
> "madvise=never" means no memory waste.

I hate the existing mechanisms.  They are an abdication of our
responsibility, and an attempt to blame the user (be it the sysadmin
or the programmer) of our code for using it wrongly.  We should not
replicate this mistake.

Our code should be auto-tuning.  I posted a long, detailed outline here:
https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/

> I remember I raised it already in the past, but you *absolutely* have to
> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
> example, userfaultfd) that doesn't want the kernel to populate any
> additional page tables. So if you have to respect that already, then also
> respect MADV_HUGEPAGE, simple.

Possibly having uffd enabled on a VMA should disable using large folios,
I can get behind that.  But the notion that userspace knows what it's
doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
know what it's doing.



^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-07 13:57           ` Matthew Wilcox
@ 2023-07-07 14:07             ` David Hildenbrand
  -1 siblings, 0 replies; 167+ messages in thread
From: David Hildenbrand @ 2023-07-07 14:07 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ryan Roberts, Huang, Ying, Andrew Morton, Kirill A. Shutemov,
	Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On 07.07.23 15:57, Matthew Wilcox wrote:
> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
>> On 07.07.23 11:52, Ryan Roberts wrote:
>>> On 07/07/2023 09:01, Huang, Ying wrote:
>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>>>> avoid internal fragmentation completely.  So, I think that finally we
>>>> will need to provide a mechanism for the users to opt out, e.g.,
>>>> something like "always madvise never" via
>>>> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>>>> a good idea to reuse the existing interface of THP.
>>>
>>> I wouldn't want to tie this to the existing interface, simply because that
>>> implies that we would want to follow the "always" and "madvise" advice too; That
>>> means that on a thp=madvise system (which is certainly the case for android and
>>> other client systems) we would have to disable large anon folios for VMAs that
>>> haven't explicitly opted in. That breaks the intention that this should be an
>>> invisible performance boost. I think it's important to set the policy for use of
>>
>> It will never ever be a completely invisible performance boost, just like
>> ordinary THP.
>>
>> Using the exact same existing toggle is the right thing to do. If someone
>> specify "never" or "madvise", then do exactly that.
>>
>> It might make sense to have more modes or additional toggles, but
>> "madvise=never" means no memory waste.
> 
> I hate the existing mechanisms.  They are an abdication of our
> responsibility, and an attempt to blame the user (be it the sysadmin
> or the programmer) of our code for using it wrongly.  We should not
> replicate this mistake.

I don't agree regarding the programmer responsibility. In some cases the 
programmer really doesn't want to get more memory populated than 
requested -- and knows exactly why setting MADV_NOHUGEPAGE is the right 
thing to do.

Regarding the madvise=never/madvise/always (sys admin decision), memory 
waste (and nailing down bugs or working around them in customer setups) 
have been very good reasons to let the admin have a word.

> 
> Our code should be auto-tuning.  I posted a long, detailed outline here:
> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/
> 

Well, "auto-tuning" also should be perfect for everybody, but once 
reality strikes you know it isn't.

If people don't feel like using THP, let them have a word. The "madvise" 
config option is probably more controversial. But the "always vs. never" 
absolutely makes sense to me.

>> I remember I raised it already in the past, but you *absolutely* have to
>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
>> example, userfaultfd) that doesn't want the kernel to populate any
>> additional page tables. So if you have to respect that already, then also
>> respect MADV_HUGEPAGE, simple.
> 
> Possibly having uffd enabled on a VMA should disable using large folios,

There are cases where we enable uffd *after* already touching memory 
(postcopy live migration in QEMU being the famous example). That doesn't 
fly.

> I can get behind that.  But the notion that userspace knows what it's
> doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
> know what it's doing.

If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing 
... in some cases. And these include cases I care about messing with 
sparse VM memory :)

I have strong opinions against populating more than required when user 
space sets MADV_NOHUGEPAGE.
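
To make that pattern concrete, a minimal userspace sketch (the size is 
arbitrary and the actual UFFDIO_REGISTER step is omitted; this is only 
meant to show why any extra population behind the fault would be a problem):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 30;		/* 1 GiB of sparse VM memory */
	char *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

	if (mem == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * Opt out of any hugepage-style over-population: the (omitted)
	 * userfaultfd registration relies on untouched pages staying
	 * unmapped.
	 */
	if (madvise(mem, len, MADV_NOHUGEPAGE)) {
		perror("madvise");
		return 1;
	}

	mem[0] = 1;	/* must fault in exactly one base page */
	return 0;
}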

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
@ 2023-07-07 14:07             ` David Hildenbrand
  0 siblings, 0 replies; 167+ messages in thread
From: David Hildenbrand @ 2023-07-07 14:07 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ryan Roberts, Huang, Ying, Andrew Morton, Kirill A. Shutemov,
	Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On 07.07.23 15:57, Matthew Wilcox wrote:
> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
>> On 07.07.23 11:52, Ryan Roberts wrote:
>>> On 07/07/2023 09:01, Huang, Ying wrote:
>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>>>> avoid internal fragmentation completely.  So, I think that finally we
>>>> will need to provide a mechanism for the users to opt out, e.g.,
>>>> something like "always madvise never" via
>>>> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>>>> a good idea to reuse the existing interface of THP.
>>>
>>> I wouldn't want to tie this to the existing interface, simply because that
>>> implies that we would want to follow the "always" and "madvise" advice too; That
>>> means that on a thp=madvise system (which is certainly the case for android and
>>> other client systems) we would have to disable large anon folios for VMAs that
>>> haven't explicitly opted in. That breaks the intention that this should be an
>>> invisible performance boost. I think it's important to set the policy for use of
>>
>> It will never ever be a completely invisible performance boost, just like
>> ordinary THP.
>>
>> Using the exact same existing toggle is the right thing to do. If someone
>> specify "never" or "madvise", then do exactly that.
>>
>> It might make sense to have more modes or additional toggles, but
>> "madvise=never" means no memory waste.
> 
> I hate the existing mechanisms.  They are an abdication of our
> responsibility, and an attempt to blame the user (be it the sysadmin
> or the programmer) of our code for using it wrongly.  We should not
> replicate this mistake.

I don't agree regarding the programmer responsibility. In some cases the 
programmer really doesn't want to get more memory populated than 
requested -- and knows exactly why setting MADV_NOHUGEPAGE is the right 
thing to do.

Regarding the madvise=never/madvise/always (sys admin decision), memory 
waste (and nailing down bugs or working around them in customer setups) 
have been very good reasons to let the admin have a word.

> 
> Our code should be auto-tuning.  I posted a long, detailed outline here:
> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/
> 

Well, "auto-tuning" also should be perfect for everybody, but once 
reality strikes you know it isn't.

If people don't feel like using THP, let them have a word. The "madvise" 
config option is probably more controversial. But the "always vs. never" 
absolutely makes sense to me.

>> I remember I raised it already in the past, but you *absolutely* have to
>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
>> example, userfaultfd) that doesn't want the kernel to populate any
>> additional page tables. So if you have to respect that already, then also
>> respect MADV_HUGEPAGE, simple.
> 
> Possibly having uffd enabled on a VMA should disable using large folios,

There are cases where we enable uffd *after* already touching memory 
(postcopy live migration in QEMU being the famous example). That doesn't 
fly.

> I can get behind that.  But the notion that userspace knows what it's
> doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
> know what it's doing.

If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing 
... in some cases. And these include cases I care about messing with 
sparse VM memory :)

I have strong opinions against populating more than required when user 
space sets MADV_NOHUGEPAGE.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-07 14:07             ` David Hildenbrand
@ 2023-07-07 15:13               ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-07 15:13 UTC (permalink / raw)
  To: David Hildenbrand, Matthew Wilcox
  Cc: Huang, Ying, Andrew Morton, Kirill A. Shutemov, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual,
	Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On 07/07/2023 15:07, David Hildenbrand wrote:
> On 07.07.23 15:57, Matthew Wilcox wrote:
>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
>>> On 07.07.23 11:52, Ryan Roberts wrote:
>>>> On 07/07/2023 09:01, Huang, Ying wrote:
>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>>>>> avoid internal fragmentation completely.  So, I think that finally we
>>>>> will need to provide a mechanism for the users to opt out, e.g.,
>>>>> something like "always madvise never" via
>>>>> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>>>>> a good idea to reuse the existing interface of THP.
>>>>
>>>> I wouldn't want to tie this to the existing interface, simply because that
>>>> implies that we would want to follow the "always" and "madvise" advice too;
>>>> That
>>>> means that on a thp=madvise system (which is certainly the case for android and
>>>> other client systems) we would have to disable large anon folios for VMAs that
>>>> haven't explicitly opted in. That breaks the intention that this should be an
>>>> invisible performance boost. I think it's important to set the policy for
>>>> use of
>>>
>>> It will never ever be a completely invisible performance boost, just like
>>> ordinary THP.
>>>
>>> Using the exact same existing toggle is the right thing to do. If someone
>>> specify "never" or "madvise", then do exactly that.
>>>
>>> It might make sense to have more modes or additional toggles, but
>>> "madvise=never" means no memory waste.
>>
>> I hate the existing mechanisms.  They are an abdication of our
>> responsibility, and an attempt to blame the user (be it the sysadmin
>> or the programmer) of our code for using it wrongly.  We should not
>> replicate this mistake.
> 
> I don't agree regarding the programmer responsibility. In some cases the
> programmer really doesn't want to get more memory populated than requested --
> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do.
> 
> Regarding the madvise=never/madvise/always (sys admin decision), memory waste
> (and nailing down bugs or working around them in customer setups) have been very
> good reasons to let the admin have a word.
> 
>>
>> Our code should be auto-tuning.  I posted a long, detailed outline here:
>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/
>>
> 
> Well, "auto-tuning" also should be perfect for everybody, but once reality
> strikes you know it isn't.
> 
> If people don't feel like using THP, let them have a word. The "madvise" config
> option is probably more controversial. But the "always vs. never" absolutely
> makes sense to me.
> 
>>> I remember I raised it already in the past, but you *absolutely* have to
>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
>>> example, userfaultfd) that doesn't want the kernel to populate any
>>> additional page tables. So if you have to respect that already, then also
>>> respect MADV_HUGEPAGE, simple.
>>
>> Possibly having uffd enabled on a VMA should disable using large folios,
> 
> There are cases where we enable uffd *after* already touching memory (postcopy
> live migration in QEMU being the famous example). That doesn't fly.
> 
>> I can get behind that.  But the notion that userspace knows what it's
>> doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
>> know what it's doing.
> 
> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in
> some cases. And these include cases I care about messing with sparse VM memory :)
> 
> I have strong opinions against populating more than required when user space set
> MADV_NOHUGEPAGE.

I can see your point about honouring MADV_NOHUGEPAGE, so I think that it is
reasonable to fall back to allocating an order-0 page in a VMA that has it set.
The app has gone out of its way to explicitly set it, after all.

I think the correct behaviour for the global thp controls (cmdline and sysfs)
is less obvious though. I could get on board with disabling large anon folios
globally when thp="never". But for other situations, I would prefer to keep
large anon folios enabled (treat "madvise" as "always"), with the argument that
their order is much smaller than traditional THP and therefore the internal
fragmentation is significantly reduced. I really don't want to end up with user
space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large
anon folios.
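
Concretely, the policy I'm describing would look something like the sketch
below; vma_wants_large_anon_folio() and thp_globally_disabled() are names I've
made up purely to show the shape, not code from this series:

static bool vma_wants_large_anon_folio(struct vm_area_struct *vma)
{
	/* Always honour an explicit opt-out from the application. */
	if (vma->vm_flags & VM_NOHUGEPAGE)
		return false;

	/* Global thp="never" disables large anon folios too. */
	if (thp_globally_disabled())
		return false;

	/* Otherwise treat "madvise" as "always": no MADV_HUGEPAGE opt-in needed. */
	return true;
}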

I still feel that it would be better for the thp and large anon folio controls
to be independent though - what's the argument for tying them together?

> 


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
@ 2023-07-07 15:13               ` Ryan Roberts
  0 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-07 15:13 UTC (permalink / raw)
  To: David Hildenbrand, Matthew Wilcox
  Cc: Huang, Ying, Andrew Morton, Kirill A. Shutemov, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual,
	Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On 07/07/2023 15:07, David Hildenbrand wrote:
> On 07.07.23 15:57, Matthew Wilcox wrote:
>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
>>> On 07.07.23 11:52, Ryan Roberts wrote:
>>>> On 07/07/2023 09:01, Huang, Ying wrote:
>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>>>>> avoid internal fragmentation completely.  So, I think that finally we
>>>>> will need to provide a mechanism for the users to opt out, e.g.,
>>>>> something like "always madvise never" via
>>>>> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>>>>> a good idea to reuse the existing interface of THP.
>>>>
>>>> I wouldn't want to tie this to the existing interface, simply because that
>>>> implies that we would want to follow the "always" and "madvise" advice too;
>>>> That
>>>> means that on a thp=madvise system (which is certainly the case for android and
>>>> other client systems) we would have to disable large anon folios for VMAs that
>>>> haven't explicitly opted in. That breaks the intention that this should be an
>>>> invisible performance boost. I think it's important to set the policy for
>>>> use of
>>>
>>> It will never ever be a completely invisible performance boost, just like
>>> ordinary THP.
>>>
>>> Using the exact same existing toggle is the right thing to do. If someone
>>> specify "never" or "madvise", then do exactly that.
>>>
>>> It might make sense to have more modes or additional toggles, but
>>> "madvise=never" means no memory waste.
>>
>> I hate the existing mechanisms.  They are an abdication of our
>> responsibility, and an attempt to blame the user (be it the sysadmin
>> or the programmer) of our code for using it wrongly.  We should not
>> replicate this mistake.
> 
> I don't agree regarding the programmer responsibility. In some cases the
> programmer really doesn't want to get more memory populated than requested --
> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do.
> 
> Regarding the madvise=never/madvise/always (sys admin decision), memory waste
> (and nailing down bugs or working around them in customer setups) have been very
> good reasons to let the admin have a word.
> 
>>
>> Our code should be auto-tuning.  I posted a long, detailed outline here:
>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/
>>
> 
> Well, "auto-tuning" also should be perfect for everybody, but once reality
> strikes you know it isn't.
> 
> If people don't feel like using THP, let them have a word. The "madvise" config
> option is probably more controversial. But the "always vs. never" absolutely
> makes sense to me.
> 
>>> I remember I raised it already in the past, but you *absolutely* have to
>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
>>> example, userfaultfd) that doesn't want the kernel to populate any
>>> additional page tables. So if you have to respect that already, then also
>>> respect MADV_HUGEPAGE, simple.
>>
>> Possibly having uffd enabled on a VMA should disable using large folios,
> 
> There are cases where we enable uffd *after* already touching memory (postcopy
> live migration in QEMU being the famous example). That doesn't fly.
> 
>> I can get behind that.  But the notion that userspace knows what it's
>> doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
>> know what it's doing.
> 
> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in
> some cases. And these include cases I care about messing with sparse VM memory :)
> 
> I have strong opinions against populating more than required when user space set
> MADV_NOHUGEPAGE.

I can see your point about honouring MADV_NOHUGEPAGE, so I think that it is
reasonable to fall back to allocating an order-0 page in a VMA that has it set.
The app has gone out of its way to explicitly set it, after all.

I think the correct behaviour for the global thp controls (cmdline and sysfs)
is less obvious though. I could get on board with disabling large anon folios
globally when thp="never". But for other situations, I would prefer to keep
large anon folios enabled (treat "madvise" as "always"), with the argument that
their order is much smaller than traditional THP and therefore the internal
fragmentation is significantly reduced. I really don't want to end up with user
space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large
anon folios.

I still feel that it would be better for the thp and large anon folio controls
to be independent though - what's the argument for tying them together?

> 



^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-07 15:13               ` Ryan Roberts
@ 2023-07-07 16:06                 ` David Hildenbrand
  -1 siblings, 0 replies; 167+ messages in thread
From: David Hildenbrand @ 2023-07-07 16:06 UTC (permalink / raw)
  To: Ryan Roberts, Matthew Wilcox
  Cc: Huang, Ying, Andrew Morton, Kirill A. Shutemov, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual,
	Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On 07.07.23 17:13, Ryan Roberts wrote:
> On 07/07/2023 15:07, David Hildenbrand wrote:
>> On 07.07.23 15:57, Matthew Wilcox wrote:
>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
>>>> On 07.07.23 11:52, Ryan Roberts wrote:
>>>>> On 07/07/2023 09:01, Huang, Ying wrote:
>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>>>>>> avoid internal fragmentation completely.  So, I think that finally we
>>>>>> will need to provide a mechanism for the users to opt out, e.g.,
>>>>>> something like "always madvise never" via
>>>>>> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>>>>>> a good idea to reuse the existing interface of THP.
>>>>>
>>>>> I wouldn't want to tie this to the existing interface, simply because that
>>>>> implies that we would want to follow the "always" and "madvise" advice too;
>>>>> That
>>>>> means that on a thp=madvise system (which is certainly the case for android and
>>>>> other client systems) we would have to disable large anon folios for VMAs that
>>>>> haven't explicitly opted in. That breaks the intention that this should be an
>>>>> invisible performance boost. I think it's important to set the policy for
>>>>> use of
>>>>
>>>> It will never ever be a completely invisible performance boost, just like
>>>> ordinary THP.
>>>>
>>>> Using the exact same existing toggle is the right thing to do. If someone
>>>> specify "never" or "madvise", then do exactly that.
>>>>
>>>> It might make sense to have more modes or additional toggles, but
>>>> "madvise=never" means no memory waste.
>>>
>>> I hate the existing mechanisms.  They are an abdication of our
>>> responsibility, and an attempt to blame the user (be it the sysadmin
>>> or the programmer) of our code for using it wrongly.  We should not
>>> replicate this mistake.
>>
>> I don't agree regarding the programmer responsibility. In some cases the
>> programmer really doesn't want to get more memory populated than requested --
>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do.
>>
>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste
>> (and nailing down bugs or working around them in customer setups) have been very
>> good reasons to let the admin have a word.
>>
>>>
>>> Our code should be auto-tuning.  I posted a long, detailed outline here:
>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/
>>>
>>
>> Well, "auto-tuning" also should be perfect for everybody, but once reality
>> strikes you know it isn't.
>>
>> If people don't feel like using THP, let them have a word. The "madvise" config
>> option is probably more controversial. But the "always vs. never" absolutely
>> makes sense to me.
>>
>>>> I remember I raised it already in the past, but you *absolutely* have to
>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
>>>> example, userfaultfd) that doesn't want the kernel to populate any
>>>> additional page tables. So if you have to respect that already, then also
>>>> respect MADV_HUGEPAGE, simple.
>>>
>>> Possibly having uffd enabled on a VMA should disable using large folios,
>>
>> There are cases where we enable uffd *after* already touching memory (postcopy
>> live migration in QEMU being the famous example). That doesn't fly.
>>
>>> I can get behind that.  But the notion that userspace knows what it's
>>> doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
>>> know what it's doing.
>>
>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in
>> some cases. And these include cases I care about messing with sparse VM memory :)
>>
>> I have strong opinions against populating more than required when user space set
>> MADV_NOHUGEPAGE.
> 
> I can see your point about honouring MADV_NOHUGEPAGE, so think that it is
> reasonable to fallback to allocating an order-0 page in a VMA that has it set.
> The app has gone out of its way to explicitly set it, after all.
> 
> I think the correct behaviour for the global thp controls (cmdline and sysfs)
> are less obvious though. I could get on board with disabling large anon folios
> globally when thp="never". But for other situations, I would prefer to keep
> large anon folios enabled (treat "madvise" as "always"), with the argument that
> their order is much smaller than traditional THP and therefore the internal
> fragmentation is significantly reduced. I really don't want to end up with user
> space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large
> anon folios.

I was briefly playing with a nasty idea of an additional "madvise-pmd" 
option (that could be the new default), that would use PMD THP only in 
madvise'd regions, and ordinary everywhere else. But let's disregard 
that for now. I think there is a bigger issue (below).

> 
> I still feel that it would be better for the thp and large anon folio controls
> to be independent though - what's the argument for tying them together?

Thinking about desired 2 MiB flexible THP on aarch64 (64k kernel) vs. 2 
MiB PMD THP on aarch64 (4k kernel), how are they any different? Just the 
way they are mapped ...

It's easy to say "64k vs. 2 MiB" is a difference and we want separate 
controls, but how is "2MiB vs. 2 MiB" different?

Having that said, I think we have to make up our mind how much control 
we want to give user space. Again, the "2MiB vs. 2 MiB" case nicely 
shows that it's not trivial: memory waste is a real issue on some 
systems where we limit THP to madvise().


Just throwing it out for discussion:

What about keeping the "all / madvise / never" semantics (and 
MADV_NOHUGEPAGE ...) but having an additional config knob that specifies 
in which cases we *still* allow flexible THP even though the system was 
configured for "madvise".

I can't come up with a good name for that, but something like 
"max_auto_size=64k" could be something reasonable to set. We could have 
an arch+hw specific default.

(we all hate config options, I know, but I think there are good reasons 
to have such bare-minimum ones)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
@ 2023-07-07 16:06                 ` David Hildenbrand
  0 siblings, 0 replies; 167+ messages in thread
From: David Hildenbrand @ 2023-07-07 16:06 UTC (permalink / raw)
  To: Ryan Roberts, Matthew Wilcox
  Cc: Huang, Ying, Andrew Morton, Kirill A. Shutemov, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual,
	Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On 07.07.23 17:13, Ryan Roberts wrote:
> On 07/07/2023 15:07, David Hildenbrand wrote:
>> On 07.07.23 15:57, Matthew Wilcox wrote:
>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
>>>> On 07.07.23 11:52, Ryan Roberts wrote:
>>>>> On 07/07/2023 09:01, Huang, Ying wrote:
>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>>>>>> avoid internal fragmentation completely.  So, I think that finally we
>>>>>> will need to provide a mechanism for the users to opt out, e.g.,
>>>>>> something like "always madvise never" via
>>>>>> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>>>>>> a good idea to reuse the existing interface of THP.
>>>>>
>>>>> I wouldn't want to tie this to the existing interface, simply because that
>>>>> implies that we would want to follow the "always" and "madvise" advice too;
>>>>> That
>>>>> means that on a thp=madvise system (which is certainly the case for android and
>>>>> other client systems) we would have to disable large anon folios for VMAs that
>>>>> haven't explicitly opted in. That breaks the intention that this should be an
>>>>> invisible performance boost. I think it's important to set the policy for
>>>>> use of
>>>>
>>>> It will never ever be a completely invisible performance boost, just like
>>>> ordinary THP.
>>>>
>>>> Using the exact same existing toggle is the right thing to do. If someone
>>>> specify "never" or "madvise", then do exactly that.
>>>>
>>>> It might make sense to have more modes or additional toggles, but
>>>> "madvise=never" means no memory waste.
>>>
>>> I hate the existing mechanisms.  They are an abdication of our
>>> responsibility, and an attempt to blame the user (be it the sysadmin
>>> or the programmer) of our code for using it wrongly.  We should not
>>> replicate this mistake.
>>
>> I don't agree regarding the programmer responsibility. In some cases the
>> programmer really doesn't want to get more memory populated than requested --
>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do.
>>
>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste
>> (and nailing down bugs or working around them in customer setups) have been very
>> good reasons to let the admin have a word.
>>
>>>
>>> Our code should be auto-tuning.  I posted a long, detailed outline here:
>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/
>>>
>>
>> Well, "auto-tuning" also should be perfect for everybody, but once reality
>> strikes you know it isn't.
>>
>> If people don't feel like using THP, let them have a word. The "madvise" config
>> option is probably more controversial. But the "always vs. never" absolutely
>> makes sense to me.
>>
>>>> I remember I raised it already in the past, but you *absolutely* have to
>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
>>>> example, userfaultfd) that doesn't want the kernel to populate any
>>>> additional page tables. So if you have to respect that already, then also
>>>> respect MADV_HUGEPAGE, simple.
>>>
>>> Possibly having uffd enabled on a VMA should disable using large folios,
>>
>> There are cases where we enable uffd *after* already touching memory (postcopy
>> live migration in QEMU being the famous example). That doesn't fly.
>>
>>> I can get behind that.  But the notion that userspace knows what it's
>>> doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
>>> know what it's doing.
>>
>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in
>> some cases. And these include cases I care about messing with sparse VM memory :)
>>
>> I have strong opinions against populating more than required when user space set
>> MADV_NOHUGEPAGE.
> 
> I can see your point about honouring MADV_NOHUGEPAGE, so think that it is
> reasonable to fallback to allocating an order-0 page in a VMA that has it set.
> The app has gone out of its way to explicitly set it, after all.
> 
> I think the correct behaviour for the global thp controls (cmdline and sysfs)
> are less obvious though. I could get on board with disabling large anon folios
> globally when thp="never". But for other situations, I would prefer to keep
> large anon folios enabled (treat "madvise" as "always"), with the argument that
> their order is much smaller than traditional THP and therefore the internal
> fragmentation is significantly reduced. I really don't want to end up with user
> space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large
> anon folios.

I was briefly playing with a nasty idea of an additional "madvise-pmd" 
option (that could be the new default), that would use PMD THP only in 
madvise'd regions, and ordinary everywhere else. But let's disregard 
that for now. I think there is a bigger issue (below).

> 
> I still feel that it would be better for the thp and large anon folio controls
> to be independent though - what's the argument for tying them together?

Thinking about desired 2 MiB flexible THP on aarch64 (64k kernel) vs. 2 
MiB PMD THP on aarch64 (4k kernel), how are they any different? Just the 
way they are mapped ...

It's easy to say "64k vs. 2 MiB" is a difference and we want separate 
controls, but how is "2MiB vs. 2 MiB" different?

Having that said, I think we have to make up our mind how much control 
we want to give user space. Again, the "2MiB vs. 2 MiB" case nicely 
shows that it's not trivial: memory waste is a real issue on some 
systems where we limit THP to madvise().


Just throwing it out for discussion:

What about keeping the "all / madvise / never" semantics (and 
MADV_NOHUGEPAGE ...) but having an additional config knob that specifies 
in which cases we *still* allow flexible THP even though the system was 
configured for "madvise".

I can't come up with a good name for that, but something like 
"max_auto_size=64k" could be something reasonable to set. We could have 
an arch+hw specific default.

(we all hate config options, I know, but I think there are good reasons 
to have such bare-minimum ones)

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-07 16:06                 ` David Hildenbrand
@ 2023-07-07 16:22                   ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-07 16:22 UTC (permalink / raw)
  To: David Hildenbrand, Matthew Wilcox
  Cc: Huang, Ying, Andrew Morton, Kirill A. Shutemov, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual,
	Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On 07/07/2023 17:06, David Hildenbrand wrote:
> On 07.07.23 17:13, Ryan Roberts wrote:
>> On 07/07/2023 15:07, David Hildenbrand wrote:
>>> On 07.07.23 15:57, Matthew Wilcox wrote:
>>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
>>>>> On 07.07.23 11:52, Ryan Roberts wrote:
>>>>>> On 07/07/2023 09:01, Huang, Ying wrote:
>>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>>>>>>> avoid internal fragmentation completely.  So, I think that finally we
>>>>>>> will need to provide a mechanism for the users to opt out, e.g.,
>>>>>>> something like "always madvise never" via
>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>>>>>>> a good idea to reuse the existing interface of THP.
>>>>>>
>>>>>> I wouldn't want to tie this to the existing interface, simply because that
>>>>>> implies that we would want to follow the "always" and "madvise" advice too;
>>>>>> That
>>>>>> means that on a thp=madvise system (which is certainly the case for
>>>>>> android and
>>>>>> other client systems) we would have to disable large anon folios for VMAs
>>>>>> that
>>>>>> haven't explicitly opted in. That breaks the intention that this should be an
>>>>>> invisible performance boost. I think it's important to set the policy for
>>>>>> use of
>>>>>
>>>>> It will never ever be a completely invisible performance boost, just like
>>>>> ordinary THP.
>>>>>
>>>>> Using the exact same existing toggle is the right thing to do. If someone
>>>>> specify "never" or "madvise", then do exactly that.
>>>>>
>>>>> It might make sense to have more modes or additional toggles, but
>>>>> "madvise=never" means no memory waste.
>>>>
>>>> I hate the existing mechanisms.  They are an abdication of our
>>>> responsibility, and an attempt to blame the user (be it the sysadmin
>>>> or the programmer) of our code for using it wrongly.  We should not
>>>> replicate this mistake.
>>>
>>> I don't agree regarding the programmer responsibility. In some cases the
>>> programmer really doesn't want to get more memory populated than requested --
>>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do.
>>>
>>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste
>>> (and nailing down bugs or working around them in customer setups) have been very
>>> good reasons to let the admin have a word.
>>>
>>>>
>>>> Our code should be auto-tuning.  I posted a long, detailed outline here:
>>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/
>>>>
>>>
>>> Well, "auto-tuning" also should be perfect for everybody, but once reality
>>> strikes you know it isn't.
>>>
>>> If people don't feel like using THP, let them have a word. The "madvise" config
>>> option is probably more controversial. But the "always vs. never" absolutely
>>> makes sense to me.
>>>
>>>>> I remember I raised it already in the past, but you *absolutely* have to
>>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
>>>>> example, userfaultfd) that doesn't want the kernel to populate any
>>>>> additional page tables. So if you have to respect that already, then also
>>>>> respect MADV_HUGEPAGE, simple.
>>>>
>>>> Possibly having uffd enabled on a VMA should disable using large folios,
>>>
>>> There are cases where we enable uffd *after* already touching memory (postcopy
>>> live migration in QEMU being the famous example). That doesn't fly.
>>>
>>>> I can get behind that.  But the notion that userspace knows what it's
>>>> doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
>>>> know what it's doing.
>>>
>>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in
>>> some cases. And these include cases I care about messing with sparse VM
>>> memory :)
>>>
>>> I have strong opinions against populating more than required when user space set
>>> MADV_NOHUGEPAGE.
>>
>> I can see your point about honouring MADV_NOHUGEPAGE, so think that it is
>> reasonable to fallback to allocating an order-0 page in a VMA that has it set.
>> The app has gone out of its way to explicitly set it, after all.
>>
>> I think the correct behaviour for the global thp controls (cmdline and sysfs)
>> are less obvious though. I could get on board with disabling large anon folios
>> globally when thp="never". But for other situations, I would prefer to keep
>> large anon folios enabled (treat "madvise" as "always"), with the argument that
>> their order is much smaller than traditional THP and therefore the internal
>> fragmentation is significantly reduced. I really don't want to end up with user
>> space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large
>> anon folios.
> 
> I was briefly playing with a nasty idea of an additional "madvise-pmd" option
> (that could be the new default), that would use PMD THP only in madvise'd
> regions, and ordinary everywhere else. But let's disregard that for now. I think
> there is a bigger issue (below).
> 
>>
>> I still feel that it would be better for the thp and large anon folio controls
>> to be independent though - what's the argument for tying them together?
> 
> Thinking about desired 2 MiB flexible THP on aarch64 (64k kernel) vs, 2 MiB PMD
> THP on aarch64 (4k kernel), how are they any different? Just the way they are
> mapped ...

The last patch in the series shows my current approach to that:

int arch_wants_pte_order(struct vm_area_struct *vma)
{
	if (hugepage_vma_check(vma, vma->vm_flags, false, true, true))
		return CONFIG_ARM64_PTE_ORDER_THP;	/* always the contpte size */
	else
		return CONFIG_ARM64_PTE_ORDER_NOTHP;	/* limited to 64K */
}

But Yu has raised concerns that this type of policy needs to be in the core mm.
So we could have the arch blindly return the preferred order from the HW
perspective (which would be the contpte size for arm64). Then, for
!hugepage_vma_check(), mm could take the min of that value and some determined
"acceptable" limit (which
in my mind is 64K ;-).
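
Roughly, the core-mm side could then look like the sketch below
(anon_folio_order() and ANON_FOLIO_MAX_SIZE_NOTHP are made-up names for
illustration; the hugepage_vma_check() call mirrors the one in the snippet
above):

#define ANON_FOLIO_MAX_SIZE_NOTHP	SZ_64K

static int anon_folio_order(struct vm_area_struct *vma)
{
	/* The arch reports its HW-preferred order (the contpte size on arm64). */
	int order = arch_wants_pte_order(vma);

	/* Outside explicitly THP-enabled VMAs, clamp to the 64K limit. */
	if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
		order = min(order,
			    (int)ilog2(ANON_FOLIO_MAX_SIZE_NOTHP >> PAGE_SHIFT));

	return order;
}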

> 
> It's easy to say "64k vs. 2 MiB" is a difference and we want separate controls,
> but how is "2MiB vs. 2 MiB" different?
> 
> Having that said, I think we have to make up our mind how much control we want
> to give user space. Again, the "2MiB vs. 2 MiB" case nicely shows that it's not
> trivial: memory waste is a real issue on some systems where we limit THP to
> madvise().
> 
> 
> Just throwing it out for discussing:
> 
> What about keeping the "all / madvise / never" semantics (and MADV_NOHUGEPAGE
> ...) but having an additional config knob that specifies in which cases we
> *still* allow flexible THP even though the system was configured for "madvise".
> 
> I can't come up with a good name for that, but something like
> "max_auto_size=64k" could be something reasonable to set. We could have an
> arch+hw specific default.

Ahha, yes, that's essentially what I have above. I personally also like the idea
of the limit being an absolute value rather than an order. Although I know Yu
feels differently (see [1]).

[1]
https://lore.kernel.org/linux-mm/4d4c45a2-0037-71de-b182-f516fee07e67@arm.com/T/#m2aff6eebd7f14d0d0620b48497d26eacecf970e6


> 
> (we all hate config options, I know, but I think there are good reasons to have
> such bare-minimum ones)
> 


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
@ 2023-07-07 16:22                   ` Ryan Roberts
  0 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-07 16:22 UTC (permalink / raw)
  To: David Hildenbrand, Matthew Wilcox
  Cc: Huang, Ying, Andrew Morton, Kirill A. Shutemov, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual,
	Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On 07/07/2023 17:06, David Hildenbrand wrote:
> On 07.07.23 17:13, Ryan Roberts wrote:
>> On 07/07/2023 15:07, David Hildenbrand wrote:
>>> On 07.07.23 15:57, Matthew Wilcox wrote:
>>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
>>>>> On 07.07.23 11:52, Ryan Roberts wrote:
>>>>>> On 07/07/2023 09:01, Huang, Ying wrote:
>>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>>>>>>> avoid internal fragmentation completely.  So, I think that finally we
>>>>>>> will need to provide a mechanism for the users to opt out, e.g.,
>>>>>>> something like "always madvise never" via
>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>>>>>>> a good idea to reuse the existing interface of THP.
>>>>>>
>>>>>> I wouldn't want to tie this to the existing interface, simply because that
>>>>>> implies that we would want to follow the "always" and "madvise" advice too;
>>>>>> That
>>>>>> means that on a thp=madvise system (which is certainly the case for
>>>>>> android and
>>>>>> other client systems) we would have to disable large anon folios for VMAs
>>>>>> that
>>>>>> haven't explicitly opted in. That breaks the intention that this should be an
>>>>>> invisible performance boost. I think it's important to set the policy for
>>>>>> use of
>>>>>
>>>>> It will never ever be a completely invisible performance boost, just like
>>>>> ordinary THP.
>>>>>
>>>>> Using the exact same existing toggle is the right thing to do. If someone
>>>>> specify "never" or "madvise", then do exactly that.
>>>>>
>>>>> It might make sense to have more modes or additional toggles, but
>>>>> "madvise=never" means no memory waste.
>>>>
>>>> I hate the existing mechanisms.  They are an abdication of our
>>>> responsibility, and an attempt to blame the user (be it the sysadmin
>>>> or the programmer) of our code for using it wrongly.  We should not
>>>> replicate this mistake.
>>>
>>> I don't agree regarding the programmer responsibility. In some cases the
>>> programmer really doesn't want to get more memory populated than requested --
>>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do.
>>>
>>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste
>>> (and nailing down bugs or working around them in customer setups) have been very
>>> good reasons to let the admin have a word.
>>>
>>>>
>>>> Our code should be auto-tuning.  I posted a long, detailed outline here:
>>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/
>>>>
>>>
>>> Well, "auto-tuning" also should be perfect for everybody, but once reality
>>> strikes you know it isn't.
>>>
>>> If people don't feel like using THP, let them have a word. The "madvise" config
>>> option is probably more controversial. But the "always vs. never" absolutely
>>> makes sense to me.
>>>
>>>>> I remember I raised it already in the past, but you *absolutely* have to
>>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
>>>>> example, userfaultfd) that doesn't want the kernel to populate any
>>>>> additional page tables. So if you have to respect that already, then also
>>>>> respect MADV_HUGEPAGE, simple.
>>>>
>>>> Possibly having uffd enabled on a VMA should disable using large folios,
>>>
>>> There are cases where we enable uffd *after* already touching memory (postcopy
>>> live migration in QEMU being the famous example). That doesn't fly.
>>>
>>>> I can get behind that.  But the notion that userspace knows what it's
>>>> doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
>>>> know what it's doing.
>>>
>>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in
>>> some cases. And these include cases I care about messing with sparse VM
>>> memory :)
>>>
>>> I have strong opinions against populating more than required when user space set
>>> MADV_NOHUGEPAGE.
>>
>> I can see your point about honouring MADV_NOHUGEPAGE, so think that it is
>> reasonable to fallback to allocating an order-0 page in a VMA that has it set.
>> The app has gone out of its way to explicitly set it, after all.
>>
>> I think the correct behaviour for the global thp controls (cmdline and sysfs)
>> are less obvious though. I could get on board with disabling large anon folios
>> globally when thp="never". But for other situations, I would prefer to keep
>> large anon folios enabled (treat "madvise" as "always"), with the argument that
>> their order is much smaller than traditional THP and therefore the internal
>> fragmentation is significantly reduced. I really don't want to end up with user
>> space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large
>> anon folios.
> 
> I was briefly playing with a nasty idea of an additional "madvise-pmd" option
> (that could be the new default), that would use PMD THP only in madvise'd
> regions, and ordinary everywhere else. But let's disregard that for now. I think
> there is a bigger issue (below).
> 
>>
>> I still feel that it would be better for the thp and large anon folio controls
>> to be independent though - what's the argument for tying them together?
> 
> Thinking about desired 2 MiB flexible THP on aarch64 (64k kernel) vs, 2 MiB PMD
> THP on aarch64 (4k kernel), how are they any different? Just the way they are
> mapped ...

The last patch in the series shows my current approach to that:

int arch_wants_pte_order(struct vm_area_struct *vma)
{
	if (hugepage_vma_check(vma, vma->vm_flags, false, true, true))
		return CONFIG_ARM64_PTE_ORDER_THP;	/* always the contpte size */
	else
		return CONFIG_ARM64_PTE_ORDER_NOTHP;	/* limited to 64K */
}

But Yu has raised concerns that this type of policy needs to be in the core mm.
So we could have the arch blindly return the preferred order from the HW
perspective (which would be the contpte size for arm64). Then, for
!hugepage_vma_check(), mm could take the min of that value and some determined
"acceptable" limit (which
in my mind is 64K ;-).

> 
> It's easy to say "64k vs. 2 MiB" is a difference and we want separate controls,
> but how is "2MiB vs. 2 MiB" different?
> 
> Having that said, I think we have to make up our mind how much control we want
> to give user space. Again, the "2MiB vs. 2 MiB" case nicely shows that it's not
> trivial: memory waste is a real issue on some systems where we limit THP to
> madvise().
> 
> 
> Just throwing it out for discussing:
> 
> What about keeping the "all / madvise / never" semantics (and MADV_NOHUGEPAGE
> ...) but having an additional config knob that specifies in which cases we
> *still* allow flexible THP even though the system was configured for "madvise".
> 
> I can't come up with a good name for that, but something like
> "max_auto_size=64k" could be something reasonable to set. We could have an
> arch+hw specific default.

Ahha, yes, that's essentially what I have above. I personally also like the idea
of the limit being an absolute value rather than an order. Although I know Yu
feels differently (see [1]).

[1]
https://lore.kernel.org/linux-mm/4d4c45a2-0037-71de-b182-f516fee07e67@arm.com/T/#m2aff6eebd7f14d0d0620b48497d26eacecf970e6


> 
> (we all hate config options, I know, but I think there are good reasons to have
> such bare-minimum ones)
> 



^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-07 16:22                   ` Ryan Roberts
@ 2023-07-07 19:06                     ` David Hildenbrand
  -1 siblings, 0 replies; 167+ messages in thread
From: David Hildenbrand @ 2023-07-07 19:06 UTC (permalink / raw)
  To: Ryan Roberts, Matthew Wilcox
  Cc: Huang, Ying, Andrew Morton, Kirill A. Shutemov, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual,
	Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

>>> I still feel that it would be better for the thp and large anon folio controls
>>> to be independent though - what's the argument for tying them together?
>>
>> Thinking about desired 2 MiB flexible THP on aarch64 (64k kernel) vs, 2 MiB PMD
>> THP on aarch64 (4k kernel), how are they any different? Just the way they are
>> mapped ...
> 
> The last patch in the series shows my current approach to that:
> 
> int arch_wants_pte_order(struct vm_area_struct *vma)
> {
> 	if (hugepage_vma_check(vma, vma->vm_flags, false, true, true))
> 		return CONFIG_ARM64_PTE_ORDER_THP; <<< always the contpte size
> 	else
> 		return CONFIG_ARM64_PTE_ORDER_NOTHP; <<< limited to 64K
> }
> 
> But Yu has raised concerns that this type of policy needs to be in the core mm.
> So we could have the arch blindly return the preferred order from HW perspective
> (which would be contpte size for arm64). Then for !hugepage_vma_check(), mm
> could take the min of that value and some determined "acceptable" limit (which
> in my mind is 64K ;-).

Yeah, it's really tricky. Because why should arm64 with 64k base pages 
*not* return 2MiB (which is one possible cont-pte size IIRC) ?

I share the idea that 64k might *currently* on *some platforms* be a 
reasonable choice. But that's where the "fun" begins.
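
(For completeness, the arithmetic behind that, assuming I have the contpte
geometry right: the block is 16 entries with 4K base pages -> 64K, but 32
entries with 64K base pages -> 2 MiB, so "the contpte size" is not a single
number; it scales with the base page size.)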

> 
>>
>> It's easy to say "64k vs. 2 MiB" is a difference and we want separate controls,
>> but how is "2MiB vs. 2 MiB" different?
>>
>> Having that said, I think we have to make up our mind how much control we want
>> to give user space. Again, the "2MiB vs. 2 MiB" case nicely shows that it's not
>> trivial: memory waste is a real issue on some systems where we limit THP to
>> madvise().
>>
>>
>> Just throwing it out for discussing:
>>
>> What about keeping the "all / madvise / never" semantics (and MADV_NOHUGEPAGE
>> ...) but having an additional config knob that specifies in which cases we
>> *still* allow flexible THP even though the system was configured for "madvise".
>>
>> I can't come up with a good name for that, but something like
>> "max_auto_size=64k" could be something reasonable to set. We could have an
>> arch+hw specific default.
> 
> Ahha, yes, that's essentially what I have above. I personally also like the idea
> of the limit being an absolute value rather than an order. Although I know Yu
> feels differently (see [1]).

Exposed to user space I think it should be a human-readable value. 
Inside the kernel, I don't particularly care.

(Having databases/VMs on aarch64 with 64k in mind) I think it might be 
interesting to have something like the following:

thp=madvise
max_auto_size=64k/128k/256k


So in MADV_HUGEPAGE VMAs (such as under QEMU), we'd happily take any 
flexible THP, especially ones < PMD THP (512 MiB) as well. 2 MiB or 4 
MiB THP? sure, give them to my VM. You're barely going to find 512 MiB 
THP either way in practice ....

But for the remainder of my system, just do something reasonable and 
don't go crazy on the memory waste.
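
If it helps the discussion, a sketch of what the store side of such a knob
could look like (everything here, including the max_auto_size name, is
invented for illustration):

/* The fault path would clamp folio sizes to READ_ONCE(flexthp_max_auto_size). */
static unsigned long flexthp_max_auto_size __read_mostly = SZ_64K;

static ssize_t max_auto_size_store(struct kobject *kobj,
				   struct kobj_attribute *attr,
				   const char *buf, size_t count)
{
	/* Accept human-readable sizes: "64k", "128k", "256k", ... */
	unsigned long size = memparse(buf, NULL);

	if (!is_power_of_2(size) || size < PAGE_SIZE || size > HPAGE_PMD_SIZE)
		return -EINVAL;

	WRITE_ONCE(flexthp_max_auto_size, size);
	return count;
}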


I'll try reading all the previous discussions next week.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-07 13:57           ` Matthew Wilcox
@ 2023-07-10  2:49             ` Huang, Ying
  -1 siblings, 0 replies; 167+ messages in thread
From: Huang, Ying @ 2023-07-10  2:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Hildenbrand, Ryan Roberts, Andrew Morton,
	Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas,
	Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel,
	linux-kernel, linux-mm

Matthew Wilcox <willy@infradead.org> writes:

> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
>> On 07.07.23 11:52, Ryan Roberts wrote:
>> > On 07/07/2023 09:01, Huang, Ying wrote:
>> > > Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>> > > avoid internal fragmentation completely.  So, I think that finally we
>> > > will need to provide a mechanism for the users to opt out, e.g.,
>> > > something like "always madvise never" via
>> > > /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>> > > a good idea to reuse the existing interface of THP.
>> > 
>> > I wouldn't want to tie this to the existing interface, simply because that
>> > implies that we would want to follow the "always" and "madvise" advice too; That
>> > means that on a thp=madvise system (which is certainly the case for android and
>> > other client systems) we would have to disable large anon folios for VMAs that
>> > haven't explicitly opted in. That breaks the intention that this should be an
>> > invisible performance boost. I think it's important to set the policy for use of
>> 
>> It will never ever be a completely invisible performance boost, just like
>> ordinary THP.
>> 
>> Using the exact same existing toggle is the right thing to do. If someone
>> specify "never" or "madvise", then do exactly that.
>> 
>> It might make sense to have more modes or additional toggles, but
>> "madvise=never" means no memory waste.
>
> I hate the existing mechanisms.  They are an abdication of our
> responsibility, and an attempt to blame the user (be it the sysadmin
> or the programmer) of our code for using it wrongly.  We should not
> replicate this mistake.
>
> Our code should be auto-tuning.  I posted a long, detailed outline here:
> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/

Yes.  Auto-tuning should be preferable to any configuration mechanism.

Something like the THP shrinker could be another way of auto-tuning:

https://lore.kernel.org/linux-mm/cover.1667454613.git.alexlzhu@fb.com/

That is, allocate the large folios on page fault, then try to detect
internal fragmentation afterwards.
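
Roughly, the detection part of that idea could look like the following (my
paraphrase for illustration, not the code from the series above): walk the
subpages of a large folio and treat a mostly-zero-filled one as a split
candidate.

static bool folio_is_underutilized(struct folio *folio, long max_zero_pages)
{
	long i, nr_zero = 0;

	for (i = 0; i < folio_nr_pages(folio); i++) {
		void *kaddr = kmap_local_folio(folio, i * PAGE_SIZE);

		if (!memchr_inv(kaddr, 0, PAGE_SIZE))
			nr_zero++;
		kunmap_local(kaddr);
	}

	return nr_zero > max_zero_pages;
}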

>> I remember I raised it already in the past, but you *absolutely* have to
>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
>> example, userfaultfd) that doesn't want the kernel to populate any
>> additional page tables. So if you have to respect that already, then also
>> respect MADV_HUGEPAGE, simple.
>
> Possibly having uffd enabled on a VMA should disable using large folios,
> I can get behind that.  But the notion that userspace knows what it's
> doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
> know what it's doing.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-07 15:13               ` Ryan Roberts
@ 2023-07-10  3:03                 ` Huang, Ying
  -1 siblings, 0 replies; 167+ messages in thread
From: Huang, Ying @ 2023-07-10  3:03 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: David Hildenbrand, Matthew Wilcox, Andrew Morton,
	Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas,
	Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel,
	linux-kernel, linux-mm, Johannes Weiner, Alexander Zhu

Ryan Roberts <ryan.roberts@arm.com> writes:

> On 07/07/2023 15:07, David Hildenbrand wrote:
>> On 07.07.23 15:57, Matthew Wilcox wrote:
>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
>>>> On 07.07.23 11:52, Ryan Roberts wrote:
>>>>> On 07/07/2023 09:01, Huang, Ying wrote:
>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>>>>>> avoid internal fragmentation completely.  So, I think that finally we
>>>>>> will need to provide a mechanism for the users to opt out, e.g.,
>>>>>> something like "always madvise never" via
>>>>>> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>>>>>> a good idea to reuse the existing interface of THP.
>>>>>
>>>>> I wouldn't want to tie this to the existing interface, simply because that
>>>>> implies that we would want to follow the "always" and "madvise" advice too;
>>>>> That
>>>>> means that on a thp=madvise system (which is certainly the case for android and
>>>>> other client systems) we would have to disable large anon folios for VMAs that
>>>>> haven't explicitly opted in. That breaks the intention that this should be an
>>>>> invisible performance boost. I think it's important to set the policy for
>>>>> use of
>>>>
>>>> It will never ever be a completely invisible performance boost, just like
>>>> ordinary THP.
>>>>
>>>> Using the exact same existing toggle is the right thing to do. If someone
>>>> specify "never" or "madvise", then do exactly that.
>>>>
>>>> It might make sense to have more modes or additional toggles, but
>>>> "madvise=never" means no memory waste.
>>>
>>> I hate the existing mechanisms.  They are an abdication of our
>>> responsibility, and an attempt to blame the user (be it the sysadmin
>>> or the programmer) of our code for using it wrongly.  We should not
>>> replicate this mistake.
>> 
>> I don't agree regarding the programmer responsibility. In some cases the
>> programmer really doesn't want to get more memory populated than requested --
>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do.
>> 
>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste
>> (and nailing down bugs or working around them in customer setups) have been very
>> good reasons to let the admin have a word.
>> 
>>>
>>> Our code should be auto-tuning.  I posted a long, detailed outline here:
>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/
>>>
>> 
>> Well, "auto-tuning" also should be perfect for everybody, but once reality
>> strikes you know it isn't.
>> 
>> If people don't feel like using THP, let them have a word. The "madvise" config
>> option is probably more controversial. But the "always vs. never" absolutely
>> makes sense to me.
>> 
>>>> I remember I raised it already in the past, but you *absolutely* have to
>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
>>>> example, userfaultfd) that doesn't want the kernel to populate any
>>>> additional page tables. So if you have to respect that already, then also
>>>> respect MADV_HUGEPAGE, simple.
>>>
>>> Possibly having uffd enabled on a VMA should disable using large folios,
>> 
>> There are cases where we enable uffd *after* already touching memory (postcopy
>> live migration in QEMU being the famous example). That doesn't fly.
>> 
>>> I can get behind that.  But the notion that userspace knows what it's
>>> doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
>>> know what it's doing.
>> 
>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in
>> some cases. And these include cases I care about messing with sparse VM memory :)
>> 
>> I have strong opinions against populating more than required when user space set
>> MADV_NOHUGEPAGE.
>
> I can see your point about honouring MADV_NOHUGEPAGE, so think that it is
> reasonable to fallback to allocating an order-0 page in a VMA that has it set.
> The app has gone out of its way to explicitly set it, after all.
>
> I think the correct behaviour for the global thp controls (cmdline and sysfs)
> are less obvious though. I could get on board with disabling large anon folios
> globally when thp="never". But for other situations, I would prefer to keep
> large anon folios enabled (treat "madvise" as "always"),

If we have some mechanism to auto-tune large folio usage, for example,
detecting internal fragmentation and splitting the large folio, then we
can use thp="always" as the default configuration.  If I remember
correctly, this is what Johannes and Alexander are working on.

> with the argument that
> their order is much smaller than traditional THP and therefore the internal
> fragmentation is significantly reduced.

Do you have any data for this?

> I really don't want to end up with user
> space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large
> anon folios.
>
> I still feel that it would be better for the thp and large anon folio controls
> to be independent though - what's the argument for tying them together?
>

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 2/5] mm: Allow deferred splitting of arbitrary large anon folios
  2023-07-07  9:42       ` Ryan Roberts
@ 2023-07-10  5:37         ` Huang, Ying
  -1 siblings, 0 replies; 167+ messages in thread
From: Huang, Ying @ 2023-07-10  5:37 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

Ryan Roberts <ryan.roberts@arm.com> writes:

> Somehow I managed to reply only to the linux-arm-kernel list on first attempt so
> resending:
>
> On 07/07/2023 09:21, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> 
>>> With the introduction of large folios for anonymous memory, we would
>>> like to be able to split them when they have unmapped subpages, in order
>>> to free those unused pages under memory pressure. So remove the
>>> artificial requirement that the large folio needed to be at least
>>> PMD-sized.
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> Reviewed-by: Yu Zhao <yuzhao@google.com>
>>> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
>>> ---
>>>  mm/rmap.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 82ef5ba363d1..bbcb2308a1c5 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1474,7 +1474,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
>>>  		 * page of the folio is unmapped and at least one page
>>>  		 * is still mapped.
>>>  		 */
>>> -		if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
>>> +		if (folio_test_large(folio) && folio_test_anon(folio))
>>>  			if (!compound || nr < nr_pmdmapped)
>>>  				deferred_split_folio(folio);
>>>  	}
>> 
>> One possible issue is that even for large folios mapped only in one
>> process, in zap_pte_range(), we will always call deferred_split_folio()
>> unnecessarily before freeing a large folio.
>
> Hi Huang, thanks for reviewing!
>
> I have a patch that solves this problem by determining a range of ptes covered
> by a single folio and doing a "batch zap". This prevents the need to add the
> folio to the deferred split queue, only to remove it again shortly afterwards.
> This reduces lock contention and I can measure a performance improvement for the
> kernel compilation benchmark. See [1].
>
> However, I decided to remove it from this patch set on Yu Zhao's advice. We are
> aiming for the minimal patch set to start with and wanted to focus people on
> that. I intend to submit it separately later on.
>
> [1] https://lore.kernel.org/linux-mm/20230626171430.3167004-8-ryan.roberts@arm.com/

Thanks for your information!  "batch zap" can solve the problem.

And I agree with Matthew's comment in the following email that the large
folio interaction issues should be fixed before merging the patches that
allocate large folios.

https://lore.kernel.org/linux-mm/ZKVdUDuwNWDUCWc5@casper.infradead.org/

If so, we don't need to introduce the above problem or a large patchset.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 2/5] mm: Allow deferred splitting of arbitrary large anon folios
  2023-07-10  5:37         ` Huang, Ying
@ 2023-07-10  8:29           ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-10  8:29 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On 10/07/2023 06:37, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> Somehow I managed to reply only to the linux-arm-kernel list on first attempt so
>> resending:
>>
>> On 07/07/2023 09:21, Huang, Ying wrote:
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
>>>> With the introduction of large folios for anonymous memory, we would
>>>> like to be able to split them when they have unmapped subpages, in order
>>>> to free those unused pages under memory pressure. So remove the
>>>> artificial requirement that the large folio needed to be at least
>>>> PMD-sized.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> Reviewed-by: Yu Zhao <yuzhao@google.com>
>>>> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
>>>> ---
>>>>  mm/rmap.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 82ef5ba363d1..bbcb2308a1c5 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -1474,7 +1474,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
>>>>  		 * page of the folio is unmapped and at least one page
>>>>  		 * is still mapped.
>>>>  		 */
>>>> -		if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
>>>> +		if (folio_test_large(folio) && folio_test_anon(folio))
>>>>  			if (!compound || nr < nr_pmdmapped)
>>>>  				deferred_split_folio(folio);
>>>>  	}
>>>
>>> One possible issue is that even for large folios mapped only in one
>>> process, in zap_pte_range(), we will always call deferred_split_folio()
>>> unnecessarily before freeing a large folio.
>>
>> Hi Huang, thanks for reviewing!
>>
>> I have a patch that solves this problem by determining a range of ptes covered
>> by a single folio and doing a "batch zap". This prevents the need to add the
>> folio to the deferred split queue, only to remove it again shortly afterwards.
>> This reduces lock contention and I can measure a performance improvement for the
>> kernel compilation benchmark. See [1].
>>
>> However, I decided to remove it from this patch set on Yu Zhao's advice. We are
>> aiming for the minimal patch set to start with and wanted to focus people on
>> that. I intend to submit it separately later on.
>>
>> [1] https://lore.kernel.org/linux-mm/20230626171430.3167004-8-ryan.roberts@arm.com/
> 
> Thanks for your information!  "batch zap" can solve the problem.
> 
> And I agree with Matthew's comment in the following email that the large
> folio interaction issues should be fixed before merging the patches that
> allocate large folios.
> 
> https://lore.kernel.org/linux-mm/ZKVdUDuwNWDUCWc5@casper.infradead.org/
> 
> If so, we don't need to introduce the above problem or a large patchset.

I appreciate Matthew's and others' position about not wanting to merge a
minimal implementation while there are some fundamental features (e.g.
compaction) it doesn't play well with - I'm working to create a definitive
list so these items can be tracked and tackled.

That said, I don't see this "batch zap" patch as an example of that. It's just
a performance enhancement that improves things even beyond what large anon
folios achieve on their own. I'd rather concentrate on the core changes first,
then deal with this type of thing later. Does that work for you?
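
For anyone skimming, the rough shape of that "batch zap" (a sketch with
made-up names, not the actual patch linked above) is to walk forward from the
first pte mapping the folio and count how many consecutive ptes still map it,
so zap_pte_range() can free them as one batch:

static int count_folio_ptes(struct folio *folio, pte_t *ptep,
			    unsigned long addr, unsigned long end)
{
	unsigned long pfn = folio_pfn(folio);
	int nr = 0;

	/* Assumes the folio is mapped naturally, starting at its first page. */
	while (addr < end && nr < folio_nr_pages(folio)) {
		pte_t pte = ptep_get(ptep + nr);

		if (!pte_present(pte) || pte_pfn(pte) != pfn + nr)
			break;
		nr++;
		addr += PAGE_SIZE;
	}

	return nr;
}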

> 
> Best Regards,
> Huang, Ying


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-07 19:06                     ` David Hildenbrand
@ 2023-07-10  8:41                       ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-10  8:41 UTC (permalink / raw)
  To: David Hildenbrand, Matthew Wilcox
  Cc: Huang, Ying, Andrew Morton, Kirill A. Shutemov, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual,
	Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On 07/07/2023 20:06, David Hildenbrand wrote:
>>>> I still feel that it would be better for the thp and large anon folio controls
>>>> to be independent though - what's the argument for tying them together?
>>>
>>> Thinking about desired 2 MiB flexible THP on aarch64 (64k kernel) vs. 2 MiB PMD
>>> THP on aarch64 (4k kernel), how are they any different? Just the way they are
>>> mapped ...
>>
>> The last patch in the series shows my current approach to that:
>>
>> int arch_wants_pte_order(struct vm_area_struct *vma)
>> {
>>     if (hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>>         return CONFIG_ARM64_PTE_ORDER_THP; <<< always the contpte size
>>     else
>>         return CONFIG_ARM64_PTE_ORDER_NOTHP; <<< limited to 64K
>> }
>>
>> But Yu has raised concerns that this type of policy needs to be in the core mm.
>> So we could have the arch blindly return the preferred order from HW perspective
>> (which would be contpte size for arm64). Then for !hugepage_vma_check(), mm
>> could take the min of that value and some determined "acceptable" limit (which
>> in my mind is 64K ;-).
> 
> Yeah, it's really tricky. Because why should arm64 with 64k base pages *not*
> return 2MiB (which is one possible cont-pte size IIRC) ?
> 
> I share the idea that 64k might *currently* on *some platforms* be a reasonable
> choice. But that's where the "fun" begins.
> 
>>
>>>
>>> It's easy to say "64k vs. 2 MiB" is a difference and we want separate controls,
>>> but how is "2MiB vs. 2 MiB" different?
>>>
>>> Having that said, I think we have to make up our mind how much control we want
>>> to give user space. Again, the "2MiB vs. 2 MiB" case nicely shows that it's not
>>> trivial: memory waste is a real issue on some systems where we limit THP to
>>> madvise().
>>>
>>>
>>> Just throwing it out for discussing:
>>>
>>> What about keeping the "all / madvise / never" semantics (and MADV_NOHUGEPAGE
>>> ...) but having an additional config knob that specifies in which cases we
>>> *still* allow flexible THP even though the system was configured for "madvise".
>>>
>>> I can't come up with a good name for that, but something like
>>> "max_auto_size=64k" could be something reasonable to set. We could have an
>>> arch+hw specific default.
>>
>> Ahha, yes, that's essentially what I have above. I personally also like the idea
>> of the limit being an absolute value rather than an order. Although I know Yu
>> feels differently (see [1]).
> 
> Exposed to user space I think it should be a human-readable value. Inside the
> kernel, I don't particularly care.

My point was less about human-readable vs. not. It was about expressing a value
that is relative to the base page size vs. expressing a value that is independent
of the base page size. If the concern is about limiting internal fragmentation, I
think it's the absolute size that matters.
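
A concrete example of what I mean (illustrative arithmetic only): a fixed 64K
absolute limit maps to a different order depending on the base page size,
whereas a fixed order maps to a different absolute size:

	int max_order = ilog2(SZ_64K >> PAGE_SHIFT);
	/* 4K  base pages: ilog2(16) = 4  (16 x 4K  = 64K) */
	/* 16K base pages: ilog2(4)  = 2  ( 4 x 16K = 64K) */
	/* 64K base pages: ilog2(1)  = 0  ( 1 x 64K = 64K) */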

> 
> (Having databases/VMs on aarch64 with 64k in mind) I think it might be
> interesting to have something like the following:
> 
> thp=madvise
> max_auto_size=64k/128k/256k
> 
> 
> So in MADV_HUGEPAGE VMAs (such as under QEMU), we'd happily take any flexible
> THP, especially ones < PMD THP (512 MiB) as well. 2 MiB or 4 MiB THP? sure, give
> them to my VM. You're barely going to find 512 MiB THP either way in practice ....
> 
> But for the remainder of my system, just do something reasonable and don't go
> crazy on the memory waste.

Yep, we're on the same page. I've got a v3 that's almost ready to go, based on
Yu's previous round of review. I'm going to incorporate this mechanism into it,
then post hopefully later in the week. Now I just need to figure out a decent
name for the max_auto_size control...

> 
> 
> I'll try reading all the previous discussions next week.
> 


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-10  3:03                 ` Huang, Ying
@ 2023-07-10  8:55                   ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-10  8:55 UTC (permalink / raw)
  To: Huang, Ying
  Cc: David Hildenbrand, Matthew Wilcox, Andrew Morton,
	Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas,
	Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel,
	linux-kernel, linux-mm, Johannes Weiner, Alexander Zhu

On 10/07/2023 04:03, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> On 07/07/2023 15:07, David Hildenbrand wrote:
>>> On 07.07.23 15:57, Matthew Wilcox wrote:
>>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
>>>>> On 07.07.23 11:52, Ryan Roberts wrote:
>>>>>> On 07/07/2023 09:01, Huang, Ying wrote:
>>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>>>>>>> avoid internal fragmentation completely.  So, I think that finally we
>>>>>>> will need to provide a mechanism for the users to opt out, e.g.,
>>>>>>> something like "always madvise never" via
>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>>>>>>> a good idea to reuse the existing interface of THP.
>>>>>>
>>>>>> I wouldn't want to tie this to the existing interface, simply because that
>>>>>> implies that we would want to follow the "always" and "madvise" advice too;
>>>>>> That
>>>>>> means that on a thp=madvise system (which is certainly the case for android and
>>>>>> other client systems) we would have to disable large anon folios for VMAs that
>>>>>> haven't explicitly opted in. That breaks the intention that this should be an
>>>>>> invisible performance boost. I think it's important to set the policy for
>>>>>> use of
>>>>>
>>>>> It will never ever be a completely invisible performance boost, just like
>>>>> ordinary THP.
>>>>>
>>>>> Using the exact same existing toggle is the right thing to do. If someone
>>>>> specify "never" or "madvise", then do exactly that.
>>>>>
>>>>> It might make sense to have more modes or additional toggles, but
>>>>> "madvise=never" means no memory waste.
>>>>
>>>> I hate the existing mechanisms.  They are an abdication of our
>>>> responsibility, and an attempt to blame the user (be it the sysadmin
>>>> or the programmer) of our code for using it wrongly.  We should not
>>>> replicate this mistake.
>>>
>>> I don't agree regarding the programmer responsibility. In some cases the
>>> programmer really doesn't want to get more memory populated than requested --
>>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do.
>>>
>>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste
>>> (and nailing down bugs or working around them in customer setups) have been very
>>> good reasons to let the admin have a word.
>>>
>>>>
>>>> Our code should be auto-tuning.  I posted a long, detailed outline here:
>>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/
>>>>
>>>
>>> Well, "auto-tuning" also should be perfect for everybody, but once reality
>>> strikes you know it isn't.
>>>
>>> If people don't feel like using THP, let them have a word. The "madvise" config
>>> option is probably more controversial. But the "always vs. never" absolutely
>>> makes sense to me.
>>>
>>>>> I remember I raised it already in the past, but you *absolutely* have to
>>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
>>>>> example, userfaultfd) that doesn't want the kernel to populate any
>>>>> additional page tables. So if you have to respect that already, then also
>>>>> respect MADV_HUGEPAGE, simple.
>>>>
>>>> Possibly having uffd enabled on a VMA should disable using large folios,
>>>
>>> There are cases where we enable uffd *after* already touching memory (postcopy
>>> live migration in QEMU being the famous example). That doesn't fly.
>>>
>>>> I can get behind that.  But the notion that userspace knows what it's
>>>> doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
>>>> know what it's doing.
>>>
>>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in
>>> some cases. And these include cases I care about messing with sparse VM memory :)
>>>
>>> I have strong opinions against populating more than required when user space set
>>> MADV_NOHUGEPAGE.
>>
>> I can see your point about honouring MADV_NOHUGEPAGE, so think that it is
>> reasonable to fallback to allocating an order-0 page in a VMA that has it set.
>> The app has gone out of its way to explicitly set it, after all.
>>
>> I think the correct behaviour for the global thp controls (cmdline and sysfs)
>> are less obvious though. I could get on board with disabling large anon folios
>> globally when thp="never". But for other situations, I would prefer to keep
>> large anon folios enabled (treat "madvise" as "always"),
> 
> If we have some mechanism to auto-tune large folio usage, for example,
> detecting internal fragmentation and splitting the large folio, then we
> can use thp="always" as the default configuration.  If I remember
> correctly, this is what Johannes and Alexander are working on.

Could you point me to that work? I'd like to understand what the mechanism is.
The other half of my work aims to use arm64's pte "contiguous bit" to tell the
HW that a span of PTEs shares the same mapping and can therefore be coalesced
into a single TLB entry. The side effect of this, however, is that we only have a
single access and dirty bit for the whole contpte extent. So I'd like to avoid
any mechanism that relies on getting access/dirty at the base page granularity
for a large folio.
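
To illustrate the constraint (a simplified sketch, not the real arm64 code):
once a block of ptes carries the contiguous hint, the HW-managed access/dirty
state is only meaningful for the block as a whole, so a per-folio "young"
check effectively has to gather the bit across the whole block:

static bool contpte_block_young(pte_t *ptep, int nr_contpte)
{
	int i;

	for (i = 0; i < nr_contpte; i++)
		if (pte_young(ptep[i]))
			return true;

	return false;
}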

> 
>> with the argument that
>> their order is much smaller than traditional THP and therefore the internal
>> fragmentation is significantly reduced.
> 
> Do you have any data for this?

Some; it's partly based on the intuition that the smaller the allocation unit,
the smaller the internal fragmentation, and partly on peak memory usage data
I've collected for the benchmarks I'm running, comparing a baseline-4k kernel
with baseline-16k and baseline-64k kernels, along with a 4k kernel that supports
large anon folios (I appreciate that's not exactly what we are talking about
here, and it's not exactly an extensive set of results!):


Kernel Compilation with 8 Jobs:
| kernel        |   peak |
|:--------------|-------:|
| baseline-4k   |   0.0% |
| anonfolio     |   0.1% |
| baseline-16k  |   6.3% |
| baseline-64k  |  28.1% |


Kernel Compilation with 80 Jobs:
| kernel        |   peak |
|:--------------|-------:|
| baseline-4k   |   0.0% |
| anonfolio     |   1.7% |
| baseline-16k  |   2.6% |
| baseline-64k  |  12.3% |



> 
>> I really don't want to end up with user
>> space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large
>> anon folios.
>>
>> I still feel that it would be better for the thp and large anon folio controls
>> to be independent though - what's the argument for tying them together?
>>
> 
> Best Regards,
> Huang, Ying
> 


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 2/5] mm: Allow deferred splitting of arbitrary large anon folios
  2023-07-10  8:29           ` Ryan Roberts
@ 2023-07-10  9:01             ` Huang, Ying
  -1 siblings, 0 replies; 167+ messages in thread
From: Huang, Ying @ 2023-07-10  9:01 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

Ryan Roberts <ryan.roberts@arm.com> writes:

> On 10/07/2023 06:37, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> 
>>> Somehow I managed to reply only to the linux-arm-kernel list on first attempt so
>>> resending:
>>>
>>> On 07/07/2023 09:21, Huang, Ying wrote:
>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>
>>>>> With the introduction of large folios for anonymous memory, we would
>>>>> like to be able to split them when they have unmapped subpages, in order
>>>>> to free those unused pages under memory pressure. So remove the
>>>>> artificial requirement that the large folio needed to be at least
>>>>> PMD-sized.
>>>>>
>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>> Reviewed-by: Yu Zhao <yuzhao@google.com>
>>>>> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
>>>>> ---
>>>>>  mm/rmap.c | 2 +-
>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>> index 82ef5ba363d1..bbcb2308a1c5 100644
>>>>> --- a/mm/rmap.c
>>>>> +++ b/mm/rmap.c
>>>>> @@ -1474,7 +1474,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
>>>>>  		 * page of the folio is unmapped and at least one page
>>>>>  		 * is still mapped.
>>>>>  		 */
>>>>> -		if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
>>>>> +		if (folio_test_large(folio) && folio_test_anon(folio))
>>>>>  			if (!compound || nr < nr_pmdmapped)
>>>>>  				deferred_split_folio(folio);
>>>>>  	}
>>>>
>>>> One possible issue is that even for large folios mapped only in one
>>>> process, in zap_pte_range(), we will always call deferred_split_folio()
>>>> unnecessarily before freeing a large folio.
>>>
>>> Hi Huang, thanks for reviewing!
>>>
>>> I have a patch that solves this problem by determining a range of ptes covered
>>> by a single folio and doing a "batch zap". This prevents the need to add the
>>> folio to the deferred split queue, only to remove it again shortly afterwards.
>>> This reduces lock contention and I can measure a performance improvement for the
>>> kernel compilation benchmark. See [1].
>>>
>>> However, I decided to remove it from this patch set on Yu Zhao's advice. We are
>>> aiming for the minimal patch set to start with and wanted to focus people on
>>> that. I intend to submit it separately later on.
>>>
>>> [1] https://lore.kernel.org/linux-mm/20230626171430.3167004-8-ryan.roberts@arm.com/
>> 
>> Thanks for your information!  "batch zap" can solve the problem.
>> 
>> And, I agree with Matthew's comments to fix the large folios interaction
>> issues before merging the patches to allocate large folios as in the
>> following email.
>> 
>> https://lore.kernel.org/linux-mm/ZKVdUDuwNWDUCWc5@casper.infradead.org/
>> 
>> If so, we don't need to introduce the above problem or a large patchset.
>
> I appreciate Matthew's and others' position about not wanting to merge a minimal
> implementation while there are some fundamental features (e.g. compaction) it
> doesn't play well with - I'm working to create a definitive list so these items
> can be tracked and tackled.

Good to know this, Thanks!

> That said, I don't see this "batch zap" patch as an example of this. It's just a
> performance enhancement that improves things even further than large anon folios
> on their own. I'd rather concentrate on the core changes first then deal with
> this type of thing later. Does that work for you?

IIUC, allocating large folios upon page fault depends on splitting large
folios in page_remove_rmap() to avoid memory wastage.  Splitting large
folios in page_remove_rmap() depends on "batch zap" to avoid a performance
regression in zap_pte_range().  So we need them to be done earlier.  Or
am I missing something?

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-10  8:55                   ` Ryan Roberts
@ 2023-07-10  9:18                     ` Huang, Ying
  -1 siblings, 0 replies; 167+ messages in thread
From: Huang, Ying @ 2023-07-10  9:18 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: David Hildenbrand, Matthew Wilcox, Andrew Morton,
	Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas,
	Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel,
	linux-kernel, linux-mm, Johannes Weiner, Alexander Zhu

Ryan Roberts <ryan.roberts@arm.com> writes:

> On 10/07/2023 04:03, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> 
>>> On 07/07/2023 15:07, David Hildenbrand wrote:
>>>> On 07.07.23 15:57, Matthew Wilcox wrote:
>>>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
>>>>>> On 07.07.23 11:52, Ryan Roberts wrote:
>>>>>>> On 07/07/2023 09:01, Huang, Ying wrote:
>>>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>>>>>>>> avoid internal fragmentation completely.  So, I think that finally we
>>>>>>>> will need to provide a mechanism for the users to opt out, e.g.,
>>>>>>>> something like "always madvise never" via
>>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>>>>>>>> a good idea to reuse the existing interface of THP.
>>>>>>>
>>>>>>> I wouldn't want to tie this to the existing interface, simply because that
>>>>>>> implies that we would want to follow the "always" and "madvise" advice too;
>>>>>>> That
>>>>>>> means that on a thp=madvise system (which is certainly the case for android and
>>>>>>> other client systems) we would have to disable large anon folios for VMAs that
>>>>>>> haven't explicitly opted in. That breaks the intention that this should be an
>>>>>>> invisible performance boost. I think it's important to set the policy for
>>>>>>> use of
>>>>>>
>>>>>> It will never ever be a completely invisible performance boost, just like
>>>>>> ordinary THP.
>>>>>>
>>>>>> Using the exact same existing toggle is the right thing to do. If someone
>>>>>> specify "never" or "madvise", then do exactly that.
>>>>>>
>>>>>> It might make sense to have more modes or additional toggles, but
>>>>>> "madvise=never" means no memory waste.
>>>>>
>>>>> I hate the existing mechanisms.  They are an abdication of our
>>>>> responsibility, and an attempt to blame the user (be it the sysadmin
>>>>> or the programmer) of our code for using it wrongly.  We should not
>>>>> replicate this mistake.
>>>>
>>>> I don't agree regarding the programmer responsibility. In some cases the
>>>> programmer really doesn't want to get more memory populated than requested --
>>>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do.
>>>>
>>>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste
>>>> (and nailing down bugs or working around them in customer setups) have been very
>>>> good reasons to let the admin have a word.
>>>>
>>>>>
>>>>> Our code should be auto-tuning.  I posted a long, detailed outline here:
>>>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/
>>>>>
>>>>
>>>> Well, "auto-tuning" also should be perfect for everybody, but once reality
>>>> strikes you know it isn't.
>>>>
>>>> If people don't feel like using THP, let them have a word. The "madvise" config
>>>> option is probably more controversial. But the "always vs. never" absolutely
>>>> makes sense to me.
>>>>
>>>>>> I remember I raised it already in the past, but you *absolutely* have to
>>>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
>>>>>> example, userfaultfd) that doesn't want the kernel to populate any
>>>>>> additional page tables. So if you have to respect that already, then also
>>>>>> respect MADV_HUGEPAGE, simple.
>>>>>
>>>>> Possibly having uffd enabled on a VMA should disable using large folios,
>>>>
>>>> There are cases where we enable uffd *after* already touching memory (postcopy
>>>> live migration in QEMU being the famous example). That doesn't fly.
>>>>
>>>>> I can get behind that.  But the notion that userspace knows what it's
>>>>> doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
>>>>> know what it's doing.
>>>>
>>>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in
>>>> some cases. And these include cases I care about messing with sparse VM memory :)
>>>>
>>>> I have strong opinions against populating more than required when user space set
>>>> MADV_NOHUGEPAGE.
>>>
>>> I can see your point about honouring MADV_NOHUGEPAGE, so think that it is
>>> reasonable to fallback to allocating an order-0 page in a VMA that has it set.
>>> The app has gone out of its way to explicitly set it, after all.
>>>
>>> I think the correct behaviour for the global thp controls (cmdline and sysfs)
>>> are less obvious though. I could get on board with disabling large anon folios
>>> globally when thp="never". But for other situations, I would prefer to keep
>>> large anon folios enabled (treat "madvise" as "always"),
>> 
>> If we have some mechanism to auto-tune the large folios usage, for
>> example, detect the internal fragmentation and split the large folio,
>> then we can use thp="always" as default configuration.  If my memory
>> were correct, this is what Johannes and Alexander is working on.
>
> Could you point me to that work? I'd like to understand what the mechanism is.
> The other half of my work aims to use arm64's pte "contiguous bit" to tell the
> HW that a span of PTEs share the same mapping and is therefore coalesced into a
> single TLB entry. The side effect of this, however, is that we only have a
> single access and dirty bit for the whole contpte extent. So I'd like to avoid
> any mechanism that relies on getting access/dirty at the base page granularity
> for a large folio.

Please take a look at the THP shrinker patchset,

https://lore.kernel.org/linux-mm/cover.1667454613.git.alexlzhu@fb.com/

>> 
>>> with the argument that
>>> their order is much smaller than traditional THP and therefore the internal
>>> fragmentation is significantly reduced.
>> 
>> Do you have any data for this?
>
> Some; it's partly based on intuition that the smaller the allocation unit, the
> smaller the internal fragmentation. And partly on peak memory usage data I've
> collected for the benchmarks I'm running, comparing baseline-4k kernel with
> baseline-16k and baseline-64k kernels along with a 4k kernel that supports large
> anon folios (I appreciate that's not exactly what we are talking about here, and
> it's not exactly an extensive set of results!):
>
>
> Kernel Compilation with 8 Jobs:
> | kernel        |   peak |
> |:--------------|-------:|
> | baseline-4k   |   0.0% |
> | anonfolio     |   0.1% |
> | baseline-16k  |   6.3% |
> | baseline-64k  |  28.1% |
>
>
> Kernel Compilation with 80 Jobs:
> | kernel        |   peak |
> |:--------------|-------:|
> | baseline-4k   |   0.0% |
> | anonfolio     |   1.7% |
> | baseline-16k  |   2.6% |
> | baseline-64k  |  12.3% |
>

Why is anonfolio better than baseline-64k if you always allocate 64k
anonymous folios?  Because the page cache uses 64k pages in baseline-64k?

We may need to test some workloads with sparse access patterns too.

Best Regards,
Huang, Ying

>> 
>>> I really don't want to end up with user
>>> space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large
>>> anon folios.
>>>
>>> I still feel that it would be better for the thp and large anon folio controls
>>> to be independent though - what's the argument for tying them together?
>>>


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-10  9:18                     ` Huang, Ying
@ 2023-07-10  9:25                       ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-10  9:25 UTC (permalink / raw)
  To: Huang, Ying
  Cc: David Hildenbrand, Matthew Wilcox, Andrew Morton,
	Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas,
	Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel,
	linux-kernel, linux-mm, Johannes Weiner, Alexander Zhu

On 10/07/2023 10:18, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> On 10/07/2023 04:03, Huang, Ying wrote:
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
>>>> On 07/07/2023 15:07, David Hildenbrand wrote:
>>>>> On 07.07.23 15:57, Matthew Wilcox wrote:
>>>>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
>>>>>>> On 07.07.23 11:52, Ryan Roberts wrote:
>>>>>>>> On 07/07/2023 09:01, Huang, Ying wrote:
>>>>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>>>>>>>>> avoid internal fragmentation completely.  So, I think that finally we
>>>>>>>>> will need to provide a mechanism for the users to opt out, e.g.,
>>>>>>>>> something like "always madvise never" via
>>>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>>>>>>>>> a good idea to reuse the existing interface of THP.
>>>>>>>>
>>>>>>>> I wouldn't want to tie this to the existing interface, simply because that
>>>>>>>> implies that we would want to follow the "always" and "madvise" advice too;
>>>>>>>> That
>>>>>>>> means that on a thp=madvise system (which is certainly the case for android and
>>>>>>>> other client systems) we would have to disable large anon folios for VMAs that
>>>>>>>> haven't explicitly opted in. That breaks the intention that this should be an
>>>>>>>> invisible performance boost. I think it's important to set the policy for
>>>>>>>> use of
>>>>>>>
>>>>>>> It will never ever be a completely invisible performance boost, just like
>>>>>>> ordinary THP.
>>>>>>>
>>>>>>> Using the exact same existing toggle is the right thing to do. If someone
>>>>>>> specify "never" or "madvise", then do exactly that.
>>>>>>>
>>>>>>> It might make sense to have more modes or additional toggles, but
>>>>>>> "madvise=never" means no memory waste.
>>>>>>
>>>>>> I hate the existing mechanisms.  They are an abdication of our
>>>>>> responsibility, and an attempt to blame the user (be it the sysadmin
>>>>>> or the programmer) of our code for using it wrongly.  We should not
>>>>>> replicate this mistake.
>>>>>
>>>>> I don't agree regarding the programmer responsibility. In some cases the
>>>>> programmer really doesn't want to get more memory populated than requested --
>>>>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do.
>>>>>
>>>>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste
>>>>> (and nailing down bugs or working around them in customer setups) have been very
>>>>> good reasons to let the admin have a word.
>>>>>
>>>>>>
>>>>>> Our code should be auto-tuning.  I posted a long, detailed outline here:
>>>>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/
>>>>>>
>>>>>
>>>>> Well, "auto-tuning" also should be perfect for everybody, but once reality
>>>>> strikes you know it isn't.
>>>>>
>>>>> If people don't feel like using THP, let them have a word. The "madvise" config
>>>>> option is probably more controversial. But the "always vs. never" absolutely
>>>>> makes sense to me.
>>>>>
>>>>>>> I remember I raised it already in the past, but you *absolutely* have to
>>>>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
>>>>>>> example, userfaultfd) that doesn't want the kernel to populate any
>>>>>>> additional page tables. So if you have to respect that already, then also
>>>>>>> respect MADV_HUGEPAGE, simple.
>>>>>>
>>>>>> Possibly having uffd enabled on a VMA should disable using large folios,
>>>>>
>>>>> There are cases where we enable uffd *after* already touching memory (postcopy
>>>>> live migration in QEMU being the famous example). That doesn't fly.
>>>>>
>>>>>> I can get behind that.  But the notion that userspace knows what it's
>>>>>> doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
>>>>>> know what it's doing.
>>>>>
>>>>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in
>>>>> some cases. And these include cases I care about messing with sparse VM memory :)
>>>>>
>>>>> I have strong opinions against populating more than required when user space set
>>>>> MADV_NOHUGEPAGE.
>>>>
>>>> I can see your point about honouring MADV_NOHUGEPAGE, so think that it is
>>>> reasonable to fallback to allocating an order-0 page in a VMA that has it set.
>>>> The app has gone out of its way to explicitly set it, after all.
>>>>
>>>> I think the correct behaviour for the global thp controls (cmdline and sysfs)
>>>> are less obvious though. I could get on board with disabling large anon folios
>>>> globally when thp="never". But for other situations, I would prefer to keep
>>>> large anon folios enabled (treat "madvise" as "always"),
>>>
>>> If we have some mechanism to auto-tune the large folios usage, for
>>> example, detect the internal fragmentation and split the large folio,
>>> then we can use thp="always" as default configuration.  If my memory
>>> were correct, this is what Johannes and Alexander is working on.
>>
>> Could you point me to that work? I'd like to understand what the mechanism is.
>> The other half of my work aims to use arm64's pte "contiguous bit" to tell the
>> HW that a span of PTEs share the same mapping and is therefore coalesced into a
>> single TLB entry. The side effect of this, however, is that we only have a
>> single access and dirty bit for the whole contpte extent. So I'd like to avoid
>> any mechanism that relies on getting access/dirty at the base page granularity
>> for a large folio.
> 
> Please take a look at the THP shrinker patchset,
> 
> https://lore.kernel.org/linux-mm/cover.1667454613.git.alexlzhu@fb.com/

Thanks!

> 
>>>
>>>> with the argument that
>>>> their order is much smaller than traditional THP and therefore the internal
>>>> fragmentation is significantly reduced.
>>>
>>> Do you have any data for this?
>>
>> Some; it's partly based on intuition that the smaller the allocation unit, the
>> smaller the internal fragmentation. And partly on peak memory usage data I've
>> collected for the benchmarks I'm running, comparing baseline-4k kernel with
>> baseline-16k and baseline-64k kernels along with a 4k kernel that supports large
>> anon folios (I appreciate that's not exactly what we are talking about here, and
>> it's not exactly an extensive set of results!):
>>
>>
>> Kernel Compilation with 8 Jobs:
>> | kernel        |   peak |
>> |:--------------|-------:|
>> | baseline-4k   |   0.0% |
>> | anonfolio     |   0.1% |
>> | baseline-16k  |   6.3% |
>> | baseline-64k  |  28.1% |
>>
>>
>> Kernel Compilation with 80 Jobs:
>> | kernel        |   peak |
>> |:--------------|-------:|
>> | baseline-4k   |   0.0% |
>> | anonfolio     |   1.7% |
>> | baseline-16k  |   2.6% |
>> | baseline-64k  |  12.3% |
>>
> 
> Why is anonfolio better than baseline-64k if you always allocate 64k
> anonymous folio?  Because page cache uses 64k in baseline-64k?

No, because the VMA boundaries are aligned to 4K, not 64K. Large Anon Folios
only allocates a 64K folio if it does not breach the bounds of the VMA (and if
it doesn't overlap any already-populated PTEs).
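
To illustrate (a minimal standalone sketch, not the code from this series;
fit_order() and the example addresses are made up, and the real logic also
has to consider things like MADV_NOHUGEPAGE, already-populated PTEs and
allocation failure): the preferred order is only used when the naturally
aligned block around the faulting address lies entirely inside the VMA; in
this sketch smaller orders are tried before falling back to order-0.

#include <stdio.h>

#define PAGE_SHIFT 12	/* 4K base pages assumed */

static int fit_order(unsigned long addr, unsigned long vm_start,
		     unsigned long vm_end, int preferred_order)
{
	for (int order = preferred_order; order > 0; order--) {
		unsigned long size  = 1UL << (order + PAGE_SHIFT);
		unsigned long start = addr & ~(size - 1);	/* natural alignment */

		if (start >= vm_start && start + size <= vm_end)
			return order;
	}
	return 0;	/* fall back to a single base page */
}

int main(void)
{
	/* a VMA starting 4K past a 64K boundary: its first page falls back to
	 * order-0, while a 64K-aligned address inside the VMA gets order 4 */
	unsigned long vm_start = 0x401000, vm_end = 0x430000;

	printf("%d\n", fit_order(0x401000, vm_start, vm_end, 4));	/* 0 */
	printf("%d\n", fit_order(0x410000, vm_start, vm_end, 4));	/* 4 */
	return 0;
}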

> 
> We may need to test some workloads with sparse access patterns too.

Yes, I agree that if you have a workload with a pathological memory access
pattern that writes to addresses with a stride of 64K, all contained in a single
VMA, then you will end up allocating 16x the memory (a whole 64K folio for each
4K page actually touched). This is obviously an unrealistic extreme though.

> 
> Best Regards,
> Huang, Ying
> 
>>>
>>>> I really don't want to end up with user
>>>> space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large
>>>> anon folios.
>>>>
>>>> I still feel that it would be better for the thp and large anon folio controls
>>>> to be independent though - what's the argument for tying them together?
>>>>
> 


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 2/5] mm: Allow deferred splitting of arbitrary large anon folios
  2023-07-10  9:01             ` Huang, Ying
@ 2023-07-10  9:39               ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-10  9:39 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On 10/07/2023 10:01, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> On 10/07/2023 06:37, Huang, Ying wrote:
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
>>>> Somehow I managed to reply only to the linux-arm-kernel list on first attempt so
>>>> resending:
>>>>
>>>> On 07/07/2023 09:21, Huang, Ying wrote:
>>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>>
>>>>>> With the introduction of large folios for anonymous memory, we would
>>>>>> like to be able to split them when they have unmapped subpages, in order
>>>>>> to free those unused pages under memory pressure. So remove the
>>>>>> artificial requirement that the large folio needed to be at least
>>>>>> PMD-sized.
>>>>>>
>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>> Reviewed-by: Yu Zhao <yuzhao@google.com>
>>>>>> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
>>>>>> ---
>>>>>>  mm/rmap.c | 2 +-
>>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>> index 82ef5ba363d1..bbcb2308a1c5 100644
>>>>>> --- a/mm/rmap.c
>>>>>> +++ b/mm/rmap.c
>>>>>> @@ -1474,7 +1474,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
>>>>>>  		 * page of the folio is unmapped and at least one page
>>>>>>  		 * is still mapped.
>>>>>>  		 */
>>>>>> -		if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
>>>>>> +		if (folio_test_large(folio) && folio_test_anon(folio))
>>>>>>  			if (!compound || nr < nr_pmdmapped)
>>>>>>  				deferred_split_folio(folio);
>>>>>>  	}
>>>>>
>>>>> One possible issue is that even for large folios mapped only in one
>>>>> process, in zap_pte_range(), we will always call deferred_split_folio()
>>>>> unnecessarily before freeing a large folio.
>>>>
>>>> Hi Huang, thanks for reviewing!
>>>>
>>>> I have a patch that solves this problem by determining a range of ptes covered
>>>> by a single folio and doing a "batch zap". This prevents the need to add the
>>>> folio to the deferred split queue, only to remove it again shortly afterwards.
>>>> This reduces lock contention and I can measure a performance improvement for the
>>>> kernel compilation benchmark. See [1].
>>>>
>>>> However, I decided to remove it from this patch set on Yu Zhao's advice. We are
>>>> aiming for the minimal patch set to start with and wanted to focus people on
>>>> that. I intend to submit it separately later on.
>>>>
>>>> [1] https://lore.kernel.org/linux-mm/20230626171430.3167004-8-ryan.roberts@arm.com/
>>>
>>> Thanks for your information!  "batch zap" can solve the problem.
>>>
>>> And, I agree with Matthew's comments to fix the large folios interaction
>>> issues before merging the patches to allocate large folios as in the
>>> following email.
>>>
>>> https://lore.kernel.org/linux-mm/ZKVdUDuwNWDUCWc5@casper.infradead.org/
>>>
>>> If so, we don't need to introduce the above problem or a large patchset.
>>
>> I appreciate Matthew's and others' position about not wanting to merge a minimal
>> implementation while there are some fundamental features (e.g. compaction) it
>> doesn't play well with - I'm working to create a definitive list so these items
>> can be tracked and tackled.
> 
> Good to know this, Thanks!
> 
>> That said, I don't see this "batch zap" patch as an example of this. It's just a
>> performance enhancement that improves things even further than large anon folios
>> on their own. I'd rather concentrate on the core changes first then deal with
>> this type of thing later. Does that work for you?
> 
> IIUC, allocating large folios upon page fault depends on splitting large
> folios in page_remove_rmap() to avoid memory wastage.  Splitting large
> folios in page_remove_rmap() depends on "batch zap" to avoid performance
> regression in zap_pte_range().  So we need them to be done earlier.  Or
> I miss something?

My point was just that large anon folios improve performance significantly
overall, despite a small perf regression in zap_pte_range(). That regression is
reduced further by a patch from Yin Fengwei that reduces the lock contention [1].
So it doesn't seem urgent to me to get the "batch zap" change in.

I'll add it to my list, then prioritize it against the other stuff.

[1] https://lore.kernel.org/linux-mm/20230429082759.1600796-1-fengwei.yin@intel.com/
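
For reference, a toy model of the "batch zap" idea being discussed
(standalone C, not the actual patch; struct pte, folio_id and batch_len()
are invented for this sketch): instead of tearing down one PTE at a time,
which for a large anon folio means queueing it for deferred split only to
free it moments later, the run of consecutive PTEs mapping the same folio is
found and handled as a single batch.

#include <stdio.h>

struct pte { int folio_id; };	/* toy: which folio this PTE maps */

static int batch_len(const struct pte *pte, int nr_left)
{
	int n = 1;

	while (n < nr_left && pte[n].folio_id == pte[0].folio_id)
		n++;
	return n;
}

int main(void)
{
	struct pte ptes[] = { {1}, {1}, {1}, {1}, {2}, {3}, {3} };
	int nr = (int)(sizeof(ptes) / sizeof(ptes[0]));

	for (int i = 0; i < nr; ) {
		int n = batch_len(&ptes[i], nr - i);

		/* one rmap-removal per folio instead of one per PTE, so a
		 * fully unmapped folio need not visit the deferred split queue */
		printf("zap folio %d: %d pte(s) in one batch\n",
		       ptes[i].folio_id, n);
		i += n;
	}
	return 0;
}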

> 
> Best Regards,
> Huang, Ying


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
  2023-07-07 13:24           ` David Hildenbrand
@ 2023-07-10 10:07             ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-10 10:07 UTC (permalink / raw)
  To: David Hildenbrand, Matthew Wilcox
  Cc: Andrew Morton, Kirill A. Shutemov, Yin Fengwei, Yu Zhao,
	Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi,
	linux-arm-kernel, linux-kernel, linux-mm

On 07/07/2023 14:24, David Hildenbrand wrote:
> On 07.07.23 15:12, Matthew Wilcox wrote:
>> On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote:
>>> On 06.07.23 10:02, Ryan Roberts wrote:
>>> But can you comment on the page migration part (IOW did you try it already)?
>>>
>>> For example, memory hotunplug, CMA, MCE handling, compaction all rely on
>>> page migration of something that was allocated using GFP_MOVABLE to actually
>>> work.
>>>
>>> Compaction seems to skip any higher-order folios, but the question is if the
>>> underlying migration itself works.
>>>
>>> If it already works: great! If not, this really has to be tackled early,
>>> because otherwise we'll be breaking the GFP_MOVABLE semantics.
>>
>> I have looked at this a bit.  _Migration_ should be fine.  _Compaction_
>> is not.
> 
> Thanks! Very nice if at least ordinary migration works.

That's good to hear - I hadn't personally investigated.

> 
>>
>> If you look at a function like folio_migrate_mapping(), it all seems
>> appropriately folio-ised.  There might be something in there that is
>> slightly wrong, but that would just be a bug to fix, not a huge
>> architectural problem.
>>
>> The problem comes in the callers of migrate_pages().  They pass a
>> new_folio_t callback.  alloc_migration_target() is the usual one passed
>> and as far as I can tell is fine.  I've seen no problems reported with it.
>>
>> compaction_alloc() is a disaster, and I don't know how to fix it.
>> The compaction code has its own allocator which is populated with order-0
>> folios.  How it populates that freelist is awful ... see split_map_pages()

I think this compaction issue also affects large folios in the page cache? So
really it is a pre-existing bug in the code base that needs to be fixed
independently of large anon folios? Should I assume you are tackling this, Matthew?

> 
> Yeah, all that code was written under the assumption that we're moving order-0
> pages (which is what the anon+pagecache pages are).
> 
> From what I recall, we're allocating order-0 pages from the high memory
> addresses, so we can migrate from low memory addresses, effectively freeing up
> low memory addresses and filling high memory addresses.
> 
> Adjusting that will be ... interesting. Instead of allocating order-0 pages from
> high addresses, we might want to allocate "as large as possible" ("grab what we
> can") from high addresses and then have our own kind of buddy for allocating
> from that pool a compaction destination page, depending on our source page. Nasty.
> 
> What should always work is the split->migrate. But that's definitely not what we
> want in many cases.
> 
>>
>>> Is swapping working as expected? zswap?
>>
>> Suboptimally.  Swap will split folios in order to swap them.  Somebody
>> needs to fix that, but it should work.
> 
> Good!
> 
> It would be great to have some kind of a feature matrix that tells us what works
> perfectly, sub-optimally, barely, not at all (and what has not been tested).
> Maybe (likely!) we'll also find things that are sub-optimal for ordinary THP
> (like swapping, not even sure about).

I'm building a list of known issues, but so far it has been based on code I've
found during review and things raised by people in these threads. Are there test
suites that explicitly test these features? If so I'll happily run them against
large anon folios, but at the moment I'm ignorant, I'm afraid. I have been trying
to get mm selftests up and running, but I currently have a bunch of failures on
arm64, even without any of my patches - something I'm working through.

> 
> I suspect that KSM should work mostly fine with flexible-thp. When
> deduplicating, we'll simply split the compound page and proceed as expected. But
> might be worth testing as well.
> 


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
  2023-07-07 13:24           ` David Hildenbrand
@ 2023-07-10 16:53             ` Zi Yan
  -1 siblings, 0 replies; 167+ messages in thread
From: Zi Yan @ 2023-07-10 16:53 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Matthew Wilcox, Ryan Roberts, Andrew Morton, Kirill A. Shutemov,
	Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

[-- Attachment #1: Type: text/plain, Size: 2749 bytes --]

On 7 Jul 2023, at 9:24, David Hildenbrand wrote:

> On 07.07.23 15:12, Matthew Wilcox wrote:
>> On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote:
>>> On 06.07.23 10:02, Ryan Roberts wrote:
>>> But can you comment on the page migration part (IOW did you try it already)?
>>>
>>> For example, memory hotunplug, CMA, MCE handling, compaction all rely on
>>> page migration of something that was allocated using GFP_MOVABLE to actually
>>> work.
>>>
>>> Compaction seems to skip any higher-order folios, but the question is if the
>>> underlying migration itself works.
>>>
>>> If it already works: great! If not, this really has to be tackled early,
>>> because otherwise we'll be breaking the GFP_MOVABLE semantics.
>>
>> I have looked at this a bit.  _Migration_ should be fine.  _Compaction_
>> is not.
>
> Thanks! Very nice if at least ordinary migration works.
>
>>
>> If you look at a function like folio_migrate_mapping(), it all seems
>> appropriately folio-ised.  There might be something in there that is
>> slightly wrong, but that would just be a bug to fix, not a huge
>> architectural problem.
>>
>> The problem comes in the callers of migrate_pages().  They pass a
>> new_folio_t callback.  alloc_migration_target() is the usual one passed
>> and as far as I can tell is fine.  I've seen no problems reported with it.
>>
>> compaction_alloc() is a disaster, and I don't know how to fix it.
>> The compaction code has its own allocator which is populated with order-0
>> folios.  How it populates that freelist is awful ... see split_map_pages()
>
> Yeah, all that code was written under the assumption that we're moving order-0 pages (which is what the anon+pagecache pages are).
>
> From what I recall, we're allocating order-0 pages from the high memory addresses, so we can migrate from low memory addresses, effectively freeing up low memory addresses and filling high memory addresses.
>
> Adjusting that will be ... interesting. Instead of allocating order-0 pages from high addresses, we might want to allocate "as large as possible" ("grab what we can") from high addresses and then have our own kind of buddy for allocating from that pool a compaction destination page, depending on our source page. Nasty.

We probably do not need a pool, since before migration, we have isolated folios to
be migrated and can come up with stats on how many folios there are at each order.
Then, we can isolate free pages based on the stats and do not split free pages
all the way down to order-0. We can sort the source folios based on their orders
and isolate free pages from largest order to smallest order. That could avoid
a free page pool.
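
A minimal user-space sketch of that scheme, with invented names and an example
input (this only models the idea; it is not compaction code):

/*
 * Build per-order stats for the isolated source folios, then report how
 * many free blocks of each order compaction would need to isolate,
 * walking from the largest order down so large free blocks are only
 * split when a large migration target is actually required.
 */
#include <stdio.h>

#define MAX_ORDER 10

int main(void)
{
    /* Orders of the folios isolated for migration (example input). */
    int source_orders[] = { 0, 0, 4, 0, 2, 4, 0, 9 };
    int nr_sources = sizeof(source_orders) / sizeof(source_orders[0]);
    int needed[MAX_ORDER + 1] = { 0 };
    int order, i;

    for (i = 0; i < nr_sources; i++)
        needed[source_orders[i]]++;

    for (order = MAX_ORDER; order >= 0; order--)
        if (needed[order])
            printf("isolate %d free block(s) of order %d\n",
                   needed[order], order);
    return 0;
}
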

--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
  2023-07-10 10:07             ` Ryan Roberts
@ 2023-07-10 16:57               ` Matthew Wilcox
  -1 siblings, 0 replies; 167+ messages in thread
From: Matthew Wilcox @ 2023-07-10 16:57 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: David Hildenbrand, Andrew Morton, Kirill A. Shutemov,
	Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

On Mon, Jul 10, 2023 at 11:07:47AM +0100, Ryan Roberts wrote:
> I think this compaction issue also affects large folios in the page cache? So
> really it is a pre-existing bug in the code base that needs to be fixed
> independently of large anon folios? Should I assume you are tackling this, Matthew?

It does need to be fixed independently of large anon folios.  Said fix
should probably be backported to 6.1 once it's suitably stable.  However,
I'm not working on it.  I have a lot of projects and this one's a
missed-opportunity, not a show-stopper.  Sounds like Zi Yan might be
interested in tackling it though!


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-10  9:25                       ` Ryan Roberts
@ 2023-07-11  0:48                         ` Huang, Ying
  -1 siblings, 0 replies; 167+ messages in thread
From: Huang, Ying @ 2023-07-11  0:48 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: David Hildenbrand, Matthew Wilcox, Andrew Morton,
	Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas,
	Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel,
	linux-kernel, linux-mm, Johannes Weiner, Alexander Zhu

Ryan Roberts <ryan.roberts@arm.com> writes:

> On 10/07/2023 10:18, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> 
>>> On 10/07/2023 04:03, Huang, Ying wrote:
>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>
>>>>> On 07/07/2023 15:07, David Hildenbrand wrote:
>>>>>> On 07.07.23 15:57, Matthew Wilcox wrote:
>>>>>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
>>>>>>>> On 07.07.23 11:52, Ryan Roberts wrote:
>>>>>>>>> On 07/07/2023 09:01, Huang, Ying wrote:
>>>>>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>>>>>>>>>> avoid internal fragmentation completely.  So, I think that finally we
>>>>>>>>>> will need to provide a mechanism for the users to opt out, e.g.,
>>>>>>>>>> something like "always madvise never" via
>>>>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>>>>>>>>>> a good idea to reuse the existing interface of THP.
>>>>>>>>>
>>>>>>>>> I wouldn't want to tie this to the existing interface, simply because that
>>>>>>>>> implies that we would want to follow the "always" and "madvise" advice too;
>>>>>>>>> That
>>>>>>>>> means that on a thp=madvise system (which is certainly the case for android and
>>>>>>>>> other client systems) we would have to disable large anon folios for VMAs that
>>>>>>>>> haven't explicitly opted in. That breaks the intention that this should be an
>>>>>>>>> invisible performance boost. I think it's important to set the policy for
>>>>>>>>> use of
>>>>>>>>
>>>>>>>> It will never ever be a completely invisible performance boost, just like
>>>>>>>> ordinary THP.
>>>>>>>>
>>>>>>>> Using the exact same existing toggle is the right thing to do. If someone
>>>>>>>> specify "never" or "madvise", then do exactly that.
>>>>>>>>
>>>>>>>> It might make sense to have more modes or additional toggles, but
>>>>>>>> "madvise=never" means no memory waste.
>>>>>>>
>>>>>>> I hate the existing mechanisms.  They are an abdication of our
>>>>>>> responsibility, and an attempt to blame the user (be it the sysadmin
>>>>>>> or the programmer) of our code for using it wrongly.  We should not
>>>>>>> replicate this mistake.
>>>>>>
>>>>>> I don't agree regarding the programmer responsibility. In some cases the
>>>>>> programmer really doesn't want to get more memory populated than requested --
>>>>>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do.
>>>>>>
>>>>>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste
>>>>>> (and nailing down bugs or working around them in customer setups) have been very
>>>>>> good reasons to let the admin have a word.
>>>>>>
>>>>>>>
>>>>>>> Our code should be auto-tuning.  I posted a long, detailed outline here:
>>>>>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/
>>>>>>>
>>>>>>
>>>>>> Well, "auto-tuning" also should be perfect for everybody, but once reality
>>>>>> strikes you know it isn't.
>>>>>>
>>>>>> If people don't feel like using THP, let them have a word. The "madvise" config
>>>>>> option is probably more controversial. But the "always vs. never" absolutely
>>>>>> makes sense to me.
>>>>>>
>>>>>>>> I remember I raised it already in the past, but you *absolutely* have to
>>>>>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
>>>>>>>> example, userfaultfd) that doesn't want the kernel to populate any
>>>>>>>> additional page tables. So if you have to respect that already, then also
>>>>>>>> respect MADV_HUGEPAGE, simple.
>>>>>>>
>>>>>>> Possibly having uffd enabled on a VMA should disable using large folios,
>>>>>>
>>>>>> There are cases where we enable uffd *after* already touching memory (postcopy
>>>>>> live migration in QEMU being the famous example). That doesn't fly.
>>>>>>
>>>>>>> I can get behind that.  But the notion that userspace knows what it's
>>>>>>> doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
>>>>>>> know what it's doing.
>>>>>>
>>>>>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in
>>>>>> some cases. And these include cases I care about messing with sparse VM memory :)
>>>>>>
>>>>>> I have strong opinions against populating more than required when user space set
>>>>>> MADV_NOHUGEPAGE.
>>>>>
>>>>> I can see your point about honouring MADV_NOHUGEPAGE, so think that it is
>>>>> reasonable to fallback to allocating an order-0 page in a VMA that has it set.
>>>>> The app has gone out of its way to explicitly set it, after all.
>>>>>
>>>>> I think the correct behaviour for the global thp controls (cmdline and sysfs)
>>>>> are less obvious though. I could get on board with disabling large anon folios
>>>>> globally when thp="never". But for other situations, I would prefer to keep
>>>>> large anon folios enabled (treat "madvise" as "always"),
>>>>
>>>> If we have some mechanism to auto-tune the large folios usage, for
>>>> example, detect the internal fragmentation and split the large folio,
>>>> then we can use thp="always" as default configuration.  If my memory
>>>> were correct, this is what Johannes and Alexander is working on.
>>>
>>> Could you point me to that work? I'd like to understand what the mechanism is.
>>> The other half of my work aims to use arm64's pte "contiguous bit" to tell the
>>> HW that a span of PTEs share the same mapping and is therefore coalesced into a
>>> single TLB entry. The side effect of this, however, is that we only have a
>>> single access and dirty bit for the whole contpte extent. So I'd like to avoid
>>> any mechanism that relies on getting access/dirty at the base page granularity
>>> for a large folio.
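
To illustrate the constraint described above, a small user-space model of a
contpte span that shares one access bit and one dirty bit (names invented;
this is not the arm64 implementation):

/*
 * A contiguous span of CONT_PTES entries is tracked with a single
 * young (accessed) bit and a single dirty bit, so touching any one
 * subpage makes the whole span look accessed/dirty and per-subpage
 * information is lost.
 */
#include <stdbool.h>
#include <stdio.h>

#define CONT_PTES 16    /* e.g. 16 x 4K pages = one 64K contpte span */

struct contpte_span {
    bool young;         /* one access bit for all CONT_PTES entries */
    bool dirty;         /* one dirty bit for all CONT_PTES entries */
};

/* Simulate the CPU touching one subpage within the span. */
static void touch_subpage(struct contpte_span *span, int idx, bool write)
{
    (void)idx;          /* which subpage was touched is not recorded */
    span->young = true;
    if (write)
        span->dirty = true;
}

int main(void)
{
    struct contpte_span span = { false, false };
    int i;

    touch_subpage(&span, 3, true);  /* write to subpage 3 only */

    /* Every subpage in the span now reports the same state. */
    for (i = 0; i < CONT_PTES; i++)
        printf("subpage %2d: young=%d dirty=%d\n", i, span.young, span.dirty);
    return 0;
}
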
>> 
>> Please take a look at the THP shrinker patchset,
>> 
>> https://lore.kernel.org/linux-mm/cover.1667454613.git.alexlzhu@fb.com/
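
As a rough model of that auto-tuning idea (detect internal fragmentation and
split), here is a user-space sketch that scans a large folio's subpages for
zero-filled ones and decides whether to split; the details are invented for
illustration and are not taken from the linked series:

/*
 * Count the subpages of a "folio" that are entirely zero-filled and
 * decide to split (so the unused subpages can be freed) when more than
 * a threshold of them are unused.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define SUBPAGE_SIZE 4096
#define NR_SUBPAGES  16     /* e.g. a 64K folio made of 4K subpages */

static bool subpage_is_zero(const unsigned char *subpage)
{
    size_t i;

    for (i = 0; i < SUBPAGE_SIZE; i++)
        if (subpage[i])
            return false;
    return true;
}

static bool should_split(unsigned char folio[NR_SUBPAGES][SUBPAGE_SIZE],
                         int max_zero_subpages)
{
    int nr_zero = 0, i;

    for (i = 0; i < NR_SUBPAGES; i++)
        if (subpage_is_zero(folio[i]))
            nr_zero++;

    return nr_zero > max_zero_subpages;
}

int main(void)
{
    static unsigned char folio[NR_SUBPAGES][SUBPAGE_SIZE];

    /* Only the first two subpages were ever written. */
    memset(folio[0], 0xaa, SUBPAGE_SIZE);
    memset(folio[1], 0xbb, SUBPAGE_SIZE);

    printf("split folio: %s\n",
           should_split(folio, NR_SUBPAGES / 2) ? "yes" : "no");
    return 0;
}
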
>
> Thanks!
>
>> 
>>>>
>>>>> with the argument that
>>>>> their order is much smaller than traditional THP and therefore the internal
>>>>> fragmentation is significantly reduced.
>>>>
>>>> Do you have any data for this?
>>>
>>> Some; it's partly based on intuition that the smaller the allocation unit, the
>>> smaller the internal fragmentation. And partly on peak memory usage data I've
>>> collected for the benchmarks I'm running, comparing baseline-4k kernel with
>>> baseline-16k and baseline-64k kernels along with a 4k kernel that supports large
>>> anon folios (I appreciate that's not exactly what we are talking about here, and
>>> it's not exactly an extensive set of results!):
>>>
>>>
>>> Kernel Compilation with 8 Jobs:
>>> | kernel        |   peak |
>>> |:--------------|-------:|
>>> | baseline-4k   |   0.0% |
>>> | anonfolio     |   0.1% |
>>> | baseline-16k  |   6.3% |
>>> | baseline-64k  |  28.1% |
>>>
>>>
>>> Kernel Compilation with 80 Jobs:
>>> | kernel        |   peak |
>>> |:--------------|-------:|
>>> | baseline-4k   |   0.0% |
>>> | anonfolio     |   1.7% |
>>> | baseline-16k  |   2.6% |
>>> | baseline-64k  |  12.3% |
>>>
>> 
>> Why is anonfolio better than baseline-64k if you always allocate 64k
>> anonymous folio?  Because page cache uses 64k in baseline-64k?
>
> No, because the VMA boundaries are aligned to 4K and not 64K. Large Anon Folios
> only allocates a 64K folio if it does not breach the bounds of the VMA (and if
> it doesn't overlap other allocated PTEs).
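
A minimal sketch of that order-selection logic, with invented helper names
(this models the described behaviour only; it is not the series' code):

/*
 * Start from the preferred order and fall back towards order 0 until
 * the naturally aligned range both fits inside the VMA and covers no
 * already-populated PTEs.
 */
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

struct fake_vma {
    unsigned long start;    /* inclusive, page aligned */
    unsigned long end;      /* exclusive, page aligned */
};

/* Hypothetical helper: is the PTE for 'addr' already populated? */
static bool pte_populated(unsigned long addr)
{
    (void)addr;
    return false;           /* pretend the range is entirely unpopulated */
}

static int select_anon_order(const struct fake_vma *vma,
                             unsigned long fault_addr, int preferred_order)
{
    int order;

    for (order = preferred_order; order > 0; order--) {
        unsigned long size = PAGE_SIZE << order;
        unsigned long base = fault_addr & ~(size - 1);
        unsigned long addr;
        bool ok = true;

        /* Must not breach the VMA bounds ... */
        if (base < vma->start || base + size > vma->end)
            continue;

        /* ... and must not overlap already-populated PTEs. */
        for (addr = base; addr < base + size; addr += PAGE_SIZE) {
            if (pte_populated(addr)) {
                ok = false;
                break;
            }
        }

        if (ok)
            return order;
    }
    return 0;   /* fall back to a single 4K page */
}

int main(void)
{
    struct fake_vma vma = { 0x400000, 0x400000 + 20 * PAGE_SIZE };

    /* Fault near the end of the VMA: a 64K (order-4) folio would breach
     * the bounds, so a smaller order is chosen instead. */
    printf("chosen order: %d\n",
           select_anon_order(&vma, 0x400000 + 18 * PAGE_SIZE, 4));
    return 0;
}
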

Thanks for explanation!

We will use more memory for the file cache too with baseline-64k, right?  So,
you observed many more anonymous pages, but not so for file cache pages?

>> 
>> We may need to test some workloads with sparse access patterns too.
>
> Yes, I agree if you have a workload with a pathological memory access pattern
> where it writes to addresses with a stride of 64K, all contained in a single
> VMA, then you will end up allocating 16x the memory. This is obviously an
> unrealistic extreme though.

I think that there should be some realistic workload which has sparse
access patterns.

Best Regards,
Huang, Ying

>> 
>>>>
>>>>> I really don't want to end up with user
>>>>> space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large
>>>>> anon folios.
>>>>>
>>>>> I still feel that it would be better for the thp and large anon folio controls
>>>>> to be independent though - what's the argument for tying them together?
>>>>>
>> 

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 2/5] mm: Allow deferred splitting of arbitrary large anon folios
  2023-07-10  9:39               ` Ryan Roberts
@ 2023-07-11  1:56                 ` Huang, Ying
  -1 siblings, 0 replies; 167+ messages in thread
From: Huang, Ying @ 2023-07-11  1:56 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel,
	linux-mm

Ryan Roberts <ryan.roberts@arm.com> writes:

> On 10/07/2023 10:01, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> 
>>> On 10/07/2023 06:37, Huang, Ying wrote:
>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>
>>>>> Somehow I managed to reply only to the linux-arm-kernel list on first attempt so
>>>>> resending:
>>>>>
>>>>> On 07/07/2023 09:21, Huang, Ying wrote:
>>>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>>>
>>>>>>> With the introduction of large folios for anonymous memory, we would
>>>>>>> like to be able to split them when they have unmapped subpages, in order
>>>>>>> to free those unused pages under memory pressure. So remove the
>>>>>>> artificial requirement that the large folio needed to be at least
>>>>>>> PMD-sized.
>>>>>>>
>>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>> Reviewed-by: Yu Zhao <yuzhao@google.com>
>>>>>>> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
>>>>>>> ---
>>>>>>>  mm/rmap.c | 2 +-
>>>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>>> index 82ef5ba363d1..bbcb2308a1c5 100644
>>>>>>> --- a/mm/rmap.c
>>>>>>> +++ b/mm/rmap.c
>>>>>>> @@ -1474,7 +1474,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
>>>>>>>  		 * page of the folio is unmapped and at least one page
>>>>>>>  		 * is still mapped.
>>>>>>>  		 */
>>>>>>> -		if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
>>>>>>> +		if (folio_test_large(folio) && folio_test_anon(folio))
>>>>>>>  			if (!compound || nr < nr_pmdmapped)
>>>>>>>  				deferred_split_folio(folio);
>>>>>>>  	}
>>>>>>
>>>>>> One possible issue is that even for large folios mapped only in one
>>>>>> process, in zap_pte_range(), we will always call deferred_split_folio()
>>>>>> unnecessarily before freeing a large folio.
>>>>>
>>>>> Hi Huang, thanks for reviewing!
>>>>>
>>>>> I have a patch that solves this problem by determining a range of ptes covered
>>>>> by a single folio and doing a "batch zap". This prevents the need to add the
>>>>> folio to the deferred split queue, only to remove it again shortly afterwards.
>>>>> This reduces lock contention and I can measure a performance improvement for the
>>>>> kernel compilation benchmark. See [1].
>>>>>
>>>>> However, I decided to remove it from this patch set on Yu Zhao's advice. We are
>>>>> aiming for the minimal patch set to start with and wanted to focus people on
>>>>> that. I intend to submit it separately later on.
>>>>>
>>>>> [1] https://lore.kernel.org/linux-mm/20230626171430.3167004-8-ryan.roberts@arm.com/
>>>>
>>>> Thanks for your information!  "batch zap" can solve the problem.
>>>>
>>>> And, I agree with Matthew's comments to fix the large folios interaction
>>>> issues before merging the patches to allocate large folios as in the
>>>> following email.
>>>>
>>>> https://lore.kernel.org/linux-mm/ZKVdUDuwNWDUCWc5@casper.infradead.org/
>>>>
>>>> If so, we don't need to introduce the above problem or a large patchset.
>>>
>>> I appreciate Matthew's and others position about not wanting to merge a minimal
>>> implementation while there are some fundamental features (e.g. compaction) it
>>> doesn't play well with - I'm working to create a definitive list so these items
>>> can be tracked and tackled.
>> 
>> Good to know this, Thanks!
>> 
>>> That said, I don't see this "batch zap" patch as an example of this. It's just a
>>> performance enhancement that improves things even further than large anon folios
>>> on their own. I'd rather concentrate on the core changes first then deal with
>>> this type of thing later. Does that work for you?
>> 
>> IIUC, allocating large folios upon page fault depends on splitting large
>> folios in page_remove_rmap() to avoid memory wastage.  Splitting large
>> folios in page_remove_rmap() depends on "batch zap" to avoid performance
>> regression in zap_pte_range().  So we need them to be done earlier.  Or
>> I miss something?
>
> My point was just that large anon folios improves performance significantly
> overall, despite a small perf regression in zap_pte_range(). That regression is
> reduced further by a patch from Yin Fengwei to reduce the lock contention [1].
> So it doesn't seem urgent to me to get the "batch zap" change in.

I don't think Fengwei's patch will help much here, because that patch
optimizes the case where the folio isn't in the deferred split queue, but now
the folio will be put in the deferred split queue.

And I don't think allocating large folios upon page fault is more
urgent.  We should avoid regression if possible.

> I'll add it to my list, then prioritize it against the other stuff.
>
> [1] https://lore.kernel.org/linux-mm/20230429082759.1600796-1-fengwei.yin@intel.com/
>

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
  2023-07-07 13:12         ` Matthew Wilcox
@ 2023-07-11 21:11           ` Luis Chamberlain
  -1 siblings, 0 replies; 167+ messages in thread
From: Luis Chamberlain @ 2023-07-11 21:11 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Hildenbrand, Ryan Roberts, Andrew Morton,
	Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas,
	Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel,
	linux-kernel, linux-mm

On Fri, Jul 07, 2023 at 02:12:01PM +0100, Matthew Wilcox wrote:
> On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote:
> 
> > Is swapping working as expected? zswap?
> 
> Suboptimally.  Swap will split folios in order to swap them.

Wouldn't that mean that if high-order folios are used a lot but swap is also
used, then until this is fixed you wouldn't get the expected reclaim gains
for high-order folios and we'd need compaction more?

> Somebody needs to fix that, but it should work.

As we look at the shmem stuff it was on the path, so it is something we have
considered doing. I.e., it's on our team's list of items to help with,
but it is currently on the back burner.

  Luis

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
  2023-07-11 21:11           ` Luis Chamberlain
@ 2023-07-11 21:59             ` Matthew Wilcox
  -1 siblings, 0 replies; 167+ messages in thread
From: Matthew Wilcox @ 2023-07-11 21:59 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: David Hildenbrand, Ryan Roberts, Andrew Morton,
	Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas,
	Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel,
	linux-kernel, linux-mm

On Tue, Jul 11, 2023 at 02:11:19PM -0700, Luis Chamberlain wrote:
> On Fri, Jul 07, 2023 at 02:12:01PM +0100, Matthew Wilcox wrote:
> > On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote:
> > 
> > > Is swapping working as expected? zswap?
> > 
> > Suboptimally.  Swap will split folios in order to swap them.
> 
> Wouldn't that mean if high order folios are used a lot but swap is also
> used, until this is fixed you wouldn't get the expected reclaim gains
> for high order folios and we'd need compaction more then?

They're split in shrink_folio_list(), so they stay intact until
that point?

> > Somebody needs to fix that, but it should work.
> 
> As we look at shmem stuff it was on the path so something we have
> considered doing. Ie, it's on our team's list of items to help with
> but currently on a backburner.

Something I was thinking about is that you'll need to prohibit swap
devices or swap files being created on large block devices.  Until
we rewrite the entire swap subsystem ...

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
  2023-07-10 16:53             ` Zi Yan
@ 2023-07-19 15:49               ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-19 15:49 UTC (permalink / raw)
  To: Zi Yan, David Hildenbrand
  Cc: Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual,
	Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On 10/07/2023 17:53, Zi Yan wrote:
> On 7 Jul 2023, at 9:24, David Hildenbrand wrote:
> 
>> On 07.07.23 15:12, Matthew Wilcox wrote:
>>> On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote:
>>>> On 06.07.23 10:02, Ryan Roberts wrote:
>>>> But can you comment on the page migration part (IOW did you try it already)?
>>>>
>>>> For example, memory hotunplug, CMA, MCE handling, compaction all rely on
>>>> page migration of something that was allocated using GFP_MOVABLE to actually
>>>> work.
>>>>
>>>> Compaction seems to skip any higher-order folios, but the question is if the
>>>> underlying migration itself works.
>>>>
>>>> If it already works: great! If not, this really has to be tackled early,
>>>> because otherwise we'll be breaking the GFP_MOVABLE semantics.
>>>
>>> I have looked at this a bit.  _Migration_ should be fine.  _Compaction_
>>> is not.
>>
>> Thanks! Very nice if at least ordinary migration works.
>>
>>>
>>> If you look at a function like folio_migrate_mapping(), it all seems
>>> appropriately folio-ised.  There might be something in there that is
>>> slightly wrong, but that would just be a bug to fix, not a huge
>>> architectural problem.
>>>
>>> The problem comes in the callers of migrate_pages().  They pass a
>>> new_folio_t callback.  alloc_migration_target() is the usual one passed
>>> and as far as I can tell is fine.  I've seen no problems reported with it.
>>>
>>> compaction_alloc() is a disaster, and I don't know how to fix it.
>>> The compaction code has its own allocator which is populated with order-0
>>> folios.  How it populates that freelist is awful ... see split_map_pages()
>>
>> Yeah, all that code was written under the assumption that we're moving order-0 pages (which is what the anon+pagecache pages are).
>>
>> From what I recall, we're allocating order-0 pages from the high memory addresses, so we can migrate from low memory addresses, effectively freeing up low memory addresses and filling high memory addresses.
>>
>> Adjusting that will be ... interesting. Instead of allocating order-0 pages from high addresses, we might want to allocate "as large as possible" ("grab what we can") from high addresses and then have our own kind of buddy for allocating from that pool a compaction destination page, depending on our source page. Nasty.
> 
> We probably do not need a pool, since before migration, we have isolated folios to
> be migrated and can come up with stats on how many folios there are at each order.
> Then, we can isolate free pages based on the stats and do not split free pages
> all the way down to order-0. We can sort the source folios based on their orders
> and isolate free pages from largest order to smallest order. That could avoid
> a free page pool.

Hi Zi, I just wanted to check; is this something you are working on or planning
to work on? I'm trying to maintain a list of all the items that need to get
sorted for large anon folios. It would be great to put your name against it! ;-)

> 
> --
> Best Regards,
> Yan, Zi


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
  2023-07-19 15:49               ` Ryan Roberts
@ 2023-07-19 16:05                 ` Zi Yan
  -1 siblings, 0 replies; 167+ messages in thread
From: Zi Yan @ 2023-07-19 16:05 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: David Hildenbrand, Matthew Wilcox, Andrew Morton,
	Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas,
	Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel,
	linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 3221 bytes --]

On 19 Jul 2023, at 11:49, Ryan Roberts wrote:

> On 10/07/2023 17:53, Zi Yan wrote:
>> On 7 Jul 2023, at 9:24, David Hildenbrand wrote:
>>
>>> On 07.07.23 15:12, Matthew Wilcox wrote:
>>>> On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote:
>>>>> On 06.07.23 10:02, Ryan Roberts wrote:
>>>>> But can you comment on the page migration part (IOW did you try it already)?
>>>>>
>>>>> For example, memory hotunplug, CMA, MCE handling, compaction all rely on
>>>>> page migration of something that was allocated using GFP_MOVABLE to actually
>>>>> work.
>>>>>
>>>>> Compaction seems to skip any higher-order folios, but the question is if the
>>>>> underlying migration itself works.
>>>>>
>>>>> If it already works: great! If not, this really has to be tackled early,
>>>>> because otherwise we'll be breaking the GFP_MOVABLE semantics.
>>>>
>>>> I have looked at this a bit.  _Migration_ should be fine.  _Compaction_
>>>> is not.
>>>
>>> Thanks! Very nice if at least ordinary migration works.
>>>
>>>>
>>>> If you look at a function like folio_migrate_mapping(), it all seems
>>>> appropriately folio-ised.  There might be something in there that is
>>>> slightly wrong, but that would just be a bug to fix, not a huge
>>>> architectural problem.
>>>>
>>>> The problem comes in the callers of migrate_pages().  They pass a
>>>> new_folio_t callback.  alloc_migration_target() is the usual one passed
>>>> and as far as I can tell is fine.  I've seen no problems reported with it.
>>>>
>>>> compaction_alloc() is a disaster, and I don't know how to fix it.
>>>> The compaction code has its own allocator which is populated with order-0
>>>> folios.  How it populates that freelist is awful ... see split_map_pages()
>>>
>>> Yeah, all that code was written under the assumption that we're moving order-0 pages (which is what the anon+pagecache pages were).
>>>
>>> From what I recall, we're allocating order-0 pages from the high memory addresses, so we can migrate from low memory addresses, effectively freeing up low memory addresses and filling high memory addresses.
>>>
>>> Adjusting that will be ... interesting. Instead of allocating order-0 pages from high addresses, we might want to allocate "as large as possible" ("grab what we can") from high addresses and then have our own kind of buddy for allocating from that pool a compaction destination page, depending on our source page. Nasty.
>>
>> We probably do not need a pool, since before migration, we have isolated folios to
>> be migrated and can come up with stats on how many folios there are at each order.
>> Then, we can isolate free pages based on the stats and avoid splitting free pages
>> all the way down to order-0. We can sort the source folios based on their orders
>> and isolate free pages from largest order to smallest order. That could avoid
>> a free page pool.
>
> Hi Zi, I just wanted to check; is this something you are working on or planning
> to work on? I'm trying to maintain a list of all the items that need to get
> sorted for large anon folios. It would be great to put your name against it! ;-)

Sure. I can work on this one.

--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
  2023-07-19 16:05                 ` Zi Yan
@ 2023-07-19 18:37                   ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-19 18:37 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, Matthew Wilcox, Andrew Morton,
	Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas,
	Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel,
	linux-kernel, linux-mm

On 19/07/2023 17:05, Zi Yan wrote:
> On 19 Jul 2023, at 11:49, Ryan Roberts wrote:
> 
>> On 10/07/2023 17:53, Zi Yan wrote:
>>> On 7 Jul 2023, at 9:24, David Hildenbrand wrote:
>>>
>>>> On 07.07.23 15:12, Matthew Wilcox wrote:
>>>>> On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote:
>>>>>> On 06.07.23 10:02, Ryan Roberts wrote:
>>>>>> But can you comment on the page migration part (IOW did you try it already)?
>>>>>>
>>>>>> For example, memory hotunplug, CMA, MCE handling, compaction all rely on
>>>>>> page migration of something that was allocated using GFP_MOVABLE to actually
>>>>>> work.
>>>>>>
>>>>>> Compaction seems to skip any higher-order folios, but the question is if the
>>>>>> underlying migration itself works.
>>>>>>
>>>>>> If it already works: great! If not, this really has to be tackled early,
>>>>>> because otherwise we'll be breaking the GFP_MOVABLE semantics.
>>>>>
>>>>> I have looked at this a bit.  _Migration_ should be fine.  _Compaction_
>>>>> is not.
>>>>
>>>> Thanks! Very nice if at least ordinary migration works.
>>>>
>>>>>
>>>>> If you look at a function like folio_migrate_mapping(), it all seems
>>>>> appropriately folio-ised.  There might be something in there that is
>>>>> slightly wrong, but that would just be a bug to fix, not a huge
>>>>> architectural problem.
>>>>>
>>>>> The problem comes in the callers of migrate_pages().  They pass a
>>>>> new_folio_t callback.  alloc_migration_target() is the usual one passed
>>>>> and as far as I can tell is fine.  I've seen no problems reported with it.
>>>>>
>>>>> compaction_alloc() is a disaster, and I don't know how to fix it.
>>>>> The compaction code has its own allocator which is populated with order-0
>>>>> folios.  How it populates that freelist is awful ... see split_map_pages()
>>>>
>>>> Yeah, all that code was written under the assumption that we're moving order-0 pages (which is what the anon+pagecache pages were).
>>>>
>>>> From what I recall, we're allocating order-0 pages from the high memory addresses, so we can migrate from low memory addresses, effectively freeing up low memory addresses and filling high memory addresses.
>>>>
>>>> Adjusting that will be ... interesting. Instead of allocating order-0 pages from high addresses, we might want to allocate "as large as possible" ("grab what we can") from high addresses and then have our own kind of buddy for allocating from that pool a compaction destination page, depending on our source page. Nasty.
>>>
>>> We probably do not need a pool, since before migration, we have isolated folios to
>>> be migrated and can come up with stats on how many folios there are at each order.
>>> Then, we can isolate free pages based on the stats and avoid splitting free pages
>>> all the way down to order-0. We can sort the source folios based on their orders
>>> and isolate free pages from largest order to smallest order. That could avoid
>>> a free page pool.
>>
>> Hi Zi, I just wanted to check; is this something you are working on or planning
>> to work on? I'm trying to maintain a list of all the items that need to get
>> sorted for large anon folios. It would be great to put your name against it! ;-)
> 
> Sure. I can work on this one.

Awesome - thanks!

> 
> --
> Best Regards,
> Yan, Zi


^ permalink raw reply	[flat|nested] 167+ messages in thread

end of thread, other threads:[~2023-07-19 18:38 UTC | newest]

Thread overview: 167+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-03 13:53 [PATCH v2 0/5] variable-order, large folios for anonymous memory Ryan Roberts
2023-07-03 13:53 ` Ryan Roberts
2023-07-03 13:53 ` [PATCH v2 1/5] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap() Ryan Roberts
2023-07-03 13:53   ` Ryan Roberts
2023-07-03 19:05   ` Yu Zhao
2023-07-03 19:05     ` Yu Zhao
2023-07-04  2:13     ` Yin, Fengwei
2023-07-04  2:13       ` Yin, Fengwei
2023-07-04 11:19       ` Ryan Roberts
2023-07-04 11:19         ` Ryan Roberts
2023-07-04  2:14   ` Yin, Fengwei
2023-07-04  2:14     ` Yin, Fengwei
2023-07-03 13:53 ` [PATCH v2 2/5] mm: Allow deferred splitting of arbitrary large anon folios Ryan Roberts
2023-07-03 13:53   ` Ryan Roberts
2023-07-07  8:21   ` Huang, Ying
2023-07-07  8:21     ` Huang, Ying
2023-07-07  9:39     ` Ryan Roberts
2023-07-07  9:42     ` Ryan Roberts
2023-07-07  9:42       ` Ryan Roberts
2023-07-10  5:37       ` Huang, Ying
2023-07-10  5:37         ` Huang, Ying
2023-07-10  8:29         ` Ryan Roberts
2023-07-10  8:29           ` Ryan Roberts
2023-07-10  9:01           ` Huang, Ying
2023-07-10  9:01             ` Huang, Ying
2023-07-10  9:39             ` Ryan Roberts
2023-07-10  9:39               ` Ryan Roberts
2023-07-11  1:56               ` Huang, Ying
2023-07-11  1:56                 ` Huang, Ying
2023-07-03 13:53 ` [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order() Ryan Roberts
2023-07-03 13:53   ` Ryan Roberts
2023-07-03 19:50   ` Yu Zhao
2023-07-03 19:50     ` Yu Zhao
2023-07-04 13:20     ` Ryan Roberts
2023-07-04 13:20       ` Ryan Roberts
2023-07-05  2:07       ` Yu Zhao
2023-07-05  2:07         ` Yu Zhao
2023-07-05  9:11         ` Ryan Roberts
2023-07-05  9:11           ` Ryan Roberts
2023-07-05 17:24           ` Yu Zhao
2023-07-05 17:24             ` Yu Zhao
2023-07-05 18:01             ` Ryan Roberts
2023-07-05 18:01               ` Ryan Roberts
2023-07-06 19:33         ` Matthew Wilcox
2023-07-06 19:33           ` Matthew Wilcox
2023-07-07 10:00           ` Ryan Roberts
2023-07-07 10:00             ` Ryan Roberts
2023-07-04  2:22   ` Yin, Fengwei
2023-07-04  2:22     ` Yin, Fengwei
2023-07-04  3:02     ` Yu Zhao
2023-07-04  3:02       ` Yu Zhao
2023-07-04  3:59       ` Yu Zhao
2023-07-04  3:59         ` Yu Zhao
2023-07-04  5:22         ` Yin, Fengwei
2023-07-04  5:22           ` Yin, Fengwei
2023-07-04  5:42           ` Yu Zhao
2023-07-04  5:42             ` Yu Zhao
2023-07-04 12:36         ` Ryan Roberts
2023-07-04 12:36           ` Ryan Roberts
2023-07-04 13:23           ` Ryan Roberts
2023-07-04 13:23             ` Ryan Roberts
2023-07-05  1:40             ` Yu Zhao
2023-07-05  1:40               ` Yu Zhao
2023-07-05  1:23           ` Yu Zhao
2023-07-05  1:23             ` Yu Zhao
2023-07-05  2:18             ` Yin Fengwei
2023-07-05  2:18               ` Yin Fengwei
2023-07-03 13:53 ` [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance Ryan Roberts
2023-07-03 13:53   ` Ryan Roberts
2023-07-03 15:51   ` kernel test robot
2023-07-03 15:51     ` kernel test robot
2023-07-03 16:01   ` kernel test robot
2023-07-03 16:01     ` kernel test robot
2023-07-04  1:35   ` Yu Zhao
2023-07-04  1:35     ` Yu Zhao
2023-07-04 14:08     ` Ryan Roberts
2023-07-04 14:08       ` Ryan Roberts
2023-07-04 23:47       ` Yu Zhao
2023-07-04 23:47         ` Yu Zhao
2023-07-04  3:45   ` Yin, Fengwei
2023-07-04  3:45     ` Yin, Fengwei
2023-07-04 14:20     ` Ryan Roberts
2023-07-04 14:20       ` Ryan Roberts
2023-07-04 23:35       ` Yin Fengwei
2023-07-04 23:57       ` Matthew Wilcox
2023-07-04 23:57         ` Matthew Wilcox
2023-07-05  9:54         ` Ryan Roberts
2023-07-05  9:54           ` Ryan Roberts
2023-07-05 12:08           ` Matthew Wilcox
2023-07-05 12:08             ` Matthew Wilcox
2023-07-07  8:01   ` Huang, Ying
2023-07-07  8:01     ` Huang, Ying
2023-07-07  9:52     ` Ryan Roberts
2023-07-07  9:52       ` Ryan Roberts
2023-07-07 11:29       ` David Hildenbrand
2023-07-07 11:29         ` David Hildenbrand
2023-07-07 13:57         ` Matthew Wilcox
2023-07-07 13:57           ` Matthew Wilcox
2023-07-07 14:07           ` David Hildenbrand
2023-07-07 14:07             ` David Hildenbrand
2023-07-07 15:13             ` Ryan Roberts
2023-07-07 15:13               ` Ryan Roberts
2023-07-07 16:06               ` David Hildenbrand
2023-07-07 16:06                 ` David Hildenbrand
2023-07-07 16:22                 ` Ryan Roberts
2023-07-07 16:22                   ` Ryan Roberts
2023-07-07 19:06                   ` David Hildenbrand
2023-07-07 19:06                     ` David Hildenbrand
2023-07-10  8:41                     ` Ryan Roberts
2023-07-10  8:41                       ` Ryan Roberts
2023-07-10  3:03               ` Huang, Ying
2023-07-10  3:03                 ` Huang, Ying
2023-07-10  8:55                 ` Ryan Roberts
2023-07-10  8:55                   ` Ryan Roberts
2023-07-10  9:18                   ` Huang, Ying
2023-07-10  9:18                     ` Huang, Ying
2023-07-10  9:25                     ` Ryan Roberts
2023-07-10  9:25                       ` Ryan Roberts
2023-07-11  0:48                       ` Huang, Ying
2023-07-11  0:48                         ` Huang, Ying
2023-07-10  2:49           ` Huang, Ying
2023-07-10  2:49             ` Huang, Ying
2023-07-03 13:53 ` [PATCH v2 5/5] arm64: mm: Override arch_wants_pte_order() Ryan Roberts
2023-07-03 13:53   ` Ryan Roberts
2023-07-03 20:02   ` Yu Zhao
2023-07-03 20:02     ` Yu Zhao
2023-07-04  2:18 ` [PATCH v2 0/5] variable-order, large folios for anonymous memory Yu Zhao
2023-07-04  2:18   ` Yu Zhao
2023-07-04  6:22   ` Yin, Fengwei
2023-07-04  6:22     ` Yin, Fengwei
2023-07-04  7:11     ` Yu Zhao
2023-07-04  7:11       ` Yu Zhao
2023-07-04 15:36       ` Ryan Roberts
2023-07-04 15:36         ` Ryan Roberts
2023-07-04 23:52         ` Yin Fengwei
2023-07-05  0:21           ` Yu Zhao
2023-07-05  0:21             ` Yu Zhao
2023-07-05 10:16             ` Ryan Roberts
2023-07-05 10:16               ` Ryan Roberts
2023-07-05 19:00               ` Yu Zhao
2023-07-05 19:00                 ` Yu Zhao
2023-07-05 19:38 ` David Hildenbrand
2023-07-05 19:38   ` David Hildenbrand
2023-07-06  8:02   ` Ryan Roberts
2023-07-06  8:02     ` Ryan Roberts
2023-07-07 11:40     ` David Hildenbrand
2023-07-07 11:40       ` David Hildenbrand
2023-07-07 13:12       ` Matthew Wilcox
2023-07-07 13:12         ` Matthew Wilcox
2023-07-07 13:24         ` David Hildenbrand
2023-07-07 13:24           ` David Hildenbrand
2023-07-10 10:07           ` Ryan Roberts
2023-07-10 10:07             ` Ryan Roberts
2023-07-10 16:57             ` Matthew Wilcox
2023-07-10 16:57               ` Matthew Wilcox
2023-07-10 16:53           ` Zi Yan
2023-07-10 16:53             ` Zi Yan
2023-07-19 15:49             ` Ryan Roberts
2023-07-19 15:49               ` Ryan Roberts
2023-07-19 16:05               ` Zi Yan
2023-07-19 16:05                 ` Zi Yan
2023-07-19 18:37                 ` Ryan Roberts
2023-07-19 18:37                   ` Ryan Roberts
2023-07-11 21:11         ` Luis Chamberlain
2023-07-11 21:11           ` Luis Chamberlain
2023-07-11 21:59           ` Matthew Wilcox
2023-07-11 21:59             ` Matthew Wilcox
