* [PATCH v6 0/9] variable-order, large folios for anonymous memory
@ 2023-09-29 11:44 ` Ryan Roberts
  0 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-09-29 11:44 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: Ryan Roberts, linux-mm, linux-kernel, linux-arm-kernel

Hi All,

This is v6 of a series to implement variable-order, large folios for anonymous
memory (previously called "ANON_LARGE_FOLIO", "LARGE_ANON_FOLIO" and
"FLEXIBLE_THP", but now exposed as an extension to THP: "small-order THP"). The
objective is to improve performance by allocating larger chunks of memory
during anonymous page faults:

1) Since SW (the kernel) is dealing with larger chunks of memory than base
   pages, there are efficiency savings to be had: fewer page faults, batched PTE
   and RMAP manipulation, less LRU list churn, etc. In short, we reduce kernel
   overhead. This should benefit all architectures.
2) Since we are now mapping physically contiguous chunks of memory, we can take
   advantage of HW TLB compression techniques. A reduction in TLB pressure
   speeds up both kernel and user space. arm64 systems have 2 mechanisms to
   coalesce TLB entries: "the contiguous bit" (architectural) and HPA
   (micro-architectural).

The major change in this revision is the addition of sysfs controls to allow
this "small-order THP" to be enabled/disabled/configured independently of
PMD-order THP. The approach I've taken differs a bit from previous discussions;
instead of creating a whole new interface ("large_folio"), I'm extending THP. I
personally think this makes things clearer and more extensible. See [6] for
detailed rationale.

Because we now have runtime enable/disable control, I've removed the compile
time Kconfig switch. It still defaults to runtime-disabled.
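
As a rough illustration of how the new runtime control might be driven: the
sketch below assumes the anon_orders file from patch 4 is a bitmask of enabled
folio orders (bit N enables order-N); the exact path, semantics and values are
assumptions, not a definitive description of the interface.

    # Show which anon folio orders are currently enabled (assumed bitmask).
    cat /sys/kernel/mm/transparent_hugepage/anon_orders

    # Enable 16k, 32k and 64k anon folios (orders 2-4) on a 4k-page kernel,
    # i.e. the "4k-page-64k-folio" configuration used in the tables below.
    echo 0x1c > /sys/kernel/mm/transparent_hugepage/anon_orders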

NOTE: These changes should not be merged until the prerequisites are complete.
These are in progress and tracked at [7].

This series is based on mm-hotfixes-unstable (f9911db48293).


Testing
=======

This version adds patches to mm selftests so that the cow tests explicitly test
small-order THP, in the same way that PMD-order THP is tested. The new tests all
pass, and no regressions are observed in the mm selftest suite.
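
For reference, the tests can be built and run from the kernel tree roughly as
follows (the invocation is a sketch; some sub-tests need root and huge pages
configured):

    # Build the mm selftests (includes the cow test binary).
    make -C tools/testing/selftests/mm

    # Run the cow tests directly.
    cd tools/testing/selftests/mm && sudo ./cow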


Performance
===========

The tables below show performance and memory data for selected workloads, with
different small-order THPs enabled. All configs are compared to a 4k-page kernel
with small-order THP disabled. 16k and 64k page-size kernels (with small-order
THP disabled) are included to aid the comparison. All kernels were built from
the same source: mm-hotfixes-unstable (f9911db48293) + this series.

4k-page-16k-folio: 16k (order-2) THP enabled
4k-page-32k-folio: 32k+16k (order-3, order-2) THP enabled
4k-page-64k-folio: 64k+32k+16k (order-4, order-3, order-2) THP enabled

Test setup:
  - Ampere Altra with 1 NUMA node enabled; Ubuntu 22.04; XFS filesystem
  - 20 repeats across 5 reboots (with 1 warmup run after each reboot)
  - each workload run in its own cgroup; memory.peak read after completion
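
A minimal cgroup-v2 sketch of the measurement method above (assuming cgroup2
is mounted at /sys/fs/cgroup; the cgroup name is arbitrary):

    # Create a fresh cgroup, move this shell into it, run the workload,
    # then read the peak memory usage recorded by the kernel.
    mkdir /sys/fs/cgroup/bench
    echo $$ > /sys/fs/cgroup/bench/cgroup.procs
    make defconfig && make -s -j8 Image
    cat /sys/fs/cgroup/bench/memory.peak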


Kernel Compilation with 8 jobs: (make defconfig && make -s -j8 Image)
(smaller is better):

| kernel            |   real-time |   kern-time |   user-time | memory |
|:------------------|------------:|------------:|------------:|-------:|
| baseline-4k-page  |        0.0% |        0.0% |        0.0% |   0.0% |
| 16k-page          |       -9.0% |      -49.7% |       -4.0% |   6.2% |
| 64k-page          |      -11.9% |      -66.5% |       -5.0% |  28.3% |
| 4k-page-16k-folio |       -2.8% |      -23.0% |       -0.3% |   0.0% |
| 4k-page-32k-folio |       -4.0% |      -32.0% |       -0.6% |   0.1% |
| 4k-page-64k-folio |       -4.6% |      -37.9% |       -0.5% |   0.1% |


Kernel Compilation with 80 jobs: (make defconfig && make -s -j80 Image)
(smaller is better):

| kernel            |   real-time |   kern-time |   user-time | memory |
|:------------------|------------:|------------:|------------:|-------:|
| baseline-4k-page  |        0.0% |        0.0% |        0.0% |   0.0% |
| 16k-page          |       -9.2% |      -52.1% |       -3.6% |   4.6% |
| 64k-page          |      -11.4% |      -66.4% |       -3.0% |  12.6% |
| 4k-page-16k-folio |       -3.2% |      -22.8% |       -0.3% |   2.7% |
| 4k-page-32k-folio |       -4.8% |      -37.1% |       -0.5% |   2.9% |
| 4k-page-64k-folio |       -5.0% |      -42.1% |       -0.3% |   3.4% |


Speedometer 2.0: Running on Chromium automated with Selenium
(bigger is better for runs_per_min, smaller is better for memory):

| kernel            |   runs_per_min | memory |
|:------------------|---------------:|-------:|
| baseline-4k-page  |           0.0% |   0.0% |
| 16k-page          |           5.9% |  10.6% |
| 4k-page-16k-folio |           1.0% |  -0.6% |
| 4k-page-32k-folio |           1.3% |   3.5% |
| 4k-page-64k-folio |           1.3% |   6.4% |


Changes since v5 [5]
====================

  - Added accounting for PTE-mapped THPs (patch 3)
  - Added runtime control mechanism via sysfs as extension to THP (patch 4)
  - Minor refactoring of alloc_anon_folio() to integrate with runtime controls
  - Stripped out hardcoded policy for allocation order; it's now all user space
    controlled (although user space can request "recommend", which will
    configure the HW-preferred order)


Changes since v4 [4]
====================

  - Removed "arm64: mm: Override arch_wants_pte_order()" patch; arm64
    now uses the default order-3 size. I have moved this patch over to
    the contpte series.
  - Added "mm: Allow deferred splitting of arbitrary large anon folios" back
    into series. I originally removed this at v2 to add to a separate series,
    but that series has transformed significantly and it no longer fits, so
    bringing it back here.
  - Reintroduced dependency on set_ptes(); originally dropped this at v2, but
    set_ptes() is in mm-unstable now.
  - Updated policy for when to allocate LAF; only fall back to order-0 if
    MADV_NOHUGEPAGE is present or if THP is disabled via prctl; no longer rely
    on sysfs's never/madvise/always knob.
  - Fall back to order-0 whenever uffd is armed for the vma, not just when
    uffd-wp is set on the pte.
  - alloc_anon_folio() now returns `struct folio *`, where errors are encoded
    with ERR_PTR().

  The last 3 changes were proposed by Yu Zhao - thanks!


Changes since v3 [3]
====================

  - Renamed feature from FLEXIBLE_THP to LARGE_ANON_FOLIO.
  - Removed `flexthp_unhinted_max` boot parameter. Discussion concluded that a
    sysctl is preferable, but we will wait until a real workload needs it.
  - Fixed uninitialized `addr` on read fault path in do_anonymous_page().
  - Added mm selftests for large anon folios in cow test suite.


Changes since v2 [2]
====================

  - Dropped commit "Allow deferred splitting of arbitrary large anon folios"
      - Huang, Ying suggested the "batch zap" work (which I dropped from this
        series after v1) is a prerequisite for merging FLEXIBLE_THP, so I've
        moved the deferred split patch to a separate series along with the batch
        zap changes. I plan to submit this series early next week.
  - Changed folio order fallback policy
      - We no longer iterate from the preferred order down to 0 looking for an
        acceptable order
      - Instead we iterate through preferred, PAGE_ALLOC_COSTLY_ORDER and 0 only
  - Removed vma parameter from arch_wants_pte_order()
  - Added command line parameter `flexthp_unhinted_max`
      - clamps the preferred order when the vma hasn't explicitly opted in to THP
  - Never allocate large folio for MADV_NOHUGEPAGE vma (or when THP is disabled
    for process or system).
  - Simplified implementation and integration with do_anonymous_page()
  - Removed dependency on set_ptes()


Changes since v1 [1]
====================

  - removed changes to arch-dependent vma_alloc_zeroed_movable_folio()
  - replaced with arch-independent alloc_anon_folio()
      - follows THP allocation approach
  - no longer retry with intermediate orders if allocation fails
      - fall back directly to order-0
  - remove folio_add_new_anon_rmap_range() patch
      - instead add its new functionality to folio_add_new_anon_rmap()
  - remove batch-zap pte mappings optimization patch
      - remove enabler folio_remove_rmap_range() patch too
      - these offer a real perf improvement, so I will submit them separately
  - simplify Kconfig
      - single FLEXIBLE_THP option, which is independent of arch
      - depends on TRANSPARENT_HUGEPAGE
      - when enabled, default to a max anon folio size of 64K unless the arch
        explicitly overrides it
  - simplify changes to do_anonymous_page():
      - no more retry loop


[1] https://lore.kernel.org/linux-mm/20230626171430.3167004-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/20230703135330.1865927-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/linux-mm/20230714160407.4142030-1-ryan.roberts@arm.com/
[4] https://lore.kernel.org/linux-mm/20230726095146.2826796-1-ryan.roberts@arm.com/
[5] https://lore.kernel.org/linux-mm/20230810142942.3169679-1-ryan.roberts@arm.com/
[6] https://lore.kernel.org/linux-mm/1b03f4d6-634d-4786-81a0-5a104799b125@arm.com/
[7] https://lore.kernel.org/linux-mm/f8d47176-03a8-99bf-a813-b5942830fd73@arm.com/


Thanks,
Ryan


Ryan Roberts (9):
  mm: Allow deferred splitting of arbitrary anon large folios
  mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
  mm: thp: Account pte-mapped anonymous THP usage
  mm: thp: Introduce anon_orders and anon_always_mask sysfs files
  mm: thp: Extend THP to allocate anonymous large folios
  mm: thp: Add "recommend" option for anon_orders
  arm64/mm: Override arch_wants_pte_order()
  selftests/mm/cow: Generalize do_run_with_thp() helper
  selftests/mm/cow: Add tests for small-order anon THP

 Documentation/ABI/testing/procfs-smaps_rollup |   1 +
 .../admin-guide/cgroup-v1/memory.rst          |   5 +-
 Documentation/admin-guide/cgroup-v2.rst       |   6 +-
 Documentation/admin-guide/mm/transhuge.rst    |  96 ++++++-
 Documentation/filesystems/proc.rst            |  20 +-
 arch/arm64/include/asm/pgtable.h              |  10 +
 drivers/base/node.c                           |   2 +
 fs/proc/meminfo.c                             |   2 +
 fs/proc/task_mmu.c                            |   7 +-
 include/linux/huge_mm.h                       |  95 +++++--
 include/linux/mmzone.h                        |   1 +
 include/linux/pgtable.h                       |  13 +
 mm/huge_memory.c                              | 172 ++++++++++--
 mm/khugepaged.c                               |  18 +-
 mm/memcontrol.c                               |   8 +
 mm/memory.c                                   | 114 +++++++-
 mm/page_vma_mapped.c                          |   3 +-
 mm/rmap.c                                     |  42 ++-
 mm/show_mem.c                                 |   2 +
 mm/vmstat.c                                   |   1 +
 tools/testing/selftests/mm/cow.c              | 244 +++++++++++++-----
 21 files changed, 696 insertions(+), 166 deletions(-)

--
2.25.1


^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH v6 1/9] mm: Allow deferred splitting of arbitrary anon large folios
  2023-09-29 11:44 ` Ryan Roberts
@ 2023-09-29 11:44   ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-09-29 11:44 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: Ryan Roberts, linux-mm, linux-kernel, linux-arm-kernel

In preparation for the introduction of large folios for anonymous
memory, we would like to be able to split them when they have unmapped
subpages, in order to free those unused pages under memory pressure. So
remove the artificial requirement that the large folio be at least
PMD-sized.

Reviewed-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/rmap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 9f795b93cf40..8600bd029acf 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1446,11 +1446,11 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
 		__lruvec_stat_mod_folio(folio, idx, -nr);
 
 		/*
-		 * Queue anon THP for deferred split if at least one
+		 * Queue anon large folio for deferred split if at least one
 		 * page of the folio is unmapped and at least one page
 		 * is still mapped.
 		 */
-		if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
+		if (folio_test_large(folio) && folio_test_anon(folio))
 			if (!compound || nr < nr_pmdmapped)
 				deferred_split_folio(folio);
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH v6 2/9] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
  2023-09-29 11:44 ` Ryan Roberts
@ 2023-09-29 11:44   ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-09-29 11:44 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: Ryan Roberts, linux-mm, linux-kernel, linux-arm-kernel

In preparation for anonymous large folio support, improve
folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be
passed to it. In this case, all contained pages are accounted using the
order-0 folio (or base page) scheme.

Reviewed-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/rmap.c | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 8600bd029acf..106149690366 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1266,31 +1266,44 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
  * This means the inc-and-test can be bypassed.
  * The folio does not have to be locked.
  *
- * If the folio is large, it is accounted as a THP.  As the folio
+ * If the folio is pmd-mappable, it is accounted as a THP.  As the folio
  * is new, it's assumed to be mapped exclusively by a single process.
  */
 void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 		unsigned long address)
 {
-	int nr;
+	int nr = folio_nr_pages(folio);
 
-	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+	VM_BUG_ON_VMA(address < vma->vm_start ||
+			address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
 	__folio_set_swapbacked(folio);
 
-	if (likely(!folio_test_pmd_mappable(folio))) {
+	if (likely(!folio_test_large(folio))) {
 		/* increment count (starts at -1) */
 		atomic_set(&folio->_mapcount, 0);
-		nr = 1;
+		__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
+	} else if (!folio_test_pmd_mappable(folio)) {
+		int i;
+
+		for (i = 0; i < nr; i++) {
+			struct page *page = folio_page(folio, i);
+
+			/* increment count (starts at -1) */
+			atomic_set(&page->_mapcount, 0);
+			__page_set_anon_rmap(folio, page, vma,
+					address + (i << PAGE_SHIFT), 1);
+		}
+
+		atomic_set(&folio->_nr_pages_mapped, nr);
 	} else {
 		/* increment count (starts at -1) */
 		atomic_set(&folio->_entire_mapcount, 0);
 		atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);
-		nr = folio_nr_pages(folio);
+		__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
 		__lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
 	}
 
 	__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
-	__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
 }
 
 /**
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH v6 3/9] mm: thp: Account pte-mapped anonymous THP usage
  2023-09-29 11:44 ` Ryan Roberts
@ 2023-09-29 11:44   ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-09-29 11:44 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: Ryan Roberts, linux-mm, linux-kernel, linux-arm-kernel

Add accounting for pte-mapped anonymous transparent hugepages at various
locations. This visibility will aid in debugging and tuning performance
for the "small order" thp extension that will be added in a subsequent
commit, where hugepages can be allocated which are large (greater than
order-0) but smaller than PMD_ORDER. This new accounting follows a
similar pattern to the existing NR_ANON_THPS, which measures pmd-mapped
anonymous transparent hugepages.

We account pte-mapped anonymous thp mappings per-page: a page is counted
when it is mapped at least once via PTE and belongs to a large folio. So
when a page belonging to a large folio is PTE-mapped for the first time,
we add 1 to NR_ANON_THPS_PTEMAPPED, and when a page belonging to a large
folio is PTE-unmapped for the last time, we subtract 1 from
NR_ANON_THPS_PTEMAPPED.

/proc/meminfo:
  Introduce new "AnonHugePteMap" field, which reports the amount of
  memory (in KiB) mapped from large folios globally (similar to
  AnonHugePages field).

/proc/vmstat:
  Introduce new "nr_anon_thp_pte" field, which reports the amount of
  memory (in pages) mapped from large folios globally (similar to
  nr_anon_transparent_hugepages field).

/sys/devices/system/node/nodeX/meminfo
  Introduce new "AnonHugePteMap" field, which reports the amount of
  memory (in KiB) mapped from large folios per-node (similar to
  AnonHugePages field).

show_mem (panic logger):
  Introduce new "anon_thp_pte" field, which reports the amount of memory
  (in KiB) mapped from large folios per-node (similar to anon_thp
  field).

memory.stat (cgroup v1 and v2):
  Introduce new "anon_thp_pte" field, which reports the amount of memory
  (in bytes) mapped from large folios in the memcg (similar to rss_huge
  (v1) / anon_thp (v2) fields).

/proc/<pid>/smaps & /proc/<pid>/smaps_rollup:
  Introduce new "AnonHugePteMap" field, which reports the amount of
  memory (in KiB) mapped from large folios within the vma/process
  (similar to AnonHugePages field).
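
Once the patch is applied, the new counters can be read with commands along
these lines (illustrative; <pid> is a placeholder):

    # System-wide view: KiB in /proc/meminfo, pages in /proc/vmstat.
    grep AnonHugePteMap /proc/meminfo
    grep nr_anon_thp_pte /proc/vmstat

    # Per-process view for a given <pid>.
    grep AnonHugePteMap /proc/<pid>/smaps_rollup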

NOTE on charge migration: The new NR_ANON_THPS_PTEMAPPED charge is NOT
moved between cgroups, even when the (v1)
memory.move_charge_at_immigrate feature is enabled. That feature is
marked deprecated and the current code does not attempt to move the
NR_ANON_MAPPED charge for large PTE-mapped folios anyway (see comment in
mem_cgroup_move_charge_pte_range()). If this code was enhanced to allow
moving the NR_ANON_MAPPED charge for large PTE-mapped folios, we would
also need to add support for moving the new NR_ANON_THPS_PTEMAPPED
charge. This would likely get quite fiddly. Given the deprecation of
memory.move_charge_at_immigrate, I assume it is not valuable to
implement.

NOTE on naming: Given the new small-order anonymous thp feature will be
exposed to user space as an extension to thp, I've opted to name the new
counters after thp as well (as opposed to "large"/"large folio"/etc.), so
"huge" no longer strictly means PMD; one could argue hugetlb already
breaks this rule anyway. I also did not want to risk breaking backwards
compatibility by renaming/redefining the existing counters (which would
have resulted in more consistent and clearer names). So the existing
NR_ANON_THPS counters remain and continue to refer only to PMD-mapped
THPs, and I've added new counters, which refer only to PTE-mapped THPs.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 Documentation/ABI/testing/procfs-smaps_rollup  |  1 +
 Documentation/admin-guide/cgroup-v1/memory.rst |  5 ++++-
 Documentation/admin-guide/cgroup-v2.rst        |  6 +++++-
 Documentation/admin-guide/mm/transhuge.rst     | 11 +++++++----
 Documentation/filesystems/proc.rst             | 14 ++++++++++++--
 drivers/base/node.c                            |  2 ++
 fs/proc/meminfo.c                              |  2 ++
 fs/proc/task_mmu.c                             |  4 ++++
 include/linux/mmzone.h                         |  1 +
 mm/memcontrol.c                                |  8 ++++++++
 mm/rmap.c                                      | 11 +++++++++--
 mm/show_mem.c                                  |  2 ++
 mm/vmstat.c                                    |  1 +
 13 files changed, 58 insertions(+), 10 deletions(-)

diff --git a/Documentation/ABI/testing/procfs-smaps_rollup b/Documentation/ABI/testing/procfs-smaps_rollup
index b446a7154a1b..b50b3eda5a3f 100644
--- a/Documentation/ABI/testing/procfs-smaps_rollup
+++ b/Documentation/ABI/testing/procfs-smaps_rollup
@@ -34,6 +34,7 @@ Description:
 			Anonymous:	      68 kB
 			LazyFree:	       0 kB
 			AnonHugePages:	       0 kB
+			AnonHugePteMap:        0 kB
 			ShmemPmdMapped:	       0 kB
 			Shared_Hugetlb:	       0 kB
 			Private_Hugetlb:       0 kB
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 5f502bf68fbc..b7efc7531896 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -535,7 +535,10 @@ memory.stat file includes following statistics:
     cache           # of bytes of page cache memory.
     rss             # of bytes of anonymous and swap cache memory (includes
                     transparent hugepages).
-    rss_huge        # of bytes of anonymous transparent hugepages.
+    rss_huge        # of bytes of anonymous transparent hugepages, mapped by
+                    PMD.
+    anon_thp_pte    # of bytes of anonymous transparent hugepages, mapped by
+                    PTE.
     mapped_file     # of bytes of mapped file (includes tmpfs/shmem)
     pgpgin          # of charging events to the memory cgroup. The charging
                     event happens each time a page is accounted as either mapped
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index b26b5274eaaf..48b961b8fc6d 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1421,7 +1421,11 @@ PAGE_SIZE multiple when read back.
 
 	  anon_thp
 		Amount of memory used in anonymous mappings backed by
-		transparent hugepages
+		transparent hugepages, mapped by PMD
+
+	  anon_thp_pte
+		Amount of memory used in anonymous mappings backed by
+		transparent hugepages, mapped by PTE
 
 	  file_thp
 		Amount of cached filesystem data backed by transparent
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index b0cc8243e093..ebda57850643 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -291,10 +291,13 @@ Monitoring usage
 ================
 
 The number of anonymous transparent huge pages currently used by the
-system is available by reading the AnonHugePages field in ``/proc/meminfo``.
-To identify what applications are using anonymous transparent huge pages,
-it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields
-for each mapping.
+system is available by reading the AnonHugePages and AnonHugePteMap
+fields in ``/proc/meminfo``. To identify what applications are using
+anonymous transparent huge pages, it is necessary to read
+``/proc/PID/smaps`` and count the AnonHugePages and AnonHugePteMap
+fields for each mapping. Note that in both cases, AnonHugePages refers
+only to PMD-mapped THPs. AnonHugePteMap refers to THPs that are mapped
+using PTEs.
 
 The number of file transparent huge pages mapped to userspace is available
 by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 2b59cff8be17..ccbb76a509f0 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -464,6 +464,7 @@ Memory Area, or VMA) there is a series of lines such as the following::
     KSM:                   0 kB
     LazyFree:              0 kB
     AnonHugePages:         0 kB
+    AnonHugePteMap:        0 kB
     ShmemPmdMapped:        0 kB
     Shared_Hugetlb:        0 kB
     Private_Hugetlb:       0 kB
@@ -511,7 +512,11 @@ pressure if the memory is clean. Please note that the printed value might
 be lower than the real value due to optimizations used in the current
 implementation. If this is not desirable please file a bug report.
 
-"AnonHugePages" shows the amount of memory backed by transparent hugepage.
+"AnonHugePages" shows the amount of memory backed by transparent hugepage,
+mapped by PMD.
+
+"AnonHugePteMap" shows the amount of memory backed by transparent hugepage,
+mapped by PTE.
 
 "ShmemPmdMapped" shows the amount of shared (shmem/tmpfs) memory backed by
 huge pages.
@@ -1006,6 +1011,7 @@ Example output. You may not have all of these fields.
     EarlyMemtestBad:       0 kB
     HardwareCorrupted:     0 kB
     AnonHugePages:   4149248 kB
+    AnonHugePteMap:        0 kB
     ShmemHugePages:        0 kB
     ShmemPmdMapped:        0 kB
     FileHugePages:         0 kB
@@ -1165,7 +1171,11 @@ HardwareCorrupted
               The amount of RAM/memory in KB, the kernel identifies as
               corrupted.
 AnonHugePages
-              Non-file backed huge pages mapped into userspace page tables
+              Non-file backed huge pages mapped into userspace page tables by
+              PMD
+AnonHugePteMap
+              Non-file backed huge pages mapped into userspace page tables by
+              PTE
 ShmemHugePages
               Memory used by shared memory (shmem) and tmpfs allocated
               with huge pages
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 493d533f8375..08f1759387d2 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -443,6 +443,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 			     "Node %d SUnreclaim:     %8lu kB\n"
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 			     "Node %d AnonHugePages:  %8lu kB\n"
+			     "Node %d AnonHugePteMap: %8lu kB\n"
 			     "Node %d ShmemHugePages: %8lu kB\n"
 			     "Node %d ShmemPmdMapped: %8lu kB\n"
 			     "Node %d FileHugePages:  %8lu kB\n"
@@ -475,6 +476,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 			     ,
 			     nid, K(node_page_state(pgdat, NR_ANON_THPS)),
+			     nid, K(node_page_state(pgdat, NR_ANON_THPS_PTEMAPPED)),
 			     nid, K(node_page_state(pgdat, NR_SHMEM_THPS)),
 			     nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)),
 			     nid, K(node_page_state(pgdat, NR_FILE_THPS)),
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 45af9a989d40..bac20cc60b6a 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -143,6 +143,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	show_val_kb(m, "AnonHugePages:  ",
 		    global_node_page_state(NR_ANON_THPS));
+	show_val_kb(m, "AnonHugePteMap: ",
+		    global_node_page_state(NR_ANON_THPS_PTEMAPPED));
 	show_val_kb(m, "ShmemHugePages: ",
 		    global_node_page_state(NR_SHMEM_THPS));
 	show_val_kb(m, "ShmemPmdMapped: ",
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3dd5be96691b..7b5dad163533 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -392,6 +392,7 @@ struct mem_size_stats {
 	unsigned long anonymous;
 	unsigned long lazyfree;
 	unsigned long anonymous_thp;
+	unsigned long anonymous_thp_pte;
 	unsigned long shmem_thp;
 	unsigned long file_thp;
 	unsigned long swap;
@@ -452,6 +453,8 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
 		mss->anonymous += size;
 		if (!PageSwapBacked(page) && !dirty && !PageDirty(page))
 			mss->lazyfree += size;
+		if (!compound && PageTransCompound(page))
+			mss->anonymous_thp_pte += size;
 	}
 
 	if (PageKsm(page))
@@ -833,6 +836,7 @@ static void __show_smap(struct seq_file *m, const struct mem_size_stats *mss,
 	SEQ_PUT_DEC(" kB\nKSM:            ", mss->ksm);
 	SEQ_PUT_DEC(" kB\nLazyFree:       ", mss->lazyfree);
 	SEQ_PUT_DEC(" kB\nAnonHugePages:  ", mss->anonymous_thp);
+	SEQ_PUT_DEC(" kB\nAnonHugePteMap: ", mss->anonymous_thp_pte);
 	SEQ_PUT_DEC(" kB\nShmemPmdMapped: ", mss->shmem_thp);
 	SEQ_PUT_DEC(" kB\nFilePmdMapped:  ", mss->file_thp);
 	SEQ_PUT_DEC(" kB\nShared_Hugetlb: ", mss->shared_hugetlb);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4106fbc5b4b3..5032fc31c651 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -186,6 +186,7 @@ enum node_stat_item {
 	NR_FILE_THPS,
 	NR_FILE_PMDMAPPED,
 	NR_ANON_THPS,
+	NR_ANON_THPS_PTEMAPPED,
 	NR_VMSCAN_WRITE,
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
 	NR_DIRTIED,		/* page dirtyings since bootup */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d13dde2f8b56..07d8e0b55b0e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -809,6 +809,7 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 		case NR_ANON_MAPPED:
 		case NR_FILE_MAPPED:
 		case NR_ANON_THPS:
+		case NR_ANON_THPS_PTEMAPPED:
 		case NR_SHMEM_PMDMAPPED:
 		case NR_FILE_PMDMAPPED:
 			WARN_ON_ONCE(!in_task());
@@ -1512,6 +1513,7 @@ static const struct memory_stat memory_stats[] = {
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	{ "anon_thp",			NR_ANON_THPS			},
+	{ "anon_thp_pte",		NR_ANON_THPS_PTEMAPPED		},
 	{ "file_thp",			NR_FILE_THPS			},
 	{ "shmem_thp",			NR_SHMEM_THPS			},
 #endif
@@ -4052,6 +4054,7 @@ static const unsigned int memcg1_stats[] = {
 	NR_ANON_MAPPED,
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	NR_ANON_THPS,
+	NR_ANON_THPS_PTEMAPPED,
 #endif
 	NR_SHMEM,
 	NR_FILE_MAPPED,
@@ -4067,6 +4070,7 @@ static const char *const memcg1_stat_names[] = {
 	"rss",
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	"rss_huge",
+	"anon_thp_pte",
 #endif
 	"shmem",
 	"mapped_file",
@@ -6259,6 +6263,10 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 			 * can be done but it would be too convoluted so simply
 			 * ignore such a partial THP and keep it in original
 			 * memcg. There should be somebody mapping the head.
+			 * This simplification also means that pte-mapped large
+			 * folios are never migrated, which means we don't need
+			 * to worry about migrating the NR_ANON_THPS_PTEMAPPED
+			 * accounting.
 			 */
 			if (PageTransCompound(page))
 				goto put;
diff --git a/mm/rmap.c b/mm/rmap.c
index 106149690366..52dabee73023 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1205,7 +1205,7 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
 {
 	struct folio *folio = page_folio(page);
 	atomic_t *mapped = &folio->_nr_pages_mapped;
-	int nr = 0, nr_pmdmapped = 0;
+	int nr = 0, nr_pmdmapped = 0, nr_lgmapped = 0;
 	bool compound = flags & RMAP_COMPOUND;
 	bool first = true;
 
@@ -1214,6 +1214,7 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
 		first = atomic_inc_and_test(&page->_mapcount);
 		nr = first;
 		if (first && folio_test_large(folio)) {
+			nr_lgmapped = 1;
 			nr = atomic_inc_return_relaxed(mapped);
 			nr = (nr < COMPOUND_MAPPED);
 		}
@@ -1241,6 +1242,8 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
 
 	if (nr_pmdmapped)
 		__lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr_pmdmapped);
+	if (nr_lgmapped)
+		__lruvec_stat_mod_folio(folio, NR_ANON_THPS_PTEMAPPED, nr_lgmapped);
 	if (nr)
 		__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
 
@@ -1295,6 +1298,7 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 		}
 
 		atomic_set(&folio->_nr_pages_mapped, nr);
+		__lruvec_stat_mod_folio(folio, NR_ANON_THPS_PTEMAPPED, nr);
 	} else {
 		/* increment count (starts at -1) */
 		atomic_set(&folio->_entire_mapcount, 0);
@@ -1405,7 +1409,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
 {
 	struct folio *folio = page_folio(page);
 	atomic_t *mapped = &folio->_nr_pages_mapped;
-	int nr = 0, nr_pmdmapped = 0;
+	int nr = 0, nr_pmdmapped = 0, nr_lgmapped = 0;
 	bool last;
 	enum node_stat_item idx;
 
@@ -1423,6 +1427,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
 		last = atomic_add_negative(-1, &page->_mapcount);
 		nr = last;
 		if (last && folio_test_large(folio)) {
+			nr_lgmapped = 1;
 			nr = atomic_dec_return_relaxed(mapped);
 			nr = (nr < COMPOUND_MAPPED);
 		}
@@ -1454,6 +1459,8 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
 			idx = NR_FILE_PMDMAPPED;
 		__lruvec_stat_mod_folio(folio, idx, -nr_pmdmapped);
 	}
+	if (nr_lgmapped && folio_test_anon(folio))
+		__lruvec_stat_mod_folio(folio, NR_ANON_THPS_PTEMAPPED, -nr_lgmapped);
 	if (nr) {
 		idx = folio_test_anon(folio) ? NR_ANON_MAPPED : NR_FILE_MAPPED;
 		__lruvec_stat_mod_folio(folio, idx, -nr);
diff --git a/mm/show_mem.c b/mm/show_mem.c
index 4b888b18bdde..e648a815f0fb 100644
--- a/mm/show_mem.c
+++ b/mm/show_mem.c
@@ -254,6 +254,7 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z
 			" shmem_thp:%lukB"
 			" shmem_pmdmapped:%lukB"
 			" anon_thp:%lukB"
+			" anon_thp_pte:%lukB"
 #endif
 			" writeback_tmp:%lukB"
 			" kernel_stack:%lukB"
@@ -280,6 +281,7 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z
 			K(node_page_state(pgdat, NR_SHMEM_THPS)),
 			K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)),
 			K(node_page_state(pgdat, NR_ANON_THPS)),
+			K(node_page_state(pgdat, NR_ANON_THPS_PTEMAPPED)),
 #endif
 			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
 			node_page_state(pgdat, NR_KERNEL_STACK_KB),
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 00e81e99c6ee..267de0e4ddca 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1224,6 +1224,7 @@ const char * const vmstat_text[] = {
 	"nr_file_hugepages",
 	"nr_file_pmdmapped",
 	"nr_anon_transparent_hugepages",
+	"nr_anon_thp_pte",
 	"nr_vmscan_write",
 	"nr_vmscan_immediate_reclaim",
 	"nr_dirtied",
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 140+ messages in thread

+mapped by PMD.
+
+"AnonHugePteMap" shows the amount of memory backed by transparent hugepage,
+mapped by PTE.
 
 "ShmemPmdMapped" shows the amount of shared (shmem/tmpfs) memory backed by
 huge pages.
@@ -1006,6 +1011,7 @@ Example output. You may not have all of these fields.
     EarlyMemtestBad:       0 kB
     HardwareCorrupted:     0 kB
     AnonHugePages:   4149248 kB
+    AnonHugePteMap:        0 kB
     ShmemHugePages:        0 kB
     ShmemPmdMapped:        0 kB
     FileHugePages:         0 kB
@@ -1165,7 +1171,11 @@ HardwareCorrupted
               The amount of RAM/memory in KB, the kernel identifies as
               corrupted.
 AnonHugePages
-              Non-file backed huge pages mapped into userspace page tables
+              Non-file backed huge pages mapped into userspace page tables by
+              PMD
+AnonHugePteMap
+              Non-file backed huge pages mapped into userspace page tables by
+              PTE
 ShmemHugePages
               Memory used by shared memory (shmem) and tmpfs allocated
               with huge pages
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 493d533f8375..08f1759387d2 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -443,6 +443,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 			     "Node %d SUnreclaim:     %8lu kB\n"
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 			     "Node %d AnonHugePages:  %8lu kB\n"
+			     "Node %d AnonHugePteMap: %8lu kB\n"
 			     "Node %d ShmemHugePages: %8lu kB\n"
 			     "Node %d ShmemPmdMapped: %8lu kB\n"
 			     "Node %d FileHugePages:  %8lu kB\n"
@@ -475,6 +476,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 			     ,
 			     nid, K(node_page_state(pgdat, NR_ANON_THPS)),
+			     nid, K(node_page_state(pgdat, NR_ANON_THPS_PTEMAPPED)),
 			     nid, K(node_page_state(pgdat, NR_SHMEM_THPS)),
 			     nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)),
 			     nid, K(node_page_state(pgdat, NR_FILE_THPS)),
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 45af9a989d40..bac20cc60b6a 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -143,6 +143,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	show_val_kb(m, "AnonHugePages:  ",
 		    global_node_page_state(NR_ANON_THPS));
+	show_val_kb(m, "AnonHugePteMap: ",
+		    global_node_page_state(NR_ANON_THPS_PTEMAPPED));
 	show_val_kb(m, "ShmemHugePages: ",
 		    global_node_page_state(NR_SHMEM_THPS));
 	show_val_kb(m, "ShmemPmdMapped: ",
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3dd5be96691b..7b5dad163533 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -392,6 +392,7 @@ struct mem_size_stats {
 	unsigned long anonymous;
 	unsigned long lazyfree;
 	unsigned long anonymous_thp;
+	unsigned long anonymous_thp_pte;
 	unsigned long shmem_thp;
 	unsigned long file_thp;
 	unsigned long swap;
@@ -452,6 +453,8 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
 		mss->anonymous += size;
 		if (!PageSwapBacked(page) && !dirty && !PageDirty(page))
 			mss->lazyfree += size;
+		if (!compound && PageTransCompound(page))
+			mss->anonymous_thp_pte += size;
 	}
 
 	if (PageKsm(page))
@@ -833,6 +836,7 @@ static void __show_smap(struct seq_file *m, const struct mem_size_stats *mss,
 	SEQ_PUT_DEC(" kB\nKSM:            ", mss->ksm);
 	SEQ_PUT_DEC(" kB\nLazyFree:       ", mss->lazyfree);
 	SEQ_PUT_DEC(" kB\nAnonHugePages:  ", mss->anonymous_thp);
+	SEQ_PUT_DEC(" kB\nAnonHugePteMap: ", mss->anonymous_thp_pte);
 	SEQ_PUT_DEC(" kB\nShmemPmdMapped: ", mss->shmem_thp);
 	SEQ_PUT_DEC(" kB\nFilePmdMapped:  ", mss->file_thp);
 	SEQ_PUT_DEC(" kB\nShared_Hugetlb: ", mss->shared_hugetlb);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4106fbc5b4b3..5032fc31c651 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -186,6 +186,7 @@ enum node_stat_item {
 	NR_FILE_THPS,
 	NR_FILE_PMDMAPPED,
 	NR_ANON_THPS,
+	NR_ANON_THPS_PTEMAPPED,
 	NR_VMSCAN_WRITE,
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
 	NR_DIRTIED,		/* page dirtyings since bootup */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d13dde2f8b56..07d8e0b55b0e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -809,6 +809,7 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 		case NR_ANON_MAPPED:
 		case NR_FILE_MAPPED:
 		case NR_ANON_THPS:
+		case NR_ANON_THPS_PTEMAPPED:
 		case NR_SHMEM_PMDMAPPED:
 		case NR_FILE_PMDMAPPED:
 			WARN_ON_ONCE(!in_task());
@@ -1512,6 +1513,7 @@ static const struct memory_stat memory_stats[] = {
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	{ "anon_thp",			NR_ANON_THPS			},
+	{ "anon_thp_pte",		NR_ANON_THPS_PTEMAPPED		},
 	{ "file_thp",			NR_FILE_THPS			},
 	{ "shmem_thp",			NR_SHMEM_THPS			},
 #endif
@@ -4052,6 +4054,7 @@ static const unsigned int memcg1_stats[] = {
 	NR_ANON_MAPPED,
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	NR_ANON_THPS,
+	NR_ANON_THPS_PTEMAPPED,
 #endif
 	NR_SHMEM,
 	NR_FILE_MAPPED,
@@ -4067,6 +4070,7 @@ static const char *const memcg1_stat_names[] = {
 	"rss",
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	"rss_huge",
+	"anon_thp_pte",
 #endif
 	"shmem",
 	"mapped_file",
@@ -6259,6 +6263,10 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 			 * can be done but it would be too convoluted so simply
 			 * ignore such a partial THP and keep it in original
 			 * memcg. There should be somebody mapping the head.
+			 * This simplification also means that pte-mapped large
+			 * folios are never migrated, which means we don't need
+			 * to worry about migrating the NR_ANON_THPS_PTEMAPPED
+			 * accounting.
 			 */
 			if (PageTransCompound(page))
 				goto put;
diff --git a/mm/rmap.c b/mm/rmap.c
index 106149690366..52dabee73023 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1205,7 +1205,7 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
 {
 	struct folio *folio = page_folio(page);
 	atomic_t *mapped = &folio->_nr_pages_mapped;
-	int nr = 0, nr_pmdmapped = 0;
+	int nr = 0, nr_pmdmapped = 0, nr_lgmapped = 0;
 	bool compound = flags & RMAP_COMPOUND;
 	bool first = true;
 
@@ -1214,6 +1214,7 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
 		first = atomic_inc_and_test(&page->_mapcount);
 		nr = first;
 		if (first && folio_test_large(folio)) {
+			nr_lgmapped = 1;
 			nr = atomic_inc_return_relaxed(mapped);
 			nr = (nr < COMPOUND_MAPPED);
 		}
@@ -1241,6 +1242,8 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
 
 	if (nr_pmdmapped)
 		__lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr_pmdmapped);
+	if (nr_lgmapped)
+		__lruvec_stat_mod_folio(folio, NR_ANON_THPS_PTEMAPPED, nr_lgmapped);
 	if (nr)
 		__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
 
@@ -1295,6 +1298,7 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 		}
 
 		atomic_set(&folio->_nr_pages_mapped, nr);
+		__lruvec_stat_mod_folio(folio, NR_ANON_THPS_PTEMAPPED, nr);
 	} else {
 		/* increment count (starts at -1) */
 		atomic_set(&folio->_entire_mapcount, 0);
@@ -1405,7 +1409,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
 {
 	struct folio *folio = page_folio(page);
 	atomic_t *mapped = &folio->_nr_pages_mapped;
-	int nr = 0, nr_pmdmapped = 0;
+	int nr = 0, nr_pmdmapped = 0, nr_lgmapped = 0;
 	bool last;
 	enum node_stat_item idx;
 
@@ -1423,6 +1427,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
 		last = atomic_add_negative(-1, &page->_mapcount);
 		nr = last;
 		if (last && folio_test_large(folio)) {
+			nr_lgmapped = 1;
 			nr = atomic_dec_return_relaxed(mapped);
 			nr = (nr < COMPOUND_MAPPED);
 		}
@@ -1454,6 +1459,8 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
 			idx = NR_FILE_PMDMAPPED;
 		__lruvec_stat_mod_folio(folio, idx, -nr_pmdmapped);
 	}
+	if (nr_lgmapped && folio_test_anon(folio))
+		__lruvec_stat_mod_folio(folio, NR_ANON_THPS_PTEMAPPED, -nr_lgmapped);
 	if (nr) {
 		idx = folio_test_anon(folio) ? NR_ANON_MAPPED : NR_FILE_MAPPED;
 		__lruvec_stat_mod_folio(folio, idx, -nr);
diff --git a/mm/show_mem.c b/mm/show_mem.c
index 4b888b18bdde..e648a815f0fb 100644
--- a/mm/show_mem.c
+++ b/mm/show_mem.c
@@ -254,6 +254,7 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z
 			" shmem_thp:%lukB"
 			" shmem_pmdmapped:%lukB"
 			" anon_thp:%lukB"
+			" anon_thp_pte:%lukB"
 #endif
 			" writeback_tmp:%lukB"
 			" kernel_stack:%lukB"
@@ -280,6 +281,7 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z
 			K(node_page_state(pgdat, NR_SHMEM_THPS)),
 			K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)),
 			K(node_page_state(pgdat, NR_ANON_THPS)),
+			K(node_page_state(pgdat, NR_ANON_THPS_PTEMAPPED)),
 #endif
 			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
 			node_page_state(pgdat, NR_KERNEL_STACK_KB),
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 00e81e99c6ee..267de0e4ddca 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1224,6 +1224,7 @@ const char * const vmstat_text[] = {
 	"nr_file_hugepages",
 	"nr_file_pmdmapped",
 	"nr_anon_transparent_hugepages",
+	"nr_anon_thp_pte",
 	"nr_vmscan_write",
 	"nr_vmscan_immediate_reclaim",
 	"nr_dirtied",
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH v6 4/9] mm: thp: Introduce anon_orders and anon_always_mask sysfs files
  2023-09-29 11:44 ` Ryan Roberts
@ 2023-09-29 11:44   ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-09-29 11:44 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: Ryan Roberts, linux-mm, linux-kernel, linux-arm-kernel

In preparation for adding support for anonymous large folios that are
smaller than the PMD-size, introduce 2 new sysfs files that will be used
to control the new behaviours via the transparent_hugepage interface.
For now, the kernel still only supports PMD-order anonymous THP, so when
reading back anon_orders, it will reflect that. Therefore there are no
behavioural changes intended here.

The bulk of the change is implemented by converting
transhuge_vma_suitable() and hugepage_vma_check() so that they take a
bitfield of orders for which the user wants to determine support, and
the functions filter out all the orders that can't be supported. If
there is only 1 order set in the input then the output can continue to
be treated like a boolean; this is the case for most call sites.
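
For example, a caller that passes a single order keeps treating the
result as a boolean; the khugepaged hunk later in this patch illustrates
the idiom::

	if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
	    hugepage_flags_enabled()) {
		if (hugepage_vma_check(vma, vm_flags, false, false, true,
				       BIT(PMD_ORDER)))
			__khugepaged_enter(vma->vm_mm);
	}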

The remainder is copied from Documentation/admin-guide/mm/transhuge.rst,
as modified by this commit. See that file for further details.

By default, allocation of anonymous THPs that are smaller than PMD-size
is disabled. These smaller allocation orders can be enabled by writing
an encoded set of orders as follows::

	echo 0x208 >/sys/kernel/mm/transparent_hugepage/anon_orders

Where an order refers to the number of pages in the large folio as
2^order, and where each order is encoded in the written value such that
each set bit represents an enabled order; So setting bit-2 indicates
that order-2 folios are in use, and order-2 means 2^2=4 pages (=16K if
the page size is 4K). The example above enables order-9 (PMD-order) and
order-3.
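
To make the encoding concrete, here is a small, self-contained decoder
(illustration only, assuming a 4K base page size) that prints the folio
sizes selected by a given mask::

	#include <stdio.h>

	int main(void)
	{
		unsigned int orders = 0x208;	/* bit-9 and bit-3 set */
		int order;

		/* With 4K base pages, an order-N folio is (4 << N) KiB. */
		for (order = 31; order >= 0; order--) {
			if (orders & (1u << order))
				printf("order-%d enabled: %lu KiB folios\n",
				       order, 4UL << order);
		}
		return 0;
	}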

By enabling multiple orders, allocation of each order will be attempted,
highest to lowest, until a successful allocation is made. If the
PMD-order is unset, then no PMD-sized THPs will be allocated.
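
The fault-path wiring that performs this walk lands in a later patch of
the series; purely as an illustration of the intended highest-to-lowest
fallback (a sketch using the first_order()/next_order() helpers added by
this patch, not code taken from it), the loop looks roughly like::

	/* Illustrative sketch only; accounting and error paths omitted. */
	static struct folio *alloc_anon_folio_sketch(struct vm_area_struct *vma,
						     unsigned long addr,
						     unsigned int orders)
	{
		int order = first_order(orders);

		while (orders) {
			unsigned long haddr = ALIGN_DOWN(addr, PAGE_SIZE << order);
			struct folio *folio;

			folio = vma_alloc_folio(GFP_TRANSHUGE, order, vma,
						haddr, true);
			if (folio)
				return folio;

			order = next_order(&orders, order);
		}

		return NULL;	/* caller falls back to a single base page */
	}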

The kernel will ignore any orders that it does not support so read the
file back to determine which orders are enabled::

	cat /sys/kernel/mm/transparent_hugepage/anon_orders

For some workloads it may be desirable to limit some THP orders to be
used only for MADV_HUGEPAGE regions, while allowing others to be used
always. For example, a workload may only benefit from PMD-sized THP in
specific areas, but can benefit from 32K-sized THP more generally. In
this case, THP can be enabled in ``madvise`` mode as normal, but
specific orders can be configured to be allocated as if in ``always``
mode. The below example enables orders 9 and 3, with order-9 only
applied to MADV_HUGEPAGE regions, and order-3 applied always::

	echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
	echo 0x208 >/sys/kernel/mm/transparent_hugepage/anon_orders
	echo 0x008 >/sys/kernel/mm/transparent_hugepage/anon_always_mask

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 Documentation/admin-guide/mm/transhuge.rst |  74 ++++++++--
 Documentation/filesystems/proc.rst         |   6 +-
 fs/proc/task_mmu.c                         |   3 +-
 include/linux/huge_mm.h                    |  93 +++++++++---
 mm/huge_memory.c                           | 164 ++++++++++++++++++---
 mm/khugepaged.c                            |  18 ++-
 mm/memory.c                                |   6 +-
 mm/page_vma_mapped.c                       |   3 +-
 8 files changed, 296 insertions(+), 71 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index ebda57850643..9f954e73a4ca 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -45,10 +45,22 @@ components:
    the two is using hugepages just because of the fact the TLB miss is
    going to run faster.
 
+Furthermore, it is possible to configure THP to allocate large folios
+to back anonymous memory, which are smaller than PMD-size (for example
+16K, 32K, 64K, etc). These THPs continue to be PTE-mapped, but in many
+cases can still provide similar benefits to those outlined above:
+Page faults are significantly reduced (by a factor of e.g. 4, 8, 16,
+etc), but latency spikes are much less prominent because the size of
+each page isn't as huge as the PMD-sized variant and there is less
+memory to clear in each page fault. Some architectures also employ TLB
+compression mechanisms to squeeze more entries in when a set of PTEs
+are virtually and physically contiguous and appropriately aligned. In
+this case, TLB misses will occur less often.
+
 THP can be enabled system wide or restricted to certain tasks or even
 memory ranges inside task's address space. Unless THP is completely
 disabled, there is ``khugepaged`` daemon that scans memory and
-collapses sequences of basic pages into huge pages.
+collapses sequences of basic pages into PMD-sized huge pages.
 
 The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
 interface and using madvise(2) and prctl(2) system calls.
@@ -146,25 +158,69 @@ madvise
 never
 	should be self-explanatory.
 
-By default kernel tries to use huge zero page on read page fault to
-anonymous mapping. It's possible to disable huge zero page by writing 0
-or enable it back by writing 1::
+By default kernel tries to use huge, PMD-mapped zero page on read page
+fault to anonymous mapping. It's possible to disable huge zero page by
+writing 0 or enable it back by writing 1::
 
 	echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
 	echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
 
 Some userspace (such as a test program, or an optimized memory allocation
-library) may want to know the size (in bytes) of a transparent hugepage::
+library) may want to know the size (in bytes) of a PMD-mappable
+transparent hugepage::
 
 	cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
 
+By default, allocation of anonymous THPs that are smaller than
+PMD-size is disabled. These smaller allocation orders can be enabled
+by writing an encoded set of orders as follows::
+
+	echo 0x208 >/sys/kernel/mm/transparent_hugepage/anon_orders
+
+Where an order refers to the number of pages in the large folio as
+2^order, and where each order is encoded in the written value such
+that each set bit represents an enabled order; So setting bit-2
+indicates that order-2 folios are in use, and order-2 means 2^2=4
+pages (=16K if the page size is 4K). The example above enables order-9
+(PMD-order) and order-3.
+
+By enabling multiple orders, allocation of each order will be
+attempted, highest to lowest, until a successful allocation is made.
+If the PMD-order is unset, then no PMD-sized THPs will be allocated.
+
+The kernel will ignore any orders that it does not support so read the
+file back to determine which orders are enabled::
+
+	cat /sys/kernel/mm/transparent_hugepage/anon_orders
+
+For some workloads it may be desirable to limit some THP orders to be
+used only for MADV_HUGEPAGE regions, while allowing others to be used
+always. For example, a workload may only benefit from PMD-sized THP in
+specific areas, but can benefit from 32K-sized THP more generally.
+In this case, THP can be enabled in ``madvise`` mode as normal, but
+specific orders can be configured to be allocated as if in ``always``
+mode. The below example enables orders 9 and 3, with order-9 only
+applied to MADV_HUGEPAGE regions, and order-3 applied always::
+
+	echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
+	echo 0x208 >/sys/kernel/mm/transparent_hugepage/anon_orders
+	echo 0x008 >/sys/kernel/mm/transparent_hugepage/anon_always_mask
+
 khugepaged will be automatically started when
-transparent_hugepage/enabled is set to "always" or "madvise, and it'll
-be automatically shutdown if it's set to "never".
+transparent_hugepage/enabled is set to "always" or "madvise",
+providing the PMD-order is enabled in
+transparent_hugepage/anon_orders, and it'll be automatically shutdown
+if it's set to "never" or the PMD-order is disabled in
+transparent_hugepage/anon_orders.
 
 Khugepaged controls
 -------------------
 
+.. note::
+   khugepaged currently only searches for opportunities to collapse to
+   PMD-sized THP and no attempt is made to collapse to smaller order
+   THP.
+
 khugepaged runs usually at low frequency so while one may not want to
 invoke defrag algorithms synchronously during the page faults, it
 should be worth invoking defrag at least in khugepaged. However it's
@@ -285,7 +341,7 @@ Need of application restart
 The transparent_hugepage/enabled values and tmpfs mount option only affect
 future behavior. So to make them effective you need to restart any
 application that could have been using hugepages. This also applies to the
-regions registered in khugepaged.
+regions registered in khugepaged, and transparent_hugepage/anon_orders.
 
 Monitoring usage
 ================
@@ -416,7 +472,7 @@ for huge pages.
 Optimizing the applications
 ===========================
 
-To be guaranteed that the kernel will map a 2M page immediately in any
+To be guaranteed that the kernel will map a thp immediately in any
 memory region, the mmap region has to be hugepage naturally
 aligned. posix_memalign() can provide that guarantee.
 
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index ccbb76a509f0..72526f8bb658 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -533,9 +533,9 @@ replaced by copy-on-write) part of the underlying shmem object out on swap.
 does not take into account swapped out page of underlying shmem objects.
 "Locked" indicates whether the mapping is locked in memory or not.
 
-"THPeligible" indicates whether the mapping is eligible for allocating THP
-pages as well as the THP is PMD mappable or not - 1 if true, 0 otherwise.
-It just shows the current status.
+"THPeligible" indicates whether the mapping is eligible for allocating
+naturally aligned THP pages of any currently enabled order. 1 if true, 0
+otherwise. It just shows the current status.
 
 "VmFlags" field deserves a separate description. This member represents the
 kernel flags associated with the particular virtual memory area in two letter
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 7b5dad163533..f978dce7f7ce 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -869,7 +869,8 @@ static int show_smap(struct seq_file *m, void *v)
 	__show_smap(m, &mss, false);
 
 	seq_printf(m, "THPeligible:    %8u\n",
-		   hugepage_vma_check(vma, vma->vm_flags, true, false, true));
+		   !!hugepage_vma_check(vma, vma->vm_flags, true, false, true,
+					THP_ORDERS_ALL));
 
 	if (arch_pkeys_enabled())
 		seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index fa0350b0812a..2e7c338229a6 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -67,6 +67,21 @@ extern struct kobj_attribute shmem_enabled_attr;
 #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
 #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
 
+/*
+ * Mask of all large folio orders supported for anonymous THP.
+ */
+#define THP_ORDERS_ALL_ANON	BIT(PMD_ORDER)
+
+/*
+ * Mask of all large folio orders supported for file THP.
+ */
+#define THP_ORDERS_ALL_FILE	(BIT(PMD_ORDER) | BIT(PUD_ORDER))
+
+/*
+ * Mask of all large folio orders supported for THP.
+ */
+#define THP_ORDERS_ALL		(THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_FILE)
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define HPAGE_PMD_SHIFT PMD_SHIFT
 #define HPAGE_PMD_SIZE	((1UL) << HPAGE_PMD_SHIFT)
@@ -77,6 +92,7 @@ extern struct kobj_attribute shmem_enabled_attr;
 #define HPAGE_PUD_MASK	(~(HPAGE_PUD_SIZE - 1))
 
 extern unsigned long transparent_hugepage_flags;
+extern unsigned int huge_anon_orders;
 
 #define hugepage_flags_enabled()					       \
 	(transparent_hugepage_flags &				       \
@@ -86,6 +102,17 @@ extern unsigned long transparent_hugepage_flags;
 	(transparent_hugepage_flags &			\
 	 (1<<TRANSPARENT_HUGEPAGE_FLAG))
 
+static inline int first_order(unsigned int orders)
+{
+	return fls(orders) - 1;
+}
+
+static inline int next_order(unsigned int *orders, int prev)
+{
+	*orders &= ~BIT(prev);
+	return first_order(*orders);
+}
+
 /*
  * Do the below checks:
  *   - For file vma, check if the linear page offset of vma is
@@ -97,23 +124,39 @@ extern unsigned long transparent_hugepage_flags;
  *   - For all vmas, check if the haddr is in an aligned HPAGE_PMD_SIZE
  *     area.
  */
-static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
-		unsigned long addr)
-{
-	unsigned long haddr;
-
-	/* Don't have to check pgoff for anonymous vma */
-	if (!vma_is_anonymous(vma)) {
-		if (!IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
-				HPAGE_PMD_NR))
-			return false;
+static inline unsigned int transhuge_vma_suitable(struct vm_area_struct *vma,
+		unsigned long addr, unsigned int orders)
+{
+	int order;
+
+	/*
+	 * Iterate over orders, highest to lowest, removing orders that don't
+	 * meet alignment requirements from the set. Exit loop at first order
+	 * that meets requirements, since all lower orders must also meet
+	 * requirements.
+	 */
+
+	order = first_order(orders);
+
+	while (orders) {
+		unsigned long hpage_size = PAGE_SIZE << order;
+		unsigned long haddr = ALIGN_DOWN(addr, hpage_size);
+
+		if (haddr >= vma->vm_start &&
+		    haddr + hpage_size <= vma->vm_end) {
+			if (!vma_is_anonymous(vma)) {
+				if (IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) -
+						vma->vm_pgoff,
+						hpage_size >> PAGE_SHIFT))
+					break;
+			} else
+				break;
+		}
+
+		order = next_order(&orders, order);
 	}
 
-	haddr = addr & HPAGE_PMD_MASK;
-
-	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
-		return false;
-	return true;
+	return orders;
 }
 
 static inline bool file_thp_enabled(struct vm_area_struct *vma)
@@ -130,8 +173,9 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
 	       !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
 }
 
-bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
-			bool smaps, bool in_pf, bool enforce_sysfs);
+unsigned int hugepage_vma_check(struct vm_area_struct *vma,
+				unsigned long vm_flags, bool smaps, bool in_pf,
+				bool enforce_sysfs, unsigned int orders);
 
 #define transparent_hugepage_use_zero_page()				\
 	(transparent_hugepage_flags &					\
@@ -267,17 +311,18 @@ static inline bool folio_test_pmd_mappable(struct folio *folio)
 	return false;
 }
 
-static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
-		unsigned long addr)
+static inline unsigned int transhuge_vma_suitable(struct vm_area_struct *vma,
+		unsigned long addr, unsigned int orders)
 {
-	return false;
+	return 0;
 }
 
-static inline bool hugepage_vma_check(struct vm_area_struct *vma,
-				      unsigned long vm_flags, bool smaps,
-				      bool in_pf, bool enforce_sysfs)
+static inline unsigned int hugepage_vma_check(struct vm_area_struct *vma,
+					unsigned long vm_flags, bool smaps,
+					bool in_pf, bool enforce_sysfs,
+					unsigned int orders)
 {
-	return false;
+	return 0;
 }
 
 static inline void folio_prep_large_rmappable(struct folio *folio) {}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 064fbd90822b..bcecce769017 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -70,12 +70,48 @@ static struct shrinker deferred_split_shrinker;
 static atomic_t huge_zero_refcount;
 struct page *huge_zero_page __read_mostly;
 unsigned long huge_zero_pfn __read_mostly = ~0UL;
+unsigned int huge_anon_orders __read_mostly = BIT(PMD_ORDER);
+static unsigned int huge_anon_always_mask __read_mostly;
 
-bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
-			bool smaps, bool in_pf, bool enforce_sysfs)
+/**
+ * hugepage_vma_check - determine which hugepage orders can be applied to vma
+ * @vma:  the vm area to check
+ * @vm_flags: use these vm_flags instead of vma->vm_flags
+ * @smaps: whether answer will be used for smaps file
+ * @in_pf: whether answer will be used by page fault handler
+ * @enforce_sysfs: whether sysfs config should be taken into account
+ * @orders: bitfield of all orders to consider
+ *
+ * Calculates the intersection of the requested hugepage orders and the allowed
+ * hugepage orders for the provided vma. Permitted orders are encoded as a set
+ * bit at the corresponding bit position (bit-2 corresponds to order-2, bit-3
+ * corresponds to order-3, etc). Order-0 is never considered a hugepage order.
+ *
+ * Return: bitfield of orders allowed for hugepage in the vma. 0 if no hugepage
+ * orders are allowed.
+ */
+unsigned int hugepage_vma_check(struct vm_area_struct *vma,
+				unsigned long vm_flags, bool smaps, bool in_pf,
+				bool enforce_sysfs, unsigned int orders)
 {
+	/*
+	 * Fix up the orders mask; Supported orders for file vmas are static.
+	 * Supported orders for anon vmas are configured dynamically - but only
+	 * use the dynamic set if enforce_sysfs=true, otherwise use the full
+	 * set.
+	 */
+	if (vma_is_anonymous(vma))
+		orders &= enforce_sysfs ? READ_ONCE(huge_anon_orders)
+					: THP_ORDERS_ALL_ANON;
+	else
+		orders &= THP_ORDERS_ALL_FILE;
+
+	/* No orders in the intersection. */
+	if (!orders)
+		return 0;
+
 	if (!vma->vm_mm)		/* vdso */
-		return false;
+		return 0;
 
 	/*
 	 * Explicitly disabled through madvise or prctl, or some
@@ -84,16 +120,16 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
 	 * */
 	if ((vm_flags & VM_NOHUGEPAGE) ||
 	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
-		return false;
+		return 0;
 	/*
 	 * If the hardware/firmware marked hugepage support disabled.
 	 */
 	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
-		return false;
+		return 0;
 
 	/* khugepaged doesn't collapse DAX vma, but page fault is fine. */
 	if (vma_is_dax(vma))
-		return in_pf;
+		return in_pf ? orders : 0;
 
 	/*
 	 * Special VMA and hugetlb VMA.
@@ -101,17 +137,29 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
 	 * VM_MIXEDMAP set.
 	 */
 	if (vm_flags & VM_NO_KHUGEPAGED)
-		return false;
+		return 0;
 
 	/*
-	 * Check alignment for file vma and size for both file and anon vma.
+	 * Check alignment for file vma and size for both file and anon vma by
+	 * filtering out the unsuitable orders.
 	 *
 	 * Skip the check for page fault. Huge fault does the check in fault
-	 * handlers. And this check is not suitable for huge PUD fault.
+	 * handlers.
 	 */
-	if (!in_pf &&
-	    !transhuge_vma_suitable(vma, (vma->vm_end - HPAGE_PMD_SIZE)))
-		return false;
+	if (!in_pf) {
+		int order = first_order(orders);
+		unsigned long addr;
+
+		while (orders) {
+			addr = vma->vm_end - (PAGE_SIZE << order);
+			if (transhuge_vma_suitable(vma, addr, BIT(order)))
+				break;
+			order = next_order(&orders, order);
+		}
+
+		if (!orders)
+			return 0;
+	}
 
 	/*
 	 * Enabled via shmem mount options or sysfs settings.
@@ -120,23 +168,35 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
 	 */
 	if (!in_pf && shmem_file(vma->vm_file))
 		return shmem_is_huge(file_inode(vma->vm_file), vma->vm_pgoff,
-				     !enforce_sysfs, vma->vm_mm, vm_flags);
+				     !enforce_sysfs, vma->vm_mm, vm_flags)
+			? orders : 0;
 
 	/* Enforce sysfs THP requirements as necessary */
-	if (enforce_sysfs &&
-	    (!hugepage_flags_enabled() || (!(vm_flags & VM_HUGEPAGE) &&
-					   !hugepage_flags_always())))
-		return false;
+	if (enforce_sysfs) {
+		/* enabled=never. */
+		if (!hugepage_flags_enabled())
+			return 0;
+
+		/* enabled=madvise without VM_HUGEPAGE. */
+		if (!(vm_flags & VM_HUGEPAGE) && !hugepage_flags_always()) {
+			if (vma_is_anonymous(vma)) {
+				orders &= READ_ONCE(huge_anon_always_mask);
+				if (!orders)
+					return 0;
+			} else
+				return 0;
+		}
+	}
 
 	/* Only regular file is valid */
 	if (!in_pf && file_thp_enabled(vma))
-		return true;
+		return orders;
 
 	if (!vma_is_anonymous(vma))
-		return false;
+		return 0;
 
 	if (vma_is_temporary_stack(vma))
-		return false;
+		return 0;
 
 	/*
 	 * THPeligible bit of smaps should show 1 for proper VMAs even
@@ -146,9 +206,9 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
 	 * the first page fault.
 	 */
 	if (!vma->anon_vma)
-		return (smaps || in_pf);
+		return (smaps || in_pf) ? orders : 0;
 
-	return true;
+	return orders;
 }
 
 static bool get_huge_zero_page(void)
@@ -391,11 +451,69 @@ static ssize_t hpage_pmd_size_show(struct kobject *kobj,
 static struct kobj_attribute hpage_pmd_size_attr =
 	__ATTR_RO(hpage_pmd_size);
 
+static ssize_t anon_orders_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "0x%08x\n", READ_ONCE(huge_anon_orders));
+}
+
+static ssize_t anon_orders_store(struct kobject *kobj,
+				 struct kobj_attribute *attr,
+				 const char *buf, size_t count)
+{
+	int err;
+	int ret = count;
+	unsigned int orders;
+
+	err = kstrtouint(buf, 0, &orders);
+	if (err)
+		ret = -EINVAL;
+
+	if (ret > 0) {
+		orders &= THP_ORDERS_ALL_ANON;
+		WRITE_ONCE(huge_anon_orders, orders);
+
+		err = start_stop_khugepaged();
+		if (err)
+			ret = err;
+	}
+
+	return ret;
+}
+
+static struct kobj_attribute anon_orders_attr = __ATTR_RW(anon_orders);
+
+static ssize_t anon_always_mask_show(struct kobject *kobj,
+				     struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "0x%08x\n", READ_ONCE(huge_anon_always_mask));
+}
+
+static ssize_t anon_always_mask_store(struct kobject *kobj,
+				      struct kobj_attribute *attr,
+				      const char *buf, size_t count)
+{
+	int err;
+	unsigned int always_mask;
+
+	err = kstrtouint(buf, 0, &always_mask);
+	if (err)
+		return -EINVAL;
+
+	WRITE_ONCE(huge_anon_always_mask, always_mask);
+
+	return count;
+}
+
+static struct kobj_attribute anon_always_mask_attr = __ATTR_RW(anon_always_mask);
+
 static struct attribute *hugepage_attr[] = {
 	&enabled_attr.attr,
 	&defrag_attr.attr,
 	&use_zero_page_attr.attr,
 	&hpage_pmd_size_attr.attr,
+	&anon_orders_attr.attr,
+	&anon_always_mask_attr.attr,
 #ifdef CONFIG_SHMEM
 	&shmem_enabled_attr.attr,
 #endif
@@ -778,7 +896,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	struct folio *folio;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
 
-	if (!transhuge_vma_suitable(vma, haddr))
+	if (!transhuge_vma_suitable(vma, haddr, BIT(PMD_ORDER)))
 		return VM_FAULT_FALLBACK;
 	if (unlikely(anon_vma_prepare(vma)))
 		return VM_FAULT_OOM;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 88433cc25d8a..2b5c0321d96b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -446,7 +446,8 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
 {
 	if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
 	    hugepage_flags_enabled()) {
-		if (hugepage_vma_check(vma, vm_flags, false, false, true))
+		if (hugepage_vma_check(vma, vm_flags, false, false, true,
+				       BIT(PMD_ORDER)))
 			__khugepaged_enter(vma->vm_mm);
 	}
 }
@@ -921,10 +922,10 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	if (!vma)
 		return SCAN_VMA_NULL;
 
-	if (!transhuge_vma_suitable(vma, address))
+	if (!transhuge_vma_suitable(vma, address, BIT(PMD_ORDER)))
 		return SCAN_ADDRESS_RANGE;
 	if (!hugepage_vma_check(vma, vma->vm_flags, false, false,
-				cc->is_khugepaged))
+				cc->is_khugepaged, BIT(PMD_ORDER)))
 		return SCAN_VMA_CHECK;
 	/*
 	 * Anon VMA expected, the address may be unmapped then
@@ -1499,7 +1500,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	 * and map it by a PMD, regardless of sysfs THP settings. As such, let's
 	 * analogously elide sysfs THP settings here.
 	 */
-	if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
+	if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false,
+				BIT(PMD_ORDER)))
 		return SCAN_VMA_CHECK;
 
 	/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
@@ -2369,7 +2371,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 			progress++;
 			break;
 		}
-		if (!hugepage_vma_check(vma, vma->vm_flags, false, false, true)) {
+		if (!hugepage_vma_check(vma, vma->vm_flags, false, false, true,
+					BIT(PMD_ORDER))) {
 skip:
 			progress++;
 			continue;
@@ -2626,7 +2629,7 @@ int start_stop_khugepaged(void)
 	int err = 0;
 
 	mutex_lock(&khugepaged_mutex);
-	if (hugepage_flags_enabled()) {
+	if (hugepage_flags_enabled() && (huge_anon_orders & BIT(PMD_ORDER))) {
 		if (!khugepaged_thread)
 			khugepaged_thread = kthread_run(khugepaged, NULL,
 							"khugepaged");
@@ -2706,7 +2709,8 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 
 	*prev = vma;
 
-	if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
+	if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false,
+				BIT(PMD_ORDER)))
 		return -EINVAL;
 
 	cc = kmalloc(sizeof(*cc), GFP_KERNEL);
diff --git a/mm/memory.c b/mm/memory.c
index e4b0f6a461d8..b5b82fc8e164 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4256,7 +4256,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 	pmd_t entry;
 	vm_fault_t ret = VM_FAULT_FALLBACK;
 
-	if (!transhuge_vma_suitable(vma, haddr))
+	if (!transhuge_vma_suitable(vma, haddr, BIT(PMD_ORDER)))
 		return ret;
 
 	page = compound_head(page);
@@ -5055,7 +5055,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		return VM_FAULT_OOM;
 retry_pud:
 	if (pud_none(*vmf.pud) &&
-	    hugepage_vma_check(vma, vm_flags, false, true, true)) {
+	    hugepage_vma_check(vma, vm_flags, false, true, true, BIT(PUD_ORDER))) {
 		ret = create_huge_pud(&vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
@@ -5089,7 +5089,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		goto retry_pud;
 
 	if (pmd_none(*vmf.pmd) &&
-	    hugepage_vma_check(vma, vm_flags, false, true, true)) {
+	    hugepage_vma_check(vma, vm_flags, false, true, true, BIT(PMD_ORDER))) {
 		ret = create_huge_pmd(&vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index e0b368e545ed..5f7e89c5b595 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -268,7 +268,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			 * cleared *pmd but not decremented compound_mapcount().
 			 */
 			if ((pvmw->flags & PVMW_SYNC) &&
-			    transhuge_vma_suitable(vma, pvmw->address) &&
+			    transhuge_vma_suitable(vma, pvmw->address,
+						   BIT(PMD_ORDER)) &&
 			    (pvmw->nr_pages >= HPAGE_PMD_NR)) {
 				spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH v6 4/9] mm: thp: Introduce anon_orders and anon_always_mask sysfs files
@ 2023-09-29 11:44   ` Ryan Roberts
  0 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-09-29 11:44 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: Ryan Roberts, linux-mm, linux-kernel, linux-arm-kernel

In preparation for adding support for anonymous large folios that are
smaller than the PMD-size, introduce 2 new sysfs files that will be used
to control the new behaviours via the transparent_hugepage interface.
For now, the kernel still only supports PMD-order anonymous THP, so when
reading back anon_orders, it will reflect that. Therefore there are no
behavioural changes intended here.

The bulk of the change is implemented by converting
transhuge_vma_suitable() and hugepage_vma_check() so that they take a
bitfield of orders for which the user wants to determine support, and
the functions filter out all the orders that can't be supported. If
there is only 1 order set in the input then the output can continue to
be treated like a boolean; this is the case for most call sites.

The remainder is copied from Documentation/admin-guide/mm/transhuge.rst,
as modified by this commit. See that file for further details.

By default, allocation of anonymous THPs that are smaller than PMD-size
is disabled. These smaller allocation orders can be enabled by writing
an encoded set of orders as follows::

	echo 0x208 >/sys/kernel/mm/transparent_hugepage/anon_orders

Where an order refers to the number of pages in the large folio as
2^order, and where each order is encoded in the written value such that
each set bit represents an enabled order; So setting bit-2 indicates
that order-2 folios are in use, and order-2 means 2^2=4 pages (=16K if
the page size is 4K). The example above enables order-9 (PMD-order) and
order-3.

By enabling multiple orders, allocation of each order will be attempted,
highest to lowest, until a successful allocation is made. If the
PMD-order is unset, then no PMD-sized THPs will be allocated.

The kernel will ignore any orders that it does not support so read the
file back to determine which orders are enabled::

	cat /sys/kernel/mm/transparent_hugepage/anon_orders

For some workloads it may be desirable to limit some THP orders to be
used only for MADV_HUGEPAGE regions, while allowing others to be used
always. For example, a workload may only benefit from PMD-sized THP in
specific areas, but can benefit from 32K-sized THP more generally. In
this case, THP can be enabled in ``madvise`` mode as normal, but
specific orders can be configured to be allocated as if in ``always``
mode. The below example enables orders 9 and 3, with order-9 only
applied to MADV_HUGEPAGE regions, and order-3 applied always::

	echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
	echo 0x208 >/sys/kernel/mm/transparent_hugepage/anon_orders
	echo 0x008 >/sys/kernel/mm/transparent_hugepage/anon_always_mask

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 Documentation/admin-guide/mm/transhuge.rst |  74 ++++++++--
 Documentation/filesystems/proc.rst         |   6 +-
 fs/proc/task_mmu.c                         |   3 +-
 include/linux/huge_mm.h                    |  93 +++++++++---
 mm/huge_memory.c                           | 164 ++++++++++++++++++---
 mm/khugepaged.c                            |  18 ++-
 mm/memory.c                                |   6 +-
 mm/page_vma_mapped.c                       |   3 +-
 8 files changed, 296 insertions(+), 71 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index ebda57850643..9f954e73a4ca 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -45,10 +45,22 @@ components:
    the two is using hugepages just because of the fact the TLB miss is
    going to run faster.
 
+Furthermore, it is possible to configure THP to allocate large folios
+to back anonymous memory, which are smaller than PMD-size (for example
+16K, 32K, 64K, etc). These THPs continue to be PTE-mapped, but in many
+cases can still provide similar benefits to those outlined above:
+Page faults are significantly reduced (by a factor of e.g. 4, 8, 16,
+etc), but latency spikes are much less prominent because the size of
+each page isn't as huge as the PMD-sized variant and there is less
+memory to clear in each page fault. Some architectures also employ TLB
+compression mechanisms to squeeze more entries in when a set of PTEs
+are virtually and physically contiguous and appropriately aligned. In
+this case, TLB misses will occur less often.
+
 THP can be enabled system wide or restricted to certain tasks or even
 memory ranges inside task's address space. Unless THP is completely
 disabled, there is ``khugepaged`` daemon that scans memory and
-collapses sequences of basic pages into huge pages.
+collapses sequences of basic pages into PMD-sized huge pages.
 
 The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
 interface and using madvise(2) and prctl(2) system calls.
@@ -146,25 +158,69 @@ madvise
 never
 	should be self-explanatory.
 
-By default kernel tries to use huge zero page on read page fault to
-anonymous mapping. It's possible to disable huge zero page by writing 0
-or enable it back by writing 1::
+By default kernel tries to use huge, PMD-mapped zero page on read page
+fault to anonymous mapping. It's possible to disable huge zero page by
+writing 0 or enable it back by writing 1::
 
 	echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
 	echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
 
 Some userspace (such as a test program, or an optimized memory allocation
-library) may want to know the size (in bytes) of a transparent hugepage::
+library) may want to know the size (in bytes) of a PMD-mappable
+transparent hugepage::
 
 	cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
 
+By default, allocation of anonymous THPs that are smaller than
+PMD-size is disabled. These smaller allocation orders can be enabled
+by writing an encoded set of orders as follows::
+
+	echo 0x208 >/sys/kernel/mm/transparent_hugepage/anon_orders
+
+Where an order refers to the number of pages in the large folio as
+2^order, and where each order is encoded in the written value such
+that each set bit represents an enabled order; So setting bit-2
+indicates that order-2 folios are in use, and order-2 means 2^2=4
+pages (=16K if the page size is 4K). The example above enables order-9
+(PMD-order) and order-3.
+
+By enabling multiple orders, allocation of each order will be
+attempted, highest to lowest, until a successful allocation is made.
+If the PMD-order is unset, then no PMD-sized THPs will be allocated.
+
+The kernel will ignore any orders that it does not support so read the
+file back to determine which orders are enabled::
+
+	cat /sys/kernel/mm/transparent_hugepage/anon_orders
+
+For some workloads it may be desirable to limit some THP orders to be
+used only for MADV_HUGEPAGE regions, while allowing others to be used
+always. For example, a workload may only benefit from PMD-sized THP in
+specific areas, but can benefit from 32K-sized THP more generally.
+In this case, THP can be enabled in ``madvise`` mode as normal, but
+specific orders can be configured to be allocated as if in ``always``
+mode. The below example enables orders 9 and 3, with order-9 only
+applied to MADV_HUGEPAGE regions, and order-3 applied always::
+
+	echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
+	echo 0x208 >/sys/kernel/mm/transparent_hugepage/anon_orders
+	echo 0x008 >/sys/kernel/mm/transparent_hugepage/anon_always_mask
+
 khugepaged will be automatically started when
-transparent_hugepage/enabled is set to "always" or "madvise, and it'll
-be automatically shutdown if it's set to "never".
+transparent_hugepage/enabled is set to "always" or "madvise",
+providing the PMD-order is enabled in
+transparent_hugepage/anon_orders, and it'll be automatically shutdown
+if it's set to "never" or the PMD-order is disabled in
+transparent_hugepage/anon_orders.
 
 Khugepaged controls
 -------------------
 
+.. note::
+   khugepaged currently only searches for opportunities to collapse to
+   PMD-sized THP and no attempt is made to collapse to smaller order
+   THP.
+
 khugepaged runs usually at low frequency so while one may not want to
 invoke defrag algorithms synchronously during the page faults, it
 should be worth invoking defrag at least in khugepaged. However it's
@@ -285,7 +341,7 @@ Need of application restart
 The transparent_hugepage/enabled values and tmpfs mount option only affect
 future behavior. So to make them effective you need to restart any
 application that could have been using hugepages. This also applies to the
-regions registered in khugepaged.
+regions registered in khugepaged, and transparent_hugepage/anon_orders.
 
 Monitoring usage
 ================
@@ -416,7 +472,7 @@ for huge pages.
 Optimizing the applications
 ===========================
 
-To be guaranteed that the kernel will map a 2M page immediately in any
+To be guaranteed that the kernel will map a thp immediately in any
 memory region, the mmap region has to be hugepage naturally
 aligned. posix_memalign() can provide that guarantee.
 
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index ccbb76a509f0..72526f8bb658 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -533,9 +533,9 @@ replaced by copy-on-write) part of the underlying shmem object out on swap.
 does not take into account swapped out page of underlying shmem objects.
 "Locked" indicates whether the mapping is locked in memory or not.
 
-"THPeligible" indicates whether the mapping is eligible for allocating THP
-pages as well as the THP is PMD mappable or not - 1 if true, 0 otherwise.
-It just shows the current status.
+"THPeligible" indicates whether the mapping is eligible for allocating
+naturally aligned THP pages of any currently enabled order. 1 if true, 0
+otherwise. It just shows the current status.
 
 "VmFlags" field deserves a separate description. This member represents the
 kernel flags associated with the particular virtual memory area in two letter
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 7b5dad163533..f978dce7f7ce 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -869,7 +869,8 @@ static int show_smap(struct seq_file *m, void *v)
 	__show_smap(m, &mss, false);
 
 	seq_printf(m, "THPeligible:    %8u\n",
-		   hugepage_vma_check(vma, vma->vm_flags, true, false, true));
+		   !!hugepage_vma_check(vma, vma->vm_flags, true, false, true,
+					THP_ORDERS_ALL));
 
 	if (arch_pkeys_enabled())
 		seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index fa0350b0812a..2e7c338229a6 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -67,6 +67,21 @@ extern struct kobj_attribute shmem_enabled_attr;
 #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
 #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
 
+/*
+ * Mask of all large folio orders supported for anonymous THP.
+ */
+#define THP_ORDERS_ALL_ANON	BIT(PMD_ORDER)
+
+/*
+ * Mask of all large folio orders supported for file THP.
+ */
+#define THP_ORDERS_ALL_FILE	(BIT(PMD_ORDER) | BIT(PUD_ORDER))
+
+/*
+ * Mask of all large folio orders supported for THP.
+ */
+#define THP_ORDERS_ALL		(THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_FILE)
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define HPAGE_PMD_SHIFT PMD_SHIFT
 #define HPAGE_PMD_SIZE	((1UL) << HPAGE_PMD_SHIFT)
@@ -77,6 +92,7 @@ extern struct kobj_attribute shmem_enabled_attr;
 #define HPAGE_PUD_MASK	(~(HPAGE_PUD_SIZE - 1))
 
 extern unsigned long transparent_hugepage_flags;
+extern unsigned int huge_anon_orders;
 
 #define hugepage_flags_enabled()					       \
 	(transparent_hugepage_flags &				       \
@@ -86,6 +102,17 @@ extern unsigned long transparent_hugepage_flags;
 	(transparent_hugepage_flags &			\
 	 (1<<TRANSPARENT_HUGEPAGE_FLAG))
 
+static inline int first_order(unsigned int orders)
+{
+	return fls(orders) - 1;
+}
+
+static inline int next_order(unsigned int *orders, int prev)
+{
+	*orders &= ~BIT(prev);
+	return first_order(*orders);
+}
+
 /*
  * Do the below checks:
  *   - For file vma, check if the linear page offset of vma is
@@ -97,23 +124,39 @@ extern unsigned long transparent_hugepage_flags;
  *   - For all vmas, check if the haddr is in an aligned HPAGE_PMD_SIZE
  *     area.
  */
-static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
-		unsigned long addr)
-{
-	unsigned long haddr;
-
-	/* Don't have to check pgoff for anonymous vma */
-	if (!vma_is_anonymous(vma)) {
-		if (!IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
-				HPAGE_PMD_NR))
-			return false;
+static inline unsigned int transhuge_vma_suitable(struct vm_area_struct *vma,
+		unsigned long addr, unsigned int orders)
+{
+	int order;
+
+	/*
+	 * Iterate over orders, highest to lowest, removing orders that don't
+	 * meet alignment requirements from the set. Exit loop at first order
+	 * that meets requirements, since all lower orders must also meet
+	 * requirements.
+	 */
+
+	order = first_order(orders);
+
+	while (orders) {
+		unsigned long hpage_size = PAGE_SIZE << order;
+		unsigned long haddr = ALIGN_DOWN(addr, hpage_size);
+
+		if (haddr >= vma->vm_start &&
+		    haddr + hpage_size <= vma->vm_end) {
+			if (!vma_is_anonymous(vma)) {
+				if (IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) -
+						vma->vm_pgoff,
+						hpage_size >> PAGE_SHIFT))
+					break;
+			} else
+				break;
+		}
+
+		order = next_order(&orders, order);
 	}
 
-	haddr = addr & HPAGE_PMD_MASK;
-
-	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
-		return false;
-	return true;
+	return orders;
 }
 
 static inline bool file_thp_enabled(struct vm_area_struct *vma)
@@ -130,8 +173,9 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
 	       !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
 }
 
-bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
-			bool smaps, bool in_pf, bool enforce_sysfs);
+unsigned int hugepage_vma_check(struct vm_area_struct *vma,
+				unsigned long vm_flags, bool smaps, bool in_pf,
+				bool enforce_sysfs, unsigned int orders);
 
 #define transparent_hugepage_use_zero_page()				\
 	(transparent_hugepage_flags &					\
@@ -267,17 +311,18 @@ static inline bool folio_test_pmd_mappable(struct folio *folio)
 	return false;
 }
 
-static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
-		unsigned long addr)
+static inline unsigned int transhuge_vma_suitable(struct vm_area_struct *vma,
+		unsigned long addr, unsigned int orders)
 {
-	return false;
+	return 0;
 }
 
-static inline bool hugepage_vma_check(struct vm_area_struct *vma,
-				      unsigned long vm_flags, bool smaps,
-				      bool in_pf, bool enforce_sysfs)
+static inline unsigned int hugepage_vma_check(struct vm_area_struct *vma,
+					unsigned long vm_flags, bool smaps,
+					bool in_pf, bool enforce_sysfs,
+					unsigned int orders)
 {
-	return false;
+	return 0;
 }
 
 static inline void folio_prep_large_rmappable(struct folio *folio) {}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 064fbd90822b..bcecce769017 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -70,12 +70,48 @@ static struct shrinker deferred_split_shrinker;
 static atomic_t huge_zero_refcount;
 struct page *huge_zero_page __read_mostly;
 unsigned long huge_zero_pfn __read_mostly = ~0UL;
+unsigned int huge_anon_orders __read_mostly = BIT(PMD_ORDER);
+static unsigned int huge_anon_always_mask __read_mostly;
 
-bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
-			bool smaps, bool in_pf, bool enforce_sysfs)
+/**
+ * hugepage_vma_check - determine which hugepage orders can be applied to vma
+ * @vma:  the vm area to check
+ * @vm_flags: use these vm_flags instead of vma->vm_flags
+ * @smaps: whether answer will be used for smaps file
+ * @in_pf: whether answer will be used by page fault handler
+ * @enforce_sysfs: whether sysfs config should be taken into account
+ * @orders: bitfield of all orders to consider
+ *
+ * Calculates the intersection of the requested hugepage orders and the allowed
+ * hugepage orders for the provided vma. Permitted orders are encoded as a set
+ * bit at the corresponding bit position (bit-2 corresponds to order-2, bit-3
+ * corresponds to order-3, etc). Order-0 is never considered a hugepage order.
+ *
+ * Return: bitfield of orders allowed for hugepage in the vma. 0 if no hugepage
+ * orders are allowed.
+ */
+unsigned int hugepage_vma_check(struct vm_area_struct *vma,
+				unsigned long vm_flags, bool smaps, bool in_pf,
+				bool enforce_sysfs, unsigned int orders)
 {
+	/*
+	 * Fix up the orders mask; Supported orders for file vmas are static.
+	 * Supported orders for anon vmas are configured dynamically - but only
+	 * use the dynamic set if enforce_sysfs=true, otherwise use the full
+	 * set.
+	 */
+	if (vma_is_anonymous(vma))
+		orders &= enforce_sysfs ? READ_ONCE(huge_anon_orders)
+					: THP_ORDERS_ALL_ANON;
+	else
+		orders &= THP_ORDERS_ALL_FILE;
+
+	/* No orders in the intersection. */
+	if (!orders)
+		return 0;
+
 	if (!vma->vm_mm)		/* vdso */
-		return false;
+		return 0;
 
 	/*
 	 * Explicitly disabled through madvise or prctl, or some
@@ -84,16 +120,16 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
 	 * */
 	if ((vm_flags & VM_NOHUGEPAGE) ||
 	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
-		return false;
+		return 0;
 	/*
 	 * If the hardware/firmware marked hugepage support disabled.
 	 */
 	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
-		return false;
+		return 0;
 
 	/* khugepaged doesn't collapse DAX vma, but page fault is fine. */
 	if (vma_is_dax(vma))
-		return in_pf;
+		return in_pf ? orders : 0;
 
 	/*
 	 * Special VMA and hugetlb VMA.
@@ -101,17 +137,29 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
 	 * VM_MIXEDMAP set.
 	 */
 	if (vm_flags & VM_NO_KHUGEPAGED)
-		return false;
+		return 0;
 
 	/*
-	 * Check alignment for file vma and size for both file and anon vma.
+	 * Check alignment for file vma and size for both file and anon vma by
+	 * filtering out the unsuitable orders.
 	 *
 	 * Skip the check for page fault. Huge fault does the check in fault
-	 * handlers. And this check is not suitable for huge PUD fault.
+	 * handlers.
 	 */
-	if (!in_pf &&
-	    !transhuge_vma_suitable(vma, (vma->vm_end - HPAGE_PMD_SIZE)))
-		return false;
+	if (!in_pf) {
+		int order = first_order(orders);
+		unsigned long addr;
+
+		while (orders) {
+			addr = vma->vm_end - (PAGE_SIZE << order);
+			if (transhuge_vma_suitable(vma, addr, BIT(order)))
+				break;
+			order = next_order(&orders, order);
+		}
+
+		if (!orders)
+			return 0;
+	}
 
 	/*
 	 * Enabled via shmem mount options or sysfs settings.
@@ -120,23 +168,35 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
 	 */
 	if (!in_pf && shmem_file(vma->vm_file))
 		return shmem_is_huge(file_inode(vma->vm_file), vma->vm_pgoff,
-				     !enforce_sysfs, vma->vm_mm, vm_flags);
+				     !enforce_sysfs, vma->vm_mm, vm_flags)
+			? orders : 0;
 
 	/* Enforce sysfs THP requirements as necessary */
-	if (enforce_sysfs &&
-	    (!hugepage_flags_enabled() || (!(vm_flags & VM_HUGEPAGE) &&
-					   !hugepage_flags_always())))
-		return false;
+	if (enforce_sysfs) {
+		/* enabled=never. */
+		if (!hugepage_flags_enabled())
+			return 0;
+
+		/* enabled=madvise without VM_HUGEPAGE. */
+		if (!(vm_flags & VM_HUGEPAGE) && !hugepage_flags_always()) {
+			if (vma_is_anonymous(vma)) {
+				orders &= READ_ONCE(huge_anon_always_mask);
+				if (!orders)
+					return 0;
+			} else
+				return 0;
+		}
+	}
 
 	/* Only regular file is valid */
 	if (!in_pf && file_thp_enabled(vma))
-		return true;
+		return orders;
 
 	if (!vma_is_anonymous(vma))
-		return false;
+		return 0;
 
 	if (vma_is_temporary_stack(vma))
-		return false;
+		return 0;
 
 	/*
 	 * THPeligible bit of smaps should show 1 for proper VMAs even
@@ -146,9 +206,9 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
 	 * the first page fault.
 	 */
 	if (!vma->anon_vma)
-		return (smaps || in_pf);
+		return (smaps || in_pf) ? orders : 0;
 
-	return true;
+	return orders;
 }
 
 static bool get_huge_zero_page(void)
@@ -391,11 +451,69 @@ static ssize_t hpage_pmd_size_show(struct kobject *kobj,
 static struct kobj_attribute hpage_pmd_size_attr =
 	__ATTR_RO(hpage_pmd_size);
 
+static ssize_t anon_orders_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "0x%08x\n", READ_ONCE(huge_anon_orders));
+}
+
+static ssize_t anon_orders_store(struct kobject *kobj,
+				 struct kobj_attribute *attr,
+				 const char *buf, size_t count)
+{
+	int err;
+	int ret = count;
+	unsigned int orders;
+
+	err = kstrtouint(buf, 0, &orders);
+	if (err)
+		ret = -EINVAL;
+
+	if (ret > 0) {
+		orders &= THP_ORDERS_ALL_ANON;
+		WRITE_ONCE(huge_anon_orders, orders);
+
+		err = start_stop_khugepaged();
+		if (err)
+			ret = err;
+	}
+
+	return ret;
+}
+
+static struct kobj_attribute anon_orders_attr = __ATTR_RW(anon_orders);
+
+static ssize_t anon_always_mask_show(struct kobject *kobj,
+				     struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "0x%08x\n", READ_ONCE(huge_anon_always_mask));
+}
+
+static ssize_t anon_always_mask_store(struct kobject *kobj,
+				      struct kobj_attribute *attr,
+				      const char *buf, size_t count)
+{
+	int err;
+	unsigned int always_mask;
+
+	err = kstrtouint(buf, 0, &always_mask);
+	if (err)
+		return -EINVAL;
+
+	WRITE_ONCE(huge_anon_always_mask, always_mask);
+
+	return count;
+}
+
+static struct kobj_attribute anon_always_mask_attr = __ATTR_RW(anon_always_mask);
+
 static struct attribute *hugepage_attr[] = {
 	&enabled_attr.attr,
 	&defrag_attr.attr,
 	&use_zero_page_attr.attr,
 	&hpage_pmd_size_attr.attr,
+	&anon_orders_attr.attr,
+	&anon_always_mask_attr.attr,
 #ifdef CONFIG_SHMEM
 	&shmem_enabled_attr.attr,
 #endif
@@ -778,7 +896,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	struct folio *folio;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
 
-	if (!transhuge_vma_suitable(vma, haddr))
+	if (!transhuge_vma_suitable(vma, haddr, BIT(PMD_ORDER)))
 		return VM_FAULT_FALLBACK;
 	if (unlikely(anon_vma_prepare(vma)))
 		return VM_FAULT_OOM;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 88433cc25d8a..2b5c0321d96b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -446,7 +446,8 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
 {
 	if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
 	    hugepage_flags_enabled()) {
-		if (hugepage_vma_check(vma, vm_flags, false, false, true))
+		if (hugepage_vma_check(vma, vm_flags, false, false, true,
+				       BIT(PMD_ORDER)))
 			__khugepaged_enter(vma->vm_mm);
 	}
 }
@@ -921,10 +922,10 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	if (!vma)
 		return SCAN_VMA_NULL;
 
-	if (!transhuge_vma_suitable(vma, address))
+	if (!transhuge_vma_suitable(vma, address, BIT(PMD_ORDER)))
 		return SCAN_ADDRESS_RANGE;
 	if (!hugepage_vma_check(vma, vma->vm_flags, false, false,
-				cc->is_khugepaged))
+				cc->is_khugepaged, BIT(PMD_ORDER)))
 		return SCAN_VMA_CHECK;
 	/*
 	 * Anon VMA expected, the address may be unmapped then
@@ -1499,7 +1500,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	 * and map it by a PMD, regardless of sysfs THP settings. As such, let's
 	 * analogously elide sysfs THP settings here.
 	 */
-	if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
+	if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false,
+				BIT(PMD_ORDER)))
 		return SCAN_VMA_CHECK;
 
 	/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
@@ -2369,7 +2371,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 			progress++;
 			break;
 		}
-		if (!hugepage_vma_check(vma, vma->vm_flags, false, false, true)) {
+		if (!hugepage_vma_check(vma, vma->vm_flags, false, false, true,
+					BIT(PMD_ORDER))) {
 skip:
 			progress++;
 			continue;
@@ -2626,7 +2629,7 @@ int start_stop_khugepaged(void)
 	int err = 0;
 
 	mutex_lock(&khugepaged_mutex);
-	if (hugepage_flags_enabled()) {
+	if (hugepage_flags_enabled() && (huge_anon_orders & BIT(PMD_ORDER))) {
 		if (!khugepaged_thread)
 			khugepaged_thread = kthread_run(khugepaged, NULL,
 							"khugepaged");
@@ -2706,7 +2709,8 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 
 	*prev = vma;
 
-	if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
+	if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false,
+				BIT(PMD_ORDER)))
 		return -EINVAL;
 
 	cc = kmalloc(sizeof(*cc), GFP_KERNEL);
diff --git a/mm/memory.c b/mm/memory.c
index e4b0f6a461d8..b5b82fc8e164 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4256,7 +4256,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 	pmd_t entry;
 	vm_fault_t ret = VM_FAULT_FALLBACK;
 
-	if (!transhuge_vma_suitable(vma, haddr))
+	if (!transhuge_vma_suitable(vma, haddr, BIT(PMD_ORDER)))
 		return ret;
 
 	page = compound_head(page);
@@ -5055,7 +5055,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		return VM_FAULT_OOM;
 retry_pud:
 	if (pud_none(*vmf.pud) &&
-	    hugepage_vma_check(vma, vm_flags, false, true, true)) {
+	    hugepage_vma_check(vma, vm_flags, false, true, true, BIT(PUD_ORDER))) {
 		ret = create_huge_pud(&vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
@@ -5089,7 +5089,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		goto retry_pud;
 
 	if (pmd_none(*vmf.pmd) &&
-	    hugepage_vma_check(vma, vm_flags, false, true, true)) {
+	    hugepage_vma_check(vma, vm_flags, false, true, true, BIT(PMD_ORDER))) {
 		ret = create_huge_pmd(&vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index e0b368e545ed..5f7e89c5b595 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -268,7 +268,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			 * cleared *pmd but not decremented compound_mapcount().
 			 */
 			if ((pvmw->flags & PVMW_SYNC) &&
-			    transhuge_vma_suitable(vma, pvmw->address) &&
+			    transhuge_vma_suitable(vma, pvmw->address,
+						   BIT(PMD_ORDER)) &&
 			    (pvmw->nr_pages >= HPAGE_PMD_NR)) {
 				spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH v6 5/9] mm: thp: Extend THP to allocate anonymous large folios
  2023-09-29 11:44 ` Ryan Roberts
@ 2023-09-29 11:44   ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-09-29 11:44 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: Ryan Roberts, linux-mm, linux-kernel, linux-arm-kernel

Introduce the logic to allow THP to be configured (through the new
anon_orders interface we just added) to allocate large folios that are
smaller than PMD-size (for example order-2, order-3, order-4, etc) to
back anonymous memory.

These THPs continue to be PTE-mapped, but in many cases can still
provide similar benefits to traditional PMD-sized THP: Page faults are
significantly reduced (by a factor of e.g. 4, 8, 16, etc. depending on
the configured order), but latency spikes are much less prominent
because each folio isn't as huge as the PMD-sized variant and there is
less memory to clear in each page fault. The number of per-page
operations (e.g. ref counting, rmap management, lru list management) is
also significantly reduced since those ops now become per-folio.
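As a rough worked illustration of that reduction (a standalone sketch,
not part of this patch; assumes 4K base pages), the number of faults
needed to populate a fixed-size region shrinks with the folio order:

#include <stdio.h>

int main(void)
{
	unsigned long region = 2UL << 20;	/* populate a 2M anonymous region */
	int orders[] = { 0, 2, 3, 4 };		/* 4K, 16K, 32K, 64K folios */

	for (int i = 0; i < 4; i++) {
		unsigned long folio = 4096UL << orders[i];

		/* one fault now populates a whole folio via set_ptes() */
		printf("order-%d folios: ~%lu page faults\n",
		       orders[i], region / folio);
	}
	return 0;
}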

Some architectures also employ TLB compression mechanisms to squeeze
more entries in when a set of PTEs is virtually and physically
contiguous and appropriately aligned. In this case, TLB misses will
occur less often.

The new behaviour is disabled by default because anon_orders defaults to
only enabling PMD-order, but it can be enabled at runtime by writing to
anon_orders (see documentation in previous commit). The long
term aim is to default anon_orders to include suitable lower orders, but
there are some risks around internal fragmentation that need to be
better understood first.
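
For reference, a minimal standalone sketch of the bitfield encoding used
by anon_orders and of the highest-to-lowest fallback across enabled
orders. The first_order()/next_order() helpers here are assumptions
modelled on how the earlier patch in this series uses them, not the
kernel's definitions:

#include <stdio.h>

#define BIT(n)		(1u << (n))

/* assumed behaviour: highest enabled order first */
static int first_order(unsigned int orders)
{
	return 31 - __builtin_clz(orders);
}

/* assumed behaviour: drop the current order, return the next highest */
static int next_order(unsigned int *orders, int prev)
{
	*orders &= ~BIT(prev);
	return *orders ? first_order(*orders) : 0;
}

int main(void)
{
	/* enable order-4, order-3 and order-2 (64K/32K/16K with 4K pages) */
	unsigned int orders = BIT(4) | BIT(3) | BIT(2);
	int order;

	printf("anon_orders value: 0x%08x\n", orders);	/* 0x0000001c */

	for (order = first_order(orders); orders;
	     order = next_order(&orders, order))
		printf("try order-%d (%lu KiB folio)\n",
		       order, (4096UL << order) / 1024);

	return 0;
}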

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 Documentation/admin-guide/mm/transhuge.rst |   9 +-
 include/linux/huge_mm.h                    |   6 +-
 mm/memory.c                                | 108 +++++++++++++++++++--
 3 files changed, 111 insertions(+), 12 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 9f954e73a4ca..732c3b2f4ba8 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -353,7 +353,9 @@ anonymous transparent huge pages, it is necessary to read
 ``/proc/PID/smaps`` and count the AnonHugePages and AnonHugePteMap
 fields for each mapping. Note that in both cases, AnonHugePages refers
 only to PMD-mapped THPs. AnonHugePteMap refers to THPs that are mapped
-using PTEs.
+using PTEs. This includes all THPs whose order is smaller than
+PMD-order, as well as any PMD-order THPs that happen to be PTE-mapped
+for other reasons.
 
 The number of file transparent huge pages mapped to userspace is available
 by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
@@ -367,6 +369,11 @@ frequently will incur overhead.
 There are a number of counters in ``/proc/vmstat`` that may be used to
 monitor how successfully the system is providing huge pages for use.
 
+.. note::
+   Currently the below counters only record events relating to
+   PMD-order THPs. Events relating to smaller order THPs are not
+   included.
+
 thp_fault_alloc
 	is incremented every time a huge page is successfully
 	allocated to handle a page fault.
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2e7c338229a6..c4860476a1f5 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
 #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
 
 /*
- * Mask of all large folio orders supported for anonymous THP.
+ * Mask of all large folio orders supported for anonymous THP; all orders up to
+ * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
+ * (which is a limitation of the THP implementation).
  */
-#define THP_ORDERS_ALL_ANON	BIT(PMD_ORDER)
+#define THP_ORDERS_ALL_ANON	((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
 
 /*
  * Mask of all large folio orders supported for file THP.
diff --git a/mm/memory.c b/mm/memory.c
index b5b82fc8e164..92ed9c782dc9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4059,6 +4059,87 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	return ret;
 }
 
+static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
+{
+	int i;
+
+	if (nr_pages == 1)
+		return vmf_pte_changed(vmf);
+
+	for (i = 0; i < nr_pages; i++) {
+		if (!pte_none(ptep_get_lockless(vmf->pte + i)))
+			return true;
+	}
+
+	return false;
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static struct folio *alloc_anon_folio(struct vm_fault *vmf)
+{
+	gfp_t gfp;
+	pte_t *pte;
+	unsigned long addr;
+	struct folio *folio;
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned int orders;
+	int order;
+
+	/*
+	 * If uffd is active for the vma we need per-page fault fidelity to
+	 * maintain the uffd semantics.
+	 */
+	if (userfaultfd_armed(vma))
+		goto fallback;
+
+	/*
+	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
+	 * for this vma. Then filter out the orders that can't be allocated over
+	 * the faulting address and still be fully contained in the vma.
+	 */
+	orders = hugepage_vma_check(vma, vma->vm_flags, false, true, true,
+				    BIT(PMD_ORDER) - 1);
+	orders = transhuge_vma_suitable(vma, vmf->address, orders);
+
+	if (!orders)
+		goto fallback;
+
+	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
+	if (!pte)
+		return ERR_PTR(-EAGAIN);
+
+	order = first_order(orders);
+	while (orders) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+		vmf->pte = pte + pte_index(addr);
+		if (!vmf_pte_range_changed(vmf, 1 << order))
+			break;
+		order = next_order(&orders, order);
+	}
+
+	vmf->pte = NULL;
+	pte_unmap(pte);
+
+	gfp = vma_thp_gfp_mask(vma);
+
+	while (orders) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+		folio = vma_alloc_folio(gfp, order, vma, addr, true);
+		if (folio) {
+			clear_huge_page(&folio->page, addr, 1 << order);
+			return folio;
+		}
+		order = next_order(&orders, order);
+	}
+
+fallback:
+	return vma_alloc_zeroed_movable_folio(vma, vmf->address);
+}
+#else
+#define alloc_anon_folio(vmf) \
+		vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address)
+#endif
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -4066,6 +4147,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
  */
 static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 {
+	int i;
+	int nr_pages = 1;
+	unsigned long addr = vmf->address;
 	bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio;
@@ -4110,10 +4194,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	/* Allocate our own private page. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+	folio = alloc_anon_folio(vmf);
+	if (IS_ERR(folio))
+		return 0;
 	if (!folio)
 		goto oom;
 
+	nr_pages = folio_nr_pages(folio);
+	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
+
 	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
 		goto oom_free_page;
 	folio_throttle_swaprate(folio, GFP_KERNEL);
@@ -4130,12 +4219,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	if (vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry), vma);
 
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
+	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
 	if (!vmf->pte)
 		goto release;
-	if (vmf_pte_changed(vmf)) {
-		update_mmu_tlb(vma, vmf->address, vmf->pte);
+	if (vmf_pte_range_changed(vmf, nr_pages)) {
+		for (i = 0; i < nr_pages; i++)
+			update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
 		goto release;
 	}
 
@@ -4150,16 +4239,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		return handle_userfault(vmf, VM_UFFD_MISSING);
 	}
 
-	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
-	folio_add_new_anon_rmap(folio, vma, vmf->address);
+	folio_ref_add(folio, nr_pages - 1);
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
+	folio_add_new_anon_rmap(folio, vma, addr);
 	folio_add_lru_vma(folio, vma);
 setpte:
 	if (uffd_wp)
 		entry = pte_mkuffd_wp(entry);
-	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+	set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
 
 	/* No need to invalidate - it was non-present before */
-	update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+	update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
 unlock:
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH v6 6/9] mm: thp: Add "recommend" option for anon_orders
  2023-09-29 11:44 ` Ryan Roberts
@ 2023-09-29 11:44   ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-09-29 11:44 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: Ryan Roberts, linux-mm, linux-kernel, linux-arm-kernel

In addition to passing a bitfield of folio orders to enable for THP,
allow the string "recommend" to be written, which causes the system to
enable the orders preferred by the architecture and by the mm. The user
can see what these orders are by subsequently
reading back the file.

Note that these recommended orders are expected to be static for a given
boot of the system, and so the keyword "auto" was deliberately not used,
as I want to reserve it for a possible future use where the "best" order
is chosen more dynamically at runtime.

Recommended orders are determined as follows:
  - PMD_ORDER: The traditional THP size
  - arch_wants_pte_order() if implemented by the arch
  - PAGE_ALLOC_COSTLY_ORDER: The largest order kept on per-cpu free list

arch_wants_pte_order() can be overridden by the architecture if desired.
Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
set of ptes map physically contiguous, naturally aligned memory, so this
mechanism allows the architecture to optimize as required.

Here we add the default implementation of arch_wants_pte_order(), used
when the architecture does not define it, which returns -1, implying
that the HW has no preference.
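
As a worked example of the computation below (a sketch; it assumes 4K
base pages so PMD_ORDER is 9, PAGE_ALLOC_COSTLY_ORDER is 3, and an
architecture that keeps the default arch_wants_pte_order() returning
-1), writing "recommend" resolves to orders 3 and 9:

#include <stdio.h>

#define BIT(n)			(1u << (n))
#define PMD_ORDER		9	/* assumption: 4K base pages */
#define PAGE_ALLOC_COSTLY_ORDER	3
#define max(a, b)		((a) > (b) ? (a) : (b))

/* generic default added by this patch: no HW preference */
static int arch_wants_pte_order(void)
{
	return -1;
}

int main(void)
{
	int arch = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
	unsigned int orders = BIT(arch);

	orders |= BIT(PAGE_ALLOC_COSTLY_ORDER);
	orders |= BIT(PMD_ORDER);

	/* order-3 (32K) and order-9 (2M) end up enabled: 0x00000208 */
	printf("recommend -> anon_orders = 0x%08x\n", orders);
	return 0;
}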

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 Documentation/admin-guide/mm/transhuge.rst |  4 ++++
 include/linux/pgtable.h                    | 13 +++++++++++++
 mm/huge_memory.c                           | 14 +++++++++++---
 3 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 732c3b2f4ba8..d6363d4efa3a 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -187,6 +187,10 @@ pages (=16K if the page size is 4K). The example above enables order-9
 By enabling multiple orders, allocation of each order will be
 attempted, highest to lowest, until a successful allocation is made.
 If the PMD-order is unset, then no PMD-sized THPs will be allocated.
+It is also possible to enable the recommended set of orders, which
+will be optimized for the architecture and mm::
+
+	echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders
 
 The kernel will ignore any orders that it does not support so read the
 file back to determine which orders are enabled::
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index af7639c3b0a3..0e110ce57cc3 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -393,6 +393,19 @@ static inline void arch_check_zapped_pmd(struct vm_area_struct *vma,
 }
 #endif
 
+#ifndef arch_wants_pte_order
+/*
+ * Returns preferred folio order for pte-mapped memory. Must be in range [0,
+ * PMD_ORDER) and must not be order-1 since THP requires large folios to be at
+ * least order-2. Negative value implies that the HW has no preference and mm
+ * will choose its own default order.
+ */
+static inline int arch_wants_pte_order(void)
+{
+	return -1;
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				       unsigned long address,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bcecce769017..e2e2d3906a21 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -464,10 +464,18 @@ static ssize_t anon_orders_store(struct kobject *kobj,
 	int err;
 	int ret = count;
 	unsigned int orders;
+	int arch;
 
-	err = kstrtouint(buf, 0, &orders);
-	if (err)
-		ret = -EINVAL;
+	if (sysfs_streq(buf, "recommend")) {
+		arch = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
+		orders = BIT(arch);
+		orders |= BIT(PAGE_ALLOC_COSTLY_ORDER);
+		orders |= BIT(PMD_ORDER);
+	} else {
+		err = kstrtouint(buf, 0, &orders);
+		if (err)
+			ret = -EINVAL;
+	}
 
 	if (ret > 0) {
 		orders &= THP_ORDERS_ALL_ANON;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH v6 7/9] arm64/mm: Override arch_wants_pte_order()
  2023-09-29 11:44 ` Ryan Roberts
@ 2023-09-29 11:44   ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-09-29 11:44 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: Ryan Roberts, linux-mm, linux-kernel, linux-arm-kernel

Define an arch-specific override of arch_wants_pte_order() so that when
anon_orders=recommend is set, large folios will be allocated for
anonymous memory with an order that is compatible with arm64's HPA uarch
feature.
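
As a rough illustration (a standalone sketch; the listed base page sizes
are the configurations arm64 commonly supports, not something this patch
changes), a fixed order-2 preference corresponds to a different folio
size for each base page size:

#include <stdio.h>

int main(void)
{
	unsigned long page_sizes[] = { 4096, 16384, 65536 };	/* 4K/16K/64K */
	int order = 2;		/* arch_wants_pte_order(): 4 contiguous pages */

	for (int i = 0; i < 3; i++)
		printf("%lu KiB pages -> %lu KiB folios\n",
		       page_sizes[i] / 1024,
		       (page_sizes[i] << order) / 1024);
	return 0;
}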

Reviewed-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7f7d9b1df4e5..e3d2449dec5c 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1110,6 +1110,16 @@ extern pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
 extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
 				    unsigned long addr, pte_t *ptep,
 				    pte_t old_pte, pte_t new_pte);
+
+#define arch_wants_pte_order arch_wants_pte_order
+static inline int arch_wants_pte_order(void)
+{
+	/*
+	 * Many arm64 CPUs support hardware page aggregation (HPA), which can
+	 * coalesce 4 contiguous pages into a single TLB entry.
+	 */
+	return 2;
+}
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH v6 8/9] selftests/mm/cow: Generalize do_run_with_thp() helper
  2023-09-29 11:44 ` Ryan Roberts
@ 2023-09-29 11:44   ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-09-29 11:44 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: Ryan Roberts, linux-mm, linux-kernel, linux-arm-kernel

do_run_with_thp() prepares (PMD-sized) THP memory into different states
before running tests. With the introduction of THP orders that are
smaller than PMD_ORDER, we would like to reuse this logic to also test
those smaller orders. So let's add a size parameter which tells the
function what size THP it should operate on.

No functional change intended here, but a separate commit will add new
tests for smaller order THP, where available.
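
For illustration, a standalone check of the sz2ord() conversion this
patch introduces (a sketch; the 4K page size is an assumption, the test
itself reads getpagesize()):

#include <stdio.h>

static size_t pagesize = 4096;

/* same conversion as the helper added below */
static int sz2ord(size_t size)
{
	return __builtin_ctzll(size / pagesize);
}

int main(void)
{
	size_t sizes[] = { 16 * 1024, 64 * 1024, 2 * 1024 * 1024 };

	for (int i = 0; i < 3; i++)
		printf("%zu KiB -> order-%d\n", sizes[i] / 1024,
		       sz2ord(sizes[i]));
	return 0;
}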

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 tools/testing/selftests/mm/cow.c | 151 +++++++++++++++++--------------
 1 file changed, 84 insertions(+), 67 deletions(-)

diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c
index 7324ce5363c0..d887ce454e34 100644
--- a/tools/testing/selftests/mm/cow.c
+++ b/tools/testing/selftests/mm/cow.c
@@ -32,7 +32,7 @@
 
 static size_t pagesize;
 static int pagemap_fd;
-static size_t thpsize;
+static size_t pmdsize;
 static int nr_hugetlbsizes;
 static size_t hugetlbsizes[10];
 static int gup_fd;
@@ -734,14 +734,14 @@ enum thp_run {
 	THP_RUN_PARTIAL_SHARED,
 };
 
-static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
+static void do_run_with_thp(test_fn fn, enum thp_run thp_run, size_t size)
 {
 	char *mem, *mmap_mem, *tmp, *mremap_mem = MAP_FAILED;
-	size_t size, mmap_size, mremap_size;
+	size_t mmap_size, mremap_size;
 	int ret;
 
-	/* For alignment purposes, we need twice the thp size. */
-	mmap_size = 2 * thpsize;
+	/* For alignment purposes, we need twice the requested size. */
+	mmap_size = 2 * size;
 	mmap_mem = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
 			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
 	if (mmap_mem == MAP_FAILED) {
@@ -749,36 +749,40 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
 		return;
 	}
 
-	/* We need a THP-aligned memory area. */
-	mem = (char *)(((uintptr_t)mmap_mem + thpsize) & ~(thpsize - 1));
+	/* We need to naturally align the memory area. */
+	mem = (char *)(((uintptr_t)mmap_mem + size) & ~(size - 1));
 
-	ret = madvise(mem, thpsize, MADV_HUGEPAGE);
+	ret = madvise(mem, size, MADV_HUGEPAGE);
 	if (ret) {
 		ksft_test_result_fail("MADV_HUGEPAGE failed\n");
 		goto munmap;
 	}
 
 	/*
-	 * Try to populate a THP. Touch the first sub-page and test if we get
-	 * another sub-page populated automatically.
+	 * Try to populate a THP. Touch the first sub-page and test if
+	 * we get the last sub-page populated automatically.
 	 */
 	mem[0] = 0;
-	if (!pagemap_is_populated(pagemap_fd, mem + pagesize)) {
+	if (!pagemap_is_populated(pagemap_fd, mem + size - pagesize)) {
 		ksft_test_result_skip("Did not get a THP populated\n");
 		goto munmap;
 	}
-	memset(mem, 0, thpsize);
+	memset(mem, 0, size);
 
-	size = thpsize;
 	switch (thp_run) {
 	case THP_RUN_PMD:
 	case THP_RUN_PMD_SWAPOUT:
+		if (size != pmdsize) {
+			ksft_test_result_fail("test bug: can't PMD-map size\n");
+			goto munmap;
+		}
 		break;
 	case THP_RUN_PTE:
 	case THP_RUN_PTE_SWAPOUT:
 		/*
 		 * Trigger PTE-mapping the THP by temporarily mapping a single
-		 * subpage R/O.
+		 * subpage R/O. This is a noop if the THP is not pmdsize (and
+		 * therefore already PTE-mapped).
 		 */
 		ret = mprotect(mem + pagesize, pagesize, PROT_READ);
 		if (ret) {
@@ -797,7 +801,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
 		 * Discard all but a single subpage of that PTE-mapped THP. What
 		 * remains is a single PTE mapping a single subpage.
 		 */
-		ret = madvise(mem + pagesize, thpsize - pagesize, MADV_DONTNEED);
+		ret = madvise(mem + pagesize, size - pagesize, MADV_DONTNEED);
 		if (ret) {
 			ksft_test_result_fail("MADV_DONTNEED failed\n");
 			goto munmap;
@@ -809,7 +813,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
 		 * Remap half of the THP. We need some new memory location
 		 * for that.
 		 */
-		mremap_size = thpsize / 2;
+		mremap_size = size / 2;
 		mremap_mem = mmap(NULL, mremap_size, PROT_NONE,
 				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
 		if (mem == MAP_FAILED) {
@@ -830,7 +834,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
 		 * child. This will result in some parts of the THP never
 		 * have been shared.
 		 */
-		ret = madvise(mem + pagesize, thpsize - pagesize, MADV_DONTFORK);
+		ret = madvise(mem + pagesize, size - pagesize, MADV_DONTFORK);
 		if (ret) {
 			ksft_test_result_fail("MADV_DONTFORK failed\n");
 			goto munmap;
@@ -844,7 +848,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
 		}
 		wait(&ret);
 		/* Allow for sharing all pages again. */
-		ret = madvise(mem + pagesize, thpsize - pagesize, MADV_DOFORK);
+		ret = madvise(mem + pagesize, size - pagesize, MADV_DOFORK);
 		if (ret) {
 			ksft_test_result_fail("MADV_DOFORK failed\n");
 			goto munmap;
@@ -875,52 +879,65 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
 		munmap(mremap_mem, mremap_size);
 }
 
-static void run_with_thp(test_fn fn, const char *desc)
+static int sz2ord(size_t size)
+{
+	return __builtin_ctzll(size / pagesize);
+}
+
+static void run_with_thp(test_fn fn, const char *desc, size_t size)
 {
-	ksft_print_msg("[RUN] %s ... with THP\n", desc);
-	do_run_with_thp(fn, THP_RUN_PMD);
+	ksft_print_msg("[RUN] %s ... with order-%d THP\n",
+		desc, sz2ord(size));
+	do_run_with_thp(fn, THP_RUN_PMD, size);
 }
 
-static void run_with_thp_swap(test_fn fn, const char *desc)
+static void run_with_thp_swap(test_fn fn, const char *desc, size_t size)
 {
-	ksft_print_msg("[RUN] %s ... with swapped-out THP\n", desc);
-	do_run_with_thp(fn, THP_RUN_PMD_SWAPOUT);
+	ksft_print_msg("[RUN] %s ... with swapped-out order-%d THP\n",
+		desc, sz2ord(size));
+	do_run_with_thp(fn, THP_RUN_PMD_SWAPOUT, size);
 }
 
-static void run_with_pte_mapped_thp(test_fn fn, const char *desc)
+static void run_with_pte_mapped_thp(test_fn fn, const char *desc, size_t size)
 {
-	ksft_print_msg("[RUN] %s ... with PTE-mapped THP\n", desc);
-	do_run_with_thp(fn, THP_RUN_PTE);
+	ksft_print_msg("[RUN] %s ... with PTE-mapped order-%d THP\n",
+		desc, sz2ord(size));
+	do_run_with_thp(fn, THP_RUN_PTE, size);
 }
 
-static void run_with_pte_mapped_thp_swap(test_fn fn, const char *desc)
+static void run_with_pte_mapped_thp_swap(test_fn fn, const char *desc, size_t size)
 {
-	ksft_print_msg("[RUN] %s ... with swapped-out, PTE-mapped THP\n", desc);
-	do_run_with_thp(fn, THP_RUN_PTE_SWAPOUT);
+	ksft_print_msg("[RUN] %s ... with swapped-out, PTE-mapped order-%d THP\n",
+		desc, sz2ord(size));
+	do_run_with_thp(fn, THP_RUN_PTE_SWAPOUT, size);
 }
 
-static void run_with_single_pte_of_thp(test_fn fn, const char *desc)
+static void run_with_single_pte_of_thp(test_fn fn, const char *desc, size_t size)
 {
-	ksft_print_msg("[RUN] %s ... with single PTE of THP\n", desc);
-	do_run_with_thp(fn, THP_RUN_SINGLE_PTE);
+	ksft_print_msg("[RUN] %s ... with single PTE of order-%d THP\n",
+		desc, sz2ord(size));
+	do_run_with_thp(fn, THP_RUN_SINGLE_PTE, size);
 }
 
-static void run_with_single_pte_of_thp_swap(test_fn fn, const char *desc)
+static void run_with_single_pte_of_thp_swap(test_fn fn, const char *desc, size_t size)
 {
-	ksft_print_msg("[RUN] %s ... with single PTE of swapped-out THP\n", desc);
-	do_run_with_thp(fn, THP_RUN_SINGLE_PTE_SWAPOUT);
+	ksft_print_msg("[RUN] %s ... with single PTE of swapped-out order-%d THP\n",
+		desc, sz2ord(size));
+	do_run_with_thp(fn, THP_RUN_SINGLE_PTE_SWAPOUT, size);
 }
 
-static void run_with_partial_mremap_thp(test_fn fn, const char *desc)
+static void run_with_partial_mremap_thp(test_fn fn, const char *desc, size_t size)
 {
-	ksft_print_msg("[RUN] %s ... with partially mremap()'ed THP\n", desc);
-	do_run_with_thp(fn, THP_RUN_PARTIAL_MREMAP);
+	ksft_print_msg("[RUN] %s ... with partially mremap()'ed order-%d THP\n",
+		desc, sz2ord(size));
+	do_run_with_thp(fn, THP_RUN_PARTIAL_MREMAP, size);
 }
 
-static void run_with_partial_shared_thp(test_fn fn, const char *desc)
+static void run_with_partial_shared_thp(test_fn fn, const char *desc, size_t size)
 {
-	ksft_print_msg("[RUN] %s ... with partially shared THP\n", desc);
-	do_run_with_thp(fn, THP_RUN_PARTIAL_SHARED);
+	ksft_print_msg("[RUN] %s ... with partially shared order-%d THP\n",
+		desc, sz2ord(size));
+	do_run_with_thp(fn, THP_RUN_PARTIAL_SHARED, size);
 }
 
 static void run_with_hugetlb(test_fn fn, const char *desc, size_t hugetlbsize)
@@ -1091,15 +1108,15 @@ static void run_anon_test_case(struct test_case const *test_case)
 
 	run_with_base_page(test_case->fn, test_case->desc);
 	run_with_base_page_swap(test_case->fn, test_case->desc);
-	if (thpsize) {
-		run_with_thp(test_case->fn, test_case->desc);
-		run_with_thp_swap(test_case->fn, test_case->desc);
-		run_with_pte_mapped_thp(test_case->fn, test_case->desc);
-		run_with_pte_mapped_thp_swap(test_case->fn, test_case->desc);
-		run_with_single_pte_of_thp(test_case->fn, test_case->desc);
-		run_with_single_pte_of_thp_swap(test_case->fn, test_case->desc);
-		run_with_partial_mremap_thp(test_case->fn, test_case->desc);
-		run_with_partial_shared_thp(test_case->fn, test_case->desc);
+	if (pmdsize) {
+		run_with_thp(test_case->fn, test_case->desc, pmdsize);
+		run_with_thp_swap(test_case->fn, test_case->desc, pmdsize);
+		run_with_pte_mapped_thp(test_case->fn, test_case->desc, pmdsize);
+		run_with_pte_mapped_thp_swap(test_case->fn, test_case->desc, pmdsize);
+		run_with_single_pte_of_thp(test_case->fn, test_case->desc, pmdsize);
+		run_with_single_pte_of_thp_swap(test_case->fn, test_case->desc, pmdsize);
+		run_with_partial_mremap_thp(test_case->fn, test_case->desc, pmdsize);
+		run_with_partial_shared_thp(test_case->fn, test_case->desc, pmdsize);
 	}
 	for (i = 0; i < nr_hugetlbsizes; i++)
 		run_with_hugetlb(test_case->fn, test_case->desc,
@@ -1120,7 +1137,7 @@ static int tests_per_anon_test_case(void)
 {
 	int tests = 2 + nr_hugetlbsizes;
 
-	if (thpsize)
+	if (pmdsize)
 		tests += 8;
 	return tests;
 }
@@ -1329,7 +1346,7 @@ static void run_anon_thp_test_cases(void)
 {
 	int i;
 
-	if (!thpsize)
+	if (!pmdsize)
 		return;
 
 	ksft_print_msg("[INFO] Anonymous THP tests\n");
@@ -1338,13 +1355,13 @@ static void run_anon_thp_test_cases(void)
 		struct test_case const *test_case = &anon_thp_test_cases[i];
 
 		ksft_print_msg("[RUN] %s\n", test_case->desc);
-		do_run_with_thp(test_case->fn, THP_RUN_PMD);
+		do_run_with_thp(test_case->fn, THP_RUN_PMD, pmdsize);
 	}
 }
 
 static int tests_per_anon_thp_test_case(void)
 {
-	return thpsize ? 1 : 0;
+	return pmdsize ? 1 : 0;
 }
 
 typedef void (*non_anon_test_fn)(char *mem, const char *smem, size_t size);
@@ -1419,7 +1436,7 @@ static void run_with_huge_zeropage(non_anon_test_fn fn, const char *desc)
 	}
 
 	/* For alignment purposes, we need twice the thp size. */
-	mmap_size = 2 * thpsize;
+	mmap_size = 2 * pmdsize;
 	mmap_mem = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
 			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
 	if (mmap_mem == MAP_FAILED) {
@@ -1434,11 +1451,11 @@ static void run_with_huge_zeropage(non_anon_test_fn fn, const char *desc)
 	}
 
 	/* We need a THP-aligned memory area. */
-	mem = (char *)(((uintptr_t)mmap_mem + thpsize) & ~(thpsize - 1));
-	smem = (char *)(((uintptr_t)mmap_smem + thpsize) & ~(thpsize - 1));
+	mem = (char *)(((uintptr_t)mmap_mem + pmdsize) & ~(pmdsize - 1));
+	smem = (char *)(((uintptr_t)mmap_smem + pmdsize) & ~(pmdsize - 1));
 
-	ret = madvise(mem, thpsize, MADV_HUGEPAGE);
-	ret |= madvise(smem, thpsize, MADV_HUGEPAGE);
+	ret = madvise(mem, pmdsize, MADV_HUGEPAGE);
+	ret |= madvise(smem, pmdsize, MADV_HUGEPAGE);
 	if (ret) {
 		ksft_test_result_fail("MADV_HUGEPAGE failed\n");
 		goto munmap;
@@ -1457,7 +1474,7 @@ static void run_with_huge_zeropage(non_anon_test_fn fn, const char *desc)
 		goto munmap;
 	}
 
-	fn(mem, smem, thpsize);
+	fn(mem, smem, pmdsize);
 munmap:
 	munmap(mmap_mem, mmap_size);
 	if (mmap_smem != MAP_FAILED)
@@ -1650,7 +1667,7 @@ static void run_non_anon_test_case(struct non_anon_test_case const *test_case)
 	run_with_zeropage(test_case->fn, test_case->desc);
 	run_with_memfd(test_case->fn, test_case->desc);
 	run_with_tmpfile(test_case->fn, test_case->desc);
-	if (thpsize)
+	if (pmdsize)
 		run_with_huge_zeropage(test_case->fn, test_case->desc);
 	for (i = 0; i < nr_hugetlbsizes; i++)
 		run_with_memfd_hugetlb(test_case->fn, test_case->desc,
@@ -1671,7 +1688,7 @@ static int tests_per_non_anon_test_case(void)
 {
 	int tests = 3 + nr_hugetlbsizes;
 
-	if (thpsize)
+	if (pmdsize)
 		tests += 1;
 	return tests;
 }
@@ -1681,10 +1698,10 @@ int main(int argc, char **argv)
 	int err;
 
 	pagesize = getpagesize();
-	thpsize = read_pmd_pagesize();
-	if (thpsize)
-		ksft_print_msg("[INFO] detected THP size: %zu KiB\n",
-			       thpsize / 1024);
+	pmdsize = read_pmd_pagesize();
+	if (pmdsize)
+		ksft_print_msg("[INFO] detected PMD-mapped THP size: %zu KiB\n",
+			       pmdsize / 1024);
 	nr_hugetlbsizes = detect_hugetlb_page_sizes(hugetlbsizes,
 						    ARRAY_SIZE(hugetlbsizes));
 	detect_huge_zeropage();
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH v6 8/9] selftests/mm/cow: Generalize do_run_with_thp() helper
@ 2023-09-29 11:44   ` Ryan Roberts
  0 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-09-29 11:44 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: Ryan Roberts, linux-mm, linux-kernel, linux-arm-kernel

do_run_with_thp() prepares (PMD-sized) THP memory into different states
before running tests. With the introduction of THP orders that are
smaller than PMD_ORDER, we would like to reuse this logic to also test
those smaller orders. So let's add a size parameter which tells the
function what size THP it should operate on.

No functional change intended here, but a separate commit will add new
tests for smaller order THP, where available.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 tools/testing/selftests/mm/cow.c | 151 +++++++++++++++++--------------
 1 file changed, 84 insertions(+), 67 deletions(-)

diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c
index 7324ce5363c0..d887ce454e34 100644
--- a/tools/testing/selftests/mm/cow.c
+++ b/tools/testing/selftests/mm/cow.c
@@ -32,7 +32,7 @@
 
 static size_t pagesize;
 static int pagemap_fd;
-static size_t thpsize;
+static size_t pmdsize;
 static int nr_hugetlbsizes;
 static size_t hugetlbsizes[10];
 static int gup_fd;
@@ -734,14 +734,14 @@ enum thp_run {
 	THP_RUN_PARTIAL_SHARED,
 };
 
-static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
+static void do_run_with_thp(test_fn fn, enum thp_run thp_run, size_t size)
 {
 	char *mem, *mmap_mem, *tmp, *mremap_mem = MAP_FAILED;
-	size_t size, mmap_size, mremap_size;
+	size_t mmap_size, mremap_size;
 	int ret;
 
-	/* For alignment purposes, we need twice the thp size. */
-	mmap_size = 2 * thpsize;
+	/* For alignment purposes, we need twice the requested size. */
+	mmap_size = 2 * size;
 	mmap_mem = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
 			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
 	if (mmap_mem == MAP_FAILED) {
@@ -749,36 +749,40 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
 		return;
 	}
 
-	/* We need a THP-aligned memory area. */
-	mem = (char *)(((uintptr_t)mmap_mem + thpsize) & ~(thpsize - 1));
+	/* We need to naturally align the memory area. */
+	mem = (char *)(((uintptr_t)mmap_mem + size) & ~(size - 1));
 
-	ret = madvise(mem, thpsize, MADV_HUGEPAGE);
+	ret = madvise(mem, size, MADV_HUGEPAGE);
 	if (ret) {
 		ksft_test_result_fail("MADV_HUGEPAGE failed\n");
 		goto munmap;
 	}
 
 	/*
-	 * Try to populate a THP. Touch the first sub-page and test if we get
-	 * another sub-page populated automatically.
+	 * Try to populate a THP. Touch the first sub-page and test if
+	 * we get the last sub-page populated automatically.
 	 */
 	mem[0] = 0;
-	if (!pagemap_is_populated(pagemap_fd, mem + pagesize)) {
+	if (!pagemap_is_populated(pagemap_fd, mem + size - pagesize)) {
 		ksft_test_result_skip("Did not get a THP populated\n");
 		goto munmap;
 	}
-	memset(mem, 0, thpsize);
+	memset(mem, 0, size);
 
-	size = thpsize;
 	switch (thp_run) {
 	case THP_RUN_PMD:
 	case THP_RUN_PMD_SWAPOUT:
+		if (size != pmdsize) {
+			ksft_test_result_fail("test bug: can't PMD-map size\n");
+			goto munmap;
+		}
 		break;
 	case THP_RUN_PTE:
 	case THP_RUN_PTE_SWAPOUT:
 		/*
 		 * Trigger PTE-mapping the THP by temporarily mapping a single
-		 * subpage R/O.
+		 * subpage R/O. This is a noop if the THP is not pmdsize (and
+		 * therefore already PTE-mapped).
 		 */
 		ret = mprotect(mem + pagesize, pagesize, PROT_READ);
 		if (ret) {
@@ -797,7 +801,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
 		 * Discard all but a single subpage of that PTE-mapped THP. What
 		 * remains is a single PTE mapping a single subpage.
 		 */
-		ret = madvise(mem + pagesize, thpsize - pagesize, MADV_DONTNEED);
+		ret = madvise(mem + pagesize, size - pagesize, MADV_DONTNEED);
 		if (ret) {
 			ksft_test_result_fail("MADV_DONTNEED failed\n");
 			goto munmap;
@@ -809,7 +813,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
 		 * Remap half of the THP. We need some new memory location
 		 * for that.
 		 */
-		mremap_size = thpsize / 2;
+		mremap_size = size / 2;
 		mremap_mem = mmap(NULL, mremap_size, PROT_NONE,
 				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
 		if (mem == MAP_FAILED) {
@@ -830,7 +834,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
 		 * child. This will result in some parts of the THP never
 		 * have been shared.
 		 */
-		ret = madvise(mem + pagesize, thpsize - pagesize, MADV_DONTFORK);
+		ret = madvise(mem + pagesize, size - pagesize, MADV_DONTFORK);
 		if (ret) {
 			ksft_test_result_fail("MADV_DONTFORK failed\n");
 			goto munmap;
@@ -844,7 +848,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
 		}
 		wait(&ret);
 		/* Allow for sharing all pages again. */
-		ret = madvise(mem + pagesize, thpsize - pagesize, MADV_DOFORK);
+		ret = madvise(mem + pagesize, size - pagesize, MADV_DOFORK);
 		if (ret) {
 			ksft_test_result_fail("MADV_DOFORK failed\n");
 			goto munmap;
@@ -875,52 +879,65 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
 		munmap(mremap_mem, mremap_size);
 }
 
-static void run_with_thp(test_fn fn, const char *desc)
+static int sz2ord(size_t size)
+{
+	return __builtin_ctzll(size / pagesize);
+}
+
+static void run_with_thp(test_fn fn, const char *desc, size_t size)
 {
-	ksft_print_msg("[RUN] %s ... with THP\n", desc);
-	do_run_with_thp(fn, THP_RUN_PMD);
+	ksft_print_msg("[RUN] %s ... with order-%d THP\n",
+		desc, sz2ord(size));
+	do_run_with_thp(fn, THP_RUN_PMD, size);
 }
 
-static void run_with_thp_swap(test_fn fn, const char *desc)
+static void run_with_thp_swap(test_fn fn, const char *desc, size_t size)
 {
-	ksft_print_msg("[RUN] %s ... with swapped-out THP\n", desc);
-	do_run_with_thp(fn, THP_RUN_PMD_SWAPOUT);
+	ksft_print_msg("[RUN] %s ... with swapped-out order-%d THP\n",
+		desc, sz2ord(size));
+	do_run_with_thp(fn, THP_RUN_PMD_SWAPOUT, size);
 }
 
-static void run_with_pte_mapped_thp(test_fn fn, const char *desc)
+static void run_with_pte_mapped_thp(test_fn fn, const char *desc, size_t size)
 {
-	ksft_print_msg("[RUN] %s ... with PTE-mapped THP\n", desc);
-	do_run_with_thp(fn, THP_RUN_PTE);
+	ksft_print_msg("[RUN] %s ... with PTE-mapped order-%d THP\n",
+		desc, sz2ord(size));
+	do_run_with_thp(fn, THP_RUN_PTE, size);
 }
 
-static void run_with_pte_mapped_thp_swap(test_fn fn, const char *desc)
+static void run_with_pte_mapped_thp_swap(test_fn fn, const char *desc, size_t size)
 {
-	ksft_print_msg("[RUN] %s ... with swapped-out, PTE-mapped THP\n", desc);
-	do_run_with_thp(fn, THP_RUN_PTE_SWAPOUT);
+	ksft_print_msg("[RUN] %s ... with swapped-out, PTE-mapped order-%d THP\n",
+		desc, sz2ord(size));
+	do_run_with_thp(fn, THP_RUN_PTE_SWAPOUT, size);
 }
 
-static void run_with_single_pte_of_thp(test_fn fn, const char *desc)
+static void run_with_single_pte_of_thp(test_fn fn, const char *desc, size_t size)
 {
-	ksft_print_msg("[RUN] %s ... with single PTE of THP\n", desc);
-	do_run_with_thp(fn, THP_RUN_SINGLE_PTE);
+	ksft_print_msg("[RUN] %s ... with single PTE of order-%d THP\n",
+		desc, sz2ord(size));
+	do_run_with_thp(fn, THP_RUN_SINGLE_PTE, size);
 }
 
-static void run_with_single_pte_of_thp_swap(test_fn fn, const char *desc)
+static void run_with_single_pte_of_thp_swap(test_fn fn, const char *desc, size_t size)
 {
-	ksft_print_msg("[RUN] %s ... with single PTE of swapped-out THP\n", desc);
-	do_run_with_thp(fn, THP_RUN_SINGLE_PTE_SWAPOUT);
+	ksft_print_msg("[RUN] %s ... with single PTE of swapped-out order-%d THP\n",
+		desc, sz2ord(size));
+	do_run_with_thp(fn, THP_RUN_SINGLE_PTE_SWAPOUT, size);
 }
 
-static void run_with_partial_mremap_thp(test_fn fn, const char *desc)
+static void run_with_partial_mremap_thp(test_fn fn, const char *desc, size_t size)
 {
-	ksft_print_msg("[RUN] %s ... with partially mremap()'ed THP\n", desc);
-	do_run_with_thp(fn, THP_RUN_PARTIAL_MREMAP);
+	ksft_print_msg("[RUN] %s ... with partially mremap()'ed order-%d THP\n",
+		desc, sz2ord(size));
+	do_run_with_thp(fn, THP_RUN_PARTIAL_MREMAP, size);
 }
 
-static void run_with_partial_shared_thp(test_fn fn, const char *desc)
+static void run_with_partial_shared_thp(test_fn fn, const char *desc, size_t size)
 {
-	ksft_print_msg("[RUN] %s ... with partially shared THP\n", desc);
-	do_run_with_thp(fn, THP_RUN_PARTIAL_SHARED);
+	ksft_print_msg("[RUN] %s ... with partially shared order-%d THP\n",
+		desc, sz2ord(size));
+	do_run_with_thp(fn, THP_RUN_PARTIAL_SHARED, size);
 }
 
 static void run_with_hugetlb(test_fn fn, const char *desc, size_t hugetlbsize)
@@ -1091,15 +1108,15 @@ static void run_anon_test_case(struct test_case const *test_case)
 
 	run_with_base_page(test_case->fn, test_case->desc);
 	run_with_base_page_swap(test_case->fn, test_case->desc);
-	if (thpsize) {
-		run_with_thp(test_case->fn, test_case->desc);
-		run_with_thp_swap(test_case->fn, test_case->desc);
-		run_with_pte_mapped_thp(test_case->fn, test_case->desc);
-		run_with_pte_mapped_thp_swap(test_case->fn, test_case->desc);
-		run_with_single_pte_of_thp(test_case->fn, test_case->desc);
-		run_with_single_pte_of_thp_swap(test_case->fn, test_case->desc);
-		run_with_partial_mremap_thp(test_case->fn, test_case->desc);
-		run_with_partial_shared_thp(test_case->fn, test_case->desc);
+	if (pmdsize) {
+		run_with_thp(test_case->fn, test_case->desc, pmdsize);
+		run_with_thp_swap(test_case->fn, test_case->desc, pmdsize);
+		run_with_pte_mapped_thp(test_case->fn, test_case->desc, pmdsize);
+		run_with_pte_mapped_thp_swap(test_case->fn, test_case->desc, pmdsize);
+		run_with_single_pte_of_thp(test_case->fn, test_case->desc, pmdsize);
+		run_with_single_pte_of_thp_swap(test_case->fn, test_case->desc, pmdsize);
+		run_with_partial_mremap_thp(test_case->fn, test_case->desc, pmdsize);
+		run_with_partial_shared_thp(test_case->fn, test_case->desc, pmdsize);
 	}
 	for (i = 0; i < nr_hugetlbsizes; i++)
 		run_with_hugetlb(test_case->fn, test_case->desc,
@@ -1120,7 +1137,7 @@ static int tests_per_anon_test_case(void)
 {
 	int tests = 2 + nr_hugetlbsizes;
 
-	if (thpsize)
+	if (pmdsize)
 		tests += 8;
 	return tests;
 }
@@ -1329,7 +1346,7 @@ static void run_anon_thp_test_cases(void)
 {
 	int i;
 
-	if (!thpsize)
+	if (!pmdsize)
 		return;
 
 	ksft_print_msg("[INFO] Anonymous THP tests\n");
@@ -1338,13 +1355,13 @@ static void run_anon_thp_test_cases(void)
 		struct test_case const *test_case = &anon_thp_test_cases[i];
 
 		ksft_print_msg("[RUN] %s\n", test_case->desc);
-		do_run_with_thp(test_case->fn, THP_RUN_PMD);
+		do_run_with_thp(test_case->fn, THP_RUN_PMD, pmdsize);
 	}
 }
 
 static int tests_per_anon_thp_test_case(void)
 {
-	return thpsize ? 1 : 0;
+	return pmdsize ? 1 : 0;
 }
 
 typedef void (*non_anon_test_fn)(char *mem, const char *smem, size_t size);
@@ -1419,7 +1436,7 @@ static void run_with_huge_zeropage(non_anon_test_fn fn, const char *desc)
 	}
 
 	/* For alignment purposes, we need twice the thp size. */
-	mmap_size = 2 * thpsize;
+	mmap_size = 2 * pmdsize;
 	mmap_mem = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
 			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
 	if (mmap_mem == MAP_FAILED) {
@@ -1434,11 +1451,11 @@ static void run_with_huge_zeropage(non_anon_test_fn fn, const char *desc)
 	}
 
 	/* We need a THP-aligned memory area. */
-	mem = (char *)(((uintptr_t)mmap_mem + thpsize) & ~(thpsize - 1));
-	smem = (char *)(((uintptr_t)mmap_smem + thpsize) & ~(thpsize - 1));
+	mem = (char *)(((uintptr_t)mmap_mem + pmdsize) & ~(pmdsize - 1));
+	smem = (char *)(((uintptr_t)mmap_smem + pmdsize) & ~(pmdsize - 1));
 
-	ret = madvise(mem, thpsize, MADV_HUGEPAGE);
-	ret |= madvise(smem, thpsize, MADV_HUGEPAGE);
+	ret = madvise(mem, pmdsize, MADV_HUGEPAGE);
+	ret |= madvise(smem, pmdsize, MADV_HUGEPAGE);
 	if (ret) {
 		ksft_test_result_fail("MADV_HUGEPAGE failed\n");
 		goto munmap;
@@ -1457,7 +1474,7 @@ static void run_with_huge_zeropage(non_anon_test_fn fn, const char *desc)
 		goto munmap;
 	}
 
-	fn(mem, smem, thpsize);
+	fn(mem, smem, pmdsize);
 munmap:
 	munmap(mmap_mem, mmap_size);
 	if (mmap_smem != MAP_FAILED)
@@ -1650,7 +1667,7 @@ static void run_non_anon_test_case(struct non_anon_test_case const *test_case)
 	run_with_zeropage(test_case->fn, test_case->desc);
 	run_with_memfd(test_case->fn, test_case->desc);
 	run_with_tmpfile(test_case->fn, test_case->desc);
-	if (thpsize)
+	if (pmdsize)
 		run_with_huge_zeropage(test_case->fn, test_case->desc);
 	for (i = 0; i < nr_hugetlbsizes; i++)
 		run_with_memfd_hugetlb(test_case->fn, test_case->desc,
@@ -1671,7 +1688,7 @@ static int tests_per_non_anon_test_case(void)
 {
 	int tests = 3 + nr_hugetlbsizes;
 
-	if (thpsize)
+	if (pmdsize)
 		tests += 1;
 	return tests;
 }
@@ -1681,10 +1698,10 @@ int main(int argc, char **argv)
 	int err;
 
 	pagesize = getpagesize();
-	thpsize = read_pmd_pagesize();
-	if (thpsize)
-		ksft_print_msg("[INFO] detected THP size: %zu KiB\n",
-			       thpsize / 1024);
+	pmdsize = read_pmd_pagesize();
+	if (pmdsize)
+		ksft_print_msg("[INFO] detected PMD-mapped THP size: %zu KiB\n",
+			       pmdsize / 1024);
 	nr_hugetlbsizes = detect_hugetlb_page_sizes(hugetlbsizes,
 						    ARRAY_SIZE(hugetlbsizes));
 	detect_huge_zeropage();
-- 
2.25.1


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH v6 9/9] selftests/mm/cow: Add tests for small-order anon THP
  2023-09-29 11:44 ` Ryan Roberts
@ 2023-09-29 11:44   ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-09-29 11:44 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: Ryan Roberts, linux-mm, linux-kernel, linux-arm-kernel

Add tests similar to the existing THP tests, but operating on memory
backed by smaller-order, PTE-mapped THP. These tests reuse all the
existing infrastructure. If the test suite detects that the kernel does
not support small-order THP, the new tests are skipped.
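
As a worked example (assuming 4K base pages and a 2M PMD, and reusing
the sz2ord() helper from the previous patch), the setup code added
below ends up computing:

	int pmdorder = sz2ord(2UL * 1024 * 1024);	/* 9 */
	int pteorder = pmdorder - 1;			/* 8 */
	unsigned int orders = (1UL << pmdorder) | (1UL << pteorder); /* 0x300 */
	size_t ptesize = pagesize << pteorder;		/* 1 MiB */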

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 tools/testing/selftests/mm/cow.c | 93 ++++++++++++++++++++++++++++++++
 1 file changed, 93 insertions(+)

diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c
index d887ce454e34..6c5e37d8bb69 100644
--- a/tools/testing/selftests/mm/cow.c
+++ b/tools/testing/selftests/mm/cow.c
@@ -33,10 +33,13 @@
 static size_t pagesize;
 static int pagemap_fd;
 static size_t pmdsize;
+static size_t ptesize;
 static int nr_hugetlbsizes;
 static size_t hugetlbsizes[10];
 static int gup_fd;
 static bool has_huge_zeropage;
+static unsigned int orig_anon_orders;
+static bool orig_anon_orders_valid;
 
 static void detect_huge_zeropage(void)
 {
@@ -1118,6 +1121,14 @@ static void run_anon_test_case(struct test_case const *test_case)
 		run_with_partial_mremap_thp(test_case->fn, test_case->desc, pmdsize);
 		run_with_partial_shared_thp(test_case->fn, test_case->desc, pmdsize);
 	}
+	if (ptesize) {
+		run_with_pte_mapped_thp(test_case->fn, test_case->desc, ptesize);
+		run_with_pte_mapped_thp_swap(test_case->fn, test_case->desc, ptesize);
+		run_with_single_pte_of_thp(test_case->fn, test_case->desc, ptesize);
+		run_with_single_pte_of_thp_swap(test_case->fn, test_case->desc, ptesize);
+		run_with_partial_mremap_thp(test_case->fn, test_case->desc, ptesize);
+		run_with_partial_shared_thp(test_case->fn, test_case->desc, ptesize);
+	}
 	for (i = 0; i < nr_hugetlbsizes; i++)
 		run_with_hugetlb(test_case->fn, test_case->desc,
 				 hugetlbsizes[i]);
@@ -1139,6 +1150,8 @@ static int tests_per_anon_test_case(void)
 
 	if (pmdsize)
 		tests += 8;
+	if (ptesize)
+		tests += 6;
 	return tests;
 }
 
@@ -1693,6 +1706,80 @@ static int tests_per_non_anon_test_case(void)
 	return tests;
 }
 
+#define ANON_ORDERS_FILE "/sys/kernel/mm/transparent_hugepage/anon_orders"
+
+static int read_anon_orders(unsigned int *orders)
+{
+	ssize_t buflen = 80;
+	char buf[buflen];
+	int fd;
+
+	fd = open(ANON_ORDERS_FILE, O_RDONLY);
+	if (fd == -1)
+		return -1;
+
+	buflen = read(fd, buf, buflen);
+	close(fd);
+
+	if (buflen < 1)
+		return -1;
+
+	*orders = strtoul(buf, NULL, 16);
+
+	return 0;
+}
+
+static int write_anon_orders(unsigned int orders)
+{
+	ssize_t buflen = 80;
+	char buf[buflen];
+	int fd;
+
+	fd = open(ANON_ORDERS_FILE, O_WRONLY);
+	if (fd == -1)
+		return -1;
+
+	buflen = snprintf(buf, buflen, "0x%08x\n", orders);
+	buflen = write(fd, buf, buflen);
+	close(fd);
+
+	if (buflen < 1)
+		return -1;
+
+	return 0;
+}
+
+static size_t save_thp_anon_orders(void)
+{
+	/*
+	 * If the kernel supports multiple orders for anon THP (indicated by the
+	 * presence of anon_orders file), configure it for the PMD-order and the
+	 * PMD-order - 1, which we will report back and use as the PTE-order THP
+	 * size. Save the original value so that it can be restored on exit. If
+	 * the kernel does not support multiple orders, report back 0 for the
+	 * PTE-size so those tests are skipped.
+	 */
+
+	int pteorder = sz2ord(pmdsize) - 1;
+	unsigned int orders = (1UL << sz2ord(pmdsize)) | (1UL << pteorder);
+
+	if (read_anon_orders(&orig_anon_orders))
+		return 0;
+
+	orig_anon_orders_valid = true;
+
+	if (write_anon_orders(orders))
+		return 0;
+
+	return pagesize << pteorder;
+}
+
+static void restore_thp_anon_orders(void)
+{
+	if (orig_anon_orders_valid)
+		write_anon_orders(orig_anon_orders);
+}
+
 int main(int argc, char **argv)
 {
 	int err;
@@ -1702,6 +1789,10 @@ int main(int argc, char **argv)
 	if (pmdsize)
 		ksft_print_msg("[INFO] detected PMD-mapped THP size: %zu KiB\n",
 			       pmdsize / 1024);
+	ptesize = save_thp_anon_orders();
+	if (ptesize)
+		ksft_print_msg("[INFO] configured PTE-mapped THP size: %zu KiB\n",
+			       ptesize / 1024);
 	nr_hugetlbsizes = detect_hugetlb_page_sizes(hugetlbsizes,
 						    ARRAY_SIZE(hugetlbsizes));
 	detect_huge_zeropage();
@@ -1720,6 +1811,8 @@ int main(int argc, char **argv)
 	run_anon_thp_test_cases();
 	run_non_anon_test_cases();
 
+	restore_thp_anon_orders();
+
 	err = ksft_get_fail_cnt();
 	if (err)
 		ksft_exit_fail_msg("%d out of %d tests failed\n",
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 2/9] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
  2023-09-29 11:44   ` Ryan Roberts
@ 2023-09-29 13:45     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 140+ messages in thread
From: Kirill A. Shutemov @ 2023-09-29 13:45 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama, John Hubbard,
	David Rientjes, Vlastimil Babka, Hugh Dickins, linux-mm,
	linux-kernel, linux-arm-kernel

On Fri, Sep 29, 2023 at 12:44:13PM +0100, Ryan Roberts wrote:
> In preparation for anonymous large folio support, improve
> folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be
> passed to it. In this case, all contained pages are accounted using the
> order-0 folio (or base page) scheme.
> 
> Reviewed-by: Yu Zhao <yuzhao@google.com>
> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  mm/rmap.c | 27 ++++++++++++++++++++-------
>  1 file changed, 20 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 8600bd029acf..106149690366 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1266,31 +1266,44 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
>   * This means the inc-and-test can be bypassed.
>   * The folio does not have to be locked.
>   *
> - * If the folio is large, it is accounted as a THP.  As the folio
> + * If the folio is pmd-mappable, it is accounted as a THP.  As the folio
>   * is new, it's assumed to be mapped exclusively by a single process.
>   */
>  void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>  		unsigned long address)
>  {
> -	int nr;
> +	int nr = folio_nr_pages(folio);
>  
> -	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
> +	VM_BUG_ON_VMA(address < vma->vm_start ||
> +			address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
>  	__folio_set_swapbacked(folio);
>  
> -	if (likely(!folio_test_pmd_mappable(folio))) {
> +	if (likely(!folio_test_large(folio))) {
>  		/* increment count (starts at -1) */
>  		atomic_set(&folio->_mapcount, 0);
> -		nr = 1;
> +		__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
> +	} else if (!folio_test_pmd_mappable(folio)) {
> +		int i;
> +
> +		for (i = 0; i < nr; i++) {
> +			struct page *page = folio_page(folio, i);
> +
> +			/* increment count (starts at -1) */
> +			atomic_set(&page->_mapcount, 0);
> +			__page_set_anon_rmap(folio, page, vma,
> +					address + (i << PAGE_SHIFT), 1);
> +		}
> +
> +		atomic_set(&folio->_nr_pages_mapped, nr);

This code should work for !folio_test_large() case too, no?

>  	} else {
>  		/* increment count (starts at -1) */
>  		atomic_set(&folio->_entire_mapcount, 0);
>  		atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);
> -		nr = folio_nr_pages(folio);
> +		__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>  		__lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
>  	}
>  
>  	__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
> -	__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>  }
>  
>  /**
> -- 
> 2.25.1
> 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 2/9] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
  2023-09-29 13:45     ` Kirill A. Shutemov
@ 2023-09-29 14:39       ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-09-29 14:39 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama, John Hubbard,
	David Rientjes, Vlastimil Babka, Hugh Dickins, linux-mm,
	linux-kernel, linux-arm-kernel

On 29/09/2023 14:45, Kirill A. Shutemov wrote:
> On Fri, Sep 29, 2023 at 12:44:13PM +0100, Ryan Roberts wrote:
>> In preparation for anonymous large folio support, improve
>> folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be
>> passed to it. In this case, all contained pages are accounted using the
>> order-0 folio (or base page) scheme.
>>
>> Reviewed-by: Yu Zhao <yuzhao@google.com>
>> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  mm/rmap.c | 27 ++++++++++++++++++++-------
>>  1 file changed, 20 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 8600bd029acf..106149690366 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1266,31 +1266,44 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
>>   * This means the inc-and-test can be bypassed.
>>   * The folio does not have to be locked.
>>   *
>> - * If the folio is large, it is accounted as a THP.  As the folio
>> + * If the folio is pmd-mappable, it is accounted as a THP.  As the folio
>>   * is new, it's assumed to be mapped exclusively by a single process.
>>   */
>>  void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>>  		unsigned long address)
>>  {
>> -	int nr;
>> +	int nr = folio_nr_pages(folio);
>>  
>> -	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
>> +	VM_BUG_ON_VMA(address < vma->vm_start ||
>> +			address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
>>  	__folio_set_swapbacked(folio);
>>  
>> -	if (likely(!folio_test_pmd_mappable(folio))) {
>> +	if (likely(!folio_test_large(folio))) {
>>  		/* increment count (starts at -1) */
>>  		atomic_set(&folio->_mapcount, 0);
>> -		nr = 1;
>> +		__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>> +	} else if (!folio_test_pmd_mappable(folio)) {
>> +		int i;
>> +
>> +		for (i = 0; i < nr; i++) {
>> +			struct page *page = folio_page(folio, i);
>> +
>> +			/* increment count (starts at -1) */
>> +			atomic_set(&page->_mapcount, 0);
>> +			__page_set_anon_rmap(folio, page, vma,
>> +					address + (i << PAGE_SHIFT), 1);
>> +		}
>> +
>> +		atomic_set(&folio->_nr_pages_mapped, nr);
> 
> This code should work for !folio_test_large() case too, no?

Not quite; for !folio_test_large() we don't set _nr_pages_mapped - that's a
compound-only field in the second struct page. So I could make most of this
common but would still have a conditional around that last line, and at that
point I thought it was better to split it the way I've done it to avoid the loop
overhead for the !large case.
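
(Purely for illustration, a combined version might look like the sketch
below - the trailing conditional is the part I wanted to avoid; this is
not part of the patch:)

	/* sketch only: unified path for order-0 and non-pmd-mappable folios */
	for (i = 0; i < nr; i++) {
		struct page *page = folio_page(folio, i);

		/* increment count (starts at -1) */
		atomic_set(&page->_mapcount, 0);
		__page_set_anon_rmap(folio, page, vma,
				address + (i << PAGE_SHIFT), 1);
	}
	if (folio_test_large(folio))
		/* _nr_pages_mapped only exists for large (compound) folios */
		atomic_set(&folio->_nr_pages_mapped, nr);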

> 
>>  	} else {
>>  		/* increment count (starts at -1) */
>>  		atomic_set(&folio->_entire_mapcount, 0);
>>  		atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);
>> -		nr = folio_nr_pages(folio);
>> +		__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>>  		__lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
>>  	}
>>  
>>  	__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
>> -	__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>>  }
>>  
>>  /**
>> -- 
>> 2.25.1
>>
> 


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 4/9] mm: thp: Introduce anon_orders and anon_always_mask sysfs files
  2023-09-29 11:44   ` Ryan Roberts
  (?)
@ 2023-09-29 22:55     ` Andrew Morton
  -1 siblings, 0 replies; 140+ messages in thread
From: Andrew Morton @ 2023-09-29 22:55 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Matthew Wilcox, Yin Fengwei, David Hildenbrand, Yu Zhao,
	Catalin Marinas, Anshuman Khandual, Yang Shi, Huang, Ying,
	Zi Yan, Luis Chamberlain, Itaru Kitayama, Kirill A. Shutemov,
	John Hubbard, David Rientjes, Vlastimil Babka, Hugh Dickins,
	linux-mm, linux-kernel, linux-arm-kernel, linuxppc-dev

On Fri, 29 Sep 2023 12:44:15 +0100 Ryan Roberts <ryan.roberts@arm.com> wrote:

> In preparation for adding support for anonymous large folios that are
> smaller than the PMD-size, introduce 2 new sysfs files that will be used
> to control the new behaviours via the transparent_hugepage interface.
> For now, the kernel still only supports PMD-order anonymous THP, so when
> reading back anon_orders, it will reflect that. Therefore there are no
> behavioural changes intended here.

powerpc strikes again.  ARCH=powerpc allmodconfig:


In file included from ./include/linux/bits.h:6,
                 from ./include/linux/ratelimit_types.h:5,
                 from ./include/linux/printk.h:9,
                 from ./include/asm-generic/bug.h:22,
                 from ./arch/powerpc/include/asm/bug.h:116,
                 from ./include/linux/bug.h:5,
                 from ./include/linux/mmdebug.h:5,
                 from ./include/linux/mm.h:6,
                 from mm/huge_memory.c:8:
./include/vdso/bits.h:7:33: error: initializer element is not constant
    7 | #define BIT(nr)                 (UL(1) << (nr))
      |                                 ^
mm/huge_memory.c:77:47: note: in expansion of macro 'BIT'
   77 | unsigned int huge_anon_orders __read_mostly = BIT(PMD_ORDER);
      |                                               ^~~

We keep tripping over this.  I wish there were a way to fix it.



Style whine: an all-caps identifier is supposed to be a constant,
dammit.

	#define PTE_INDEX_SIZE  __pte_index_size

Nope.
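
(Roughly, it's the classic problem sketched below - illustrative only,
not the actual powerpc definitions:)

	/*
	 * A file-scope initializer must be a compile-time constant. On
	 * powerpc, PMD_ORDER effectively expands to a variable set at boot,
	 * much like 'shift' here, so the first initializer does not build:
	 */
	extern unsigned int shift;		/* runtime value, like __pte_index_size */
	unsigned int orders = 1U << shift;	/* error: initializer element is not constant */

	void setup_orders(void)
	{
		orders = 1U << shift;		/* fine: assigned at run time */
	}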



I did this:

--- a/mm/huge_memory.c~mm-thp-introduce-anon_orders-and-anon_always_mask-sysfs-files-fix
+++ a/mm/huge_memory.c
@@ -74,7 +74,7 @@ static unsigned long deferred_split_scan
 static atomic_t huge_zero_refcount;
 struct page *huge_zero_page __read_mostly;
 unsigned long huge_zero_pfn __read_mostly = ~0UL;
-unsigned int huge_anon_orders __read_mostly = BIT(PMD_ORDER);
+unsigned int huge_anon_orders __read_mostly;
 static unsigned int huge_anon_always_mask __read_mostly;
 
 /**
@@ -528,6 +528,9 @@ static int __init hugepage_init_sysfs(st
 {
 	int err;
 
+	/* powerpc's PMD_ORDER isn't a compile-time constant */
+	huge_anon_orders = BIT(PMD_ORDER);
+
 	*hugepage_kobj = kobject_create_and_add("transparent_hugepage", mm_kobj);
 	if (unlikely(!*hugepage_kobj)) {
 		pr_err("failed to create transparent hugepage kobject\n");
_


I assume this is set up early enough.

I don't know why powerpc's PTE_INDEX_SIZE is variable.  Hopefully it
has been set up by this time and it won't get altered.  


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 4/9] mm: thp: Introduce anon_orders and anon_always_mask sysfs files
  2023-09-29 22:55     ` Andrew Morton
  (?)
@ 2023-10-02 10:15       ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-10-02 10:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox, Yin Fengwei, David Hildenbrand, Yu Zhao,
	Catalin Marinas, Anshuman Khandual, Yang Shi, Huang, Ying,
	Zi Yan, Luis Chamberlain, Itaru Kitayama, Kirill A. Shutemov,
	John Hubbard, David Rientjes, Vlastimil Babka, Hugh Dickins,
	linux-mm, linux-kernel, linux-arm-kernel, linuxppc-dev

On 29/09/2023 23:55, Andrew Morton wrote:
> On Fri, 29 Sep 2023 12:44:15 +0100 Ryan Roberts <ryan.roberts@arm.com> wrote:
> 
>> In preparation for adding support for anonymous large folios that are
>> smaller than the PMD-size, introduce 2 new sysfs files that will be used
>> to control the new behaviours via the transparent_hugepage interface.
>> For now, the kernel still only supports PMD-order anonymous THP, so when
>> reading back anon_orders, it will reflect that. Therefore there are no
>> behavioural changes intended here.
> 
> powerpc strikes again.  ARCH=powerpc allmodconfig:
> 
> 
> In file included from ./include/linux/bits.h:6,
>                  from ./include/linux/ratelimit_types.h:5,
>                  from ./include/linux/printk.h:9,
>                  from ./include/asm-generic/bug.h:22,
>                  from ./arch/powerpc/include/asm/bug.h:116,
>                  from ./include/linux/bug.h:5,
>                  from ./include/linux/mmdebug.h:5,
>                  from ./include/linux/mm.h:6,
>                  from mm/huge_memory.c:8:
> ./include/vdso/bits.h:7:33: error: initializer element is not constant
>     7 | #define BIT(nr)                 (UL(1) << (nr))
>       |                                 ^
> mm/huge_memory.c:77:47: note: in expansion of macro 'BIT'
>    77 | unsigned int huge_anon_orders __read_mostly = BIT(PMD_ORDER);
>       |                                               ^~~

Ahh my bad, sorry about that - I built various configs and arches but not powerpc.

> 
> We keep tripping over this.  I wish there was a way to fix it.
> 
> 
> 
> Style whine: an all-caps identifier is supposed to be a constant,
> dammit.
> 
> 	#define PTE_INDEX_SIZE  __pte_index_size
> 
> Nope.
> 
> 
> 
> I did this:
> 
> --- a/mm/huge_memory.c~mm-thp-introduce-anon_orders-and-anon_always_mask-sysfs-files-fix
> +++ a/mm/huge_memory.c
> @@ -74,7 +74,7 @@ static unsigned long deferred_split_scan
>  static atomic_t huge_zero_refcount;
>  struct page *huge_zero_page __read_mostly;
>  unsigned long huge_zero_pfn __read_mostly = ~0UL;
> -unsigned int huge_anon_orders __read_mostly = BIT(PMD_ORDER);
> +unsigned int huge_anon_orders __read_mostly;
>  static unsigned int huge_anon_always_mask __read_mostly;
>  
>  /**
> @@ -528,6 +528,9 @@ static int __init hugepage_init_sysfs(st
>  {
>  	int err;
>  
> +	/* powerpc's PMD_ORDER isn't a compile-time constant */
> +	huge_anon_orders = BIT(PMD_ORDER);
> +
>  	*hugepage_kobj = kobject_create_and_add("transparent_hugepage", mm_kobj);
>  	if (unlikely(!*hugepage_kobj)) {
>  		pr_err("failed to create transparent hugepage kobject\n");
> _
> 
> 
> I assume this is set up early enough.

Yes this should be fine.

> 
> I don't know why powerpc's PTE_INDEX_SIZE is variable.  Hopefully it
> has been set up by this time and it won't get altered.  

Looks that way from the code; it's set during early_init_mmu().

Anyway, I'll take the fix into my next spin if I need to do one. I see you've
taken it into mm-unstable - thanks! But given I'm introducing UABI, I was
expecting some comments and probably a need for a new rev. I'd like to think
we are getting there though.

Thanks,
Ryan


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 7/9] arm64/mm: Override arch_wants_pte_order()
  2023-09-29 11:44   ` Ryan Roberts
@ 2023-10-02 15:21     ` Catalin Marinas
  -1 siblings, 0 replies; 140+ messages in thread
From: Catalin Marinas @ 2023-10-02 15:21 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand,
	Yu Zhao, Anshuman Khandual, Yang Shi, Huang, Ying, Zi Yan,
	Luis Chamberlain, Itaru Kitayama, Kirill A. Shutemov,
	John Hubbard, David Rientjes, Vlastimil Babka, Hugh Dickins,
	linux-mm, linux-kernel, linux-arm-kernel

On Fri, Sep 29, 2023 at 12:44:18PM +0100, Ryan Roberts wrote:
> Define an arch-specific override of arch_wants_pte_order() so that when
> anon_orders=recommend is set, large folios will be allocated for
> anonymous memory with an order that is compatible with arm64's HPA uarch
> feature.
> 
> Reviewed-by: Yu Zhao <yuzhao@google.com>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>

Acked-by: Catalin Marinas <catalin.marinas@arm.com>

> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 7f7d9b1df4e5..e3d2449dec5c 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1110,6 +1110,16 @@ extern pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
>  extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
>  				    unsigned long addr, pte_t *ptep,
>  				    pte_t old_pte, pte_t new_pte);
> +
> +#define arch_wants_pte_order arch_wants_pte_order
> +static inline int arch_wants_pte_order(void)
> +{
> +	/*
> +	 * Many arm64 CPUs support hardware page aggregation (HPA), which can
> +	 * coalesce 4 contiguous pages into a single TLB entry.
> +	 */
> +	return 2;
> +}

I haven't followed the discussions on previous revisions of this series,
but I wonder why not return a bitmap from arch_wants_pte_order(). For
arm64 we may want an order 6 at some point (contiguous ptes) with a
fallback to order 2 as the next best.

-- 
Catalin

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 7/9] arm64/mm: Override arch_wants_pte_order()
  2023-10-02 15:21     ` Catalin Marinas
@ 2023-10-03  7:32       ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-10-03  7:32 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand,
	Yu Zhao, Anshuman Khandual, Yang Shi, Huang, Ying, Zi Yan,
	Luis Chamberlain, Itaru Kitayama, Kirill A. Shutemov,
	John Hubbard, David Rientjes, Vlastimil Babka, Hugh Dickins,
	linux-mm, linux-kernel, linux-arm-kernel

On 02/10/2023 16:21, Catalin Marinas wrote:
> On Fri, Sep 29, 2023 at 12:44:18PM +0100, Ryan Roberts wrote:
>> Define an arch-specific override of arch_wants_pte_order() so that when
>> anon_orders=recommend is set, large folios will be allocated for
>> anonymous memory with an order that is compatible with arm64's HPA uarch
>> feature.
>>
>> Reviewed-by: Yu Zhao <yuzhao@google.com>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> 
> Acked-by: Catalin Marinas <catalin.marinas@arm.com>
> 
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 7f7d9b1df4e5..e3d2449dec5c 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1110,6 +1110,16 @@ extern pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
>>  extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
>>  				    unsigned long addr, pte_t *ptep,
>>  				    pte_t old_pte, pte_t new_pte);
>> +
>> +#define arch_wants_pte_order arch_wants_pte_order
>> +static inline int arch_wants_pte_order(void)
>> +{
>> +	/*
>> +	 * Many arm64 CPUs support hardware page aggregation (HPA), which can
>> +	 * coalesce 4 contiguous pages into a single TLB entry.
>> +	 */
>> +	return 2;
>> +}
> 
> I haven't followed the discussions on previous revisions of this series
> but I wonder why not return a bitmap from arch_wants_pte_order(). For
> arm64 we may want an order 6 at some point (contiguous ptes) with a
> fallback to order 2 as the next best.
> 

This sounds like a good idea to me - I'll implement it, assuming there is a next
rev. (Or in the unlikely event that this is the only pending change, I'd rather
defer it to when we actually need it with the contpte series).

This is just a hangover from the "MVP" approach that I was pursuing in v5, where
we didn't want to configure too many orders for fear of fragmentation. But in v6
I've introduced UABI to configure the set of orders, and this function feeds
into the special "recommend" set. So I think it is appropriate that this API
allows expression of multiple orders as you suggest.

Side note: I don't think order-6 is ever a contpte size? It's order-4 for 4K
(16 contiguous PTEs = 64K), order-7 for 16K (128 PTEs = 2M) and order-5 for 64K
(32 PTEs = 2M).
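
For the record, here's a rough sketch of what the bitmap-returning variant
could look like on arm64. This is illustrative only: the function name, return
type and the exact set of orders are my assumptions, and the contpte order
assumes 4K base pages.

	#define arch_wants_pte_orders arch_wants_pte_orders
	static inline unsigned int arch_wants_pte_orders(void)
	{
		/*
		 * Each set bit n means order-n is a preferred allocation
		 * size. Prefer the contpte size (order-4 with 4K pages),
		 * falling back to order-2, which HPA can still coalesce
		 * into a single TLB entry.
		 */
		return BIT(4) | BIT(2);
	}

The fault path could then intersect this with the orders enabled via
anon_orders and walk down from the highest set bit, much like the per-order
retry loop in alloc_anon_folio() already does.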


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 7/9] arm64/mm: Override arch_wants_pte_order()
  2023-10-03  7:32       ` Ryan Roberts
@ 2023-10-03 12:05         ` Catalin Marinas
  -1 siblings, 0 replies; 140+ messages in thread
From: Catalin Marinas @ 2023-10-03 12:05 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand,
	Yu Zhao, Anshuman Khandual, Yang Shi, Huang, Ying, Zi Yan,
	Luis Chamberlain, Itaru Kitayama, Kirill A. Shutemov,
	John Hubbard, David Rientjes, Vlastimil Babka, Hugh Dickins,
	linux-mm, linux-kernel, linux-arm-kernel

On Tue, Oct 03, 2023 at 08:32:29AM +0100, Ryan Roberts wrote:
> On 02/10/2023 16:21, Catalin Marinas wrote:
> > On Fri, Sep 29, 2023 at 12:44:18PM +0100, Ryan Roberts wrote:
> >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> >> index 7f7d9b1df4e5..e3d2449dec5c 100644
> >> --- a/arch/arm64/include/asm/pgtable.h
> >> +++ b/arch/arm64/include/asm/pgtable.h
> >> @@ -1110,6 +1110,16 @@ extern pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
> >>  extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
> >>  				    unsigned long addr, pte_t *ptep,
> >>  				    pte_t old_pte, pte_t new_pte);
> >> +
> >> +#define arch_wants_pte_order arch_wants_pte_order
> >> +static inline int arch_wants_pte_order(void)
> >> +{
> >> +	/*
> >> +	 * Many arm64 CPUs support hardware page aggregation (HPA), which can
> >> +	 * coalesce 4 contiguous pages into a single TLB entry.
> >> +	 */
> >> +	return 2;
> >> +}
> > 
> > I haven't followed the discussions on previous revisions of this series
> > but I wonder why not return a bitmap from arch_wants_pte_order(). For
> > arm64 we may want an order 6 at some point (contiguous ptes) with a
> > fallback to order 2 as the next best.
> 
> This sounds like good idea to me - I'll implement it, assuming there is a next
> rev. (Or in the unlikely event that this is the only pending change, I'd rather
> defer it to when we actually need it with the contpte series).

Fine by me; at the moment there wouldn't be any user, so a patch on top
later would do.

> Side note: I don't think order-6 is ever a contpte size? Its order-4 for 4K,
> order-7 for 16k and order-5 for 64k.

Yes, it's order-4 for 4K pages (I was thinking too much of the "64" in 64KB).

-- 
Catalin

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 1/9] mm: Allow deferred splitting of arbitrary anon large folios
  2023-09-29 11:44   ` Ryan Roberts
@ 2023-10-05  8:19     ` David Hildenbrand
  -1 siblings, 0 replies; 140+ messages in thread
From: David Hildenbrand @ 2023-10-05  8:19 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: linux-mm, linux-kernel, linux-arm-kernel

On 29.09.23 13:44, Ryan Roberts wrote:
> In preparation for the introduction of large folios for anonymous
> memory, we would like to be able to split them when they have unmapped
> subpages, in order to free those unused pages under memory pressure. So
> remove the artificial requirement that the large folio needed to be at
> least PMD-sized.
> 
> Reviewed-by: Yu Zhao <yuzhao@google.com>
> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>   mm/rmap.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 9f795b93cf40..8600bd029acf 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1446,11 +1446,11 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
>   		__lruvec_stat_mod_folio(folio, idx, -nr);
>   
>   		/*
> -		 * Queue anon THP for deferred split if at least one
> +		 * Queue anon large folio for deferred split if at least one
>   		 * page of the folio is unmapped and at least one page
>   		 * is still mapped.
>   		 */
> -		if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
> +		if (folio_test_large(folio) && folio_test_anon(folio))
>   			if (!compound || nr < nr_pmdmapped)
>   				deferred_split_folio(folio);
>   	}

This patch can be picked up early, I think.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 5/9] mm: thp: Extend THP to allocate anonymous large folios
       [not found]   ` <CGME20231005120507eucas1p13f50fa99f52808818840ee7db194e12e@eucas1p1.samsung.com>
@ 2023-10-05 12:05       ` Daniel Gomez
  0 siblings, 0 replies; 140+ messages in thread
From: Daniel Gomez @ 2023-10-05 12:05 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Anshuman Khandual,
	Yang Shi, Huang, Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: linux-mm, linux-kernel, linux-arm-kernel

On Fri, Sep 29, 2023 at 12:44:16PM +0100, Ryan Roberts wrote:

Hi Ryan,
> Introduce the logic to allow THP to be configured (through the new
> anon_orders interface we just added) to allocate large folios to back
> anonymous memory, which are smaller than PMD-size (for example order-2,
> order-3, order-4, etc).
>
> These THPs continue to be PTE-mapped, but in many cases can still
> provide similar benefits to traditional PMD-sized THP: Page faults are
> significantly reduced (by a factor of e.g. 4, 8, 16, etc. depending on
> the configured order), but latency spikes are much less prominent
> because the size of each page isn't as huge as the PMD-sized variant and
> there is less memory to clear in each page fault. The number of per-page
> operations (e.g. ref counting, rmap management, lru list management) are
> also significantly reduced since those ops now become per-folio.
>
> Some architectures also employ TLB compression mechanisms to squeeze
> more entries in when a set of PTEs are virtually and physically
> contiguous and appropriately aligned. In this case, TLB misses will
> occur less often.
>
> The new behaviour is disabled by default because the anon_orders
> defaults to only enabling PMD-order, but can be enabled at runtime by
> writing to anon_orders (see documentation in previous commit). The long
> term aim is to default anon_orders to include suitable lower orders, but
> there are some risks around internal fragmentation that need to be
> better understood first.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  Documentation/admin-guide/mm/transhuge.rst |   9 +-
>  include/linux/huge_mm.h                    |   6 +-
>  mm/memory.c                                | 108 +++++++++++++++++++--
>  3 files changed, 111 insertions(+), 12 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index 9f954e73a4ca..732c3b2f4ba8 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -353,7 +353,9 @@ anonymous transparent huge pages, it is necessary to read
>  ``/proc/PID/smaps`` and count the AnonHugePages and AnonHugePteMap
>  fields for each mapping. Note that in both cases, AnonHugePages refers
>  only to PMD-mapped THPs. AnonHugePteMap refers to THPs that are mapped
> -using PTEs.
> +using PTEs. This includes all THPs whose order is smaller than
> +PMD-order, as well as any PMD-order THPs that happen to be PTE-mapped
> +for other reasons.
>
>  The number of file transparent huge pages mapped to userspace is available
>  by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
> @@ -367,6 +369,11 @@ frequently will incur overhead.
>  There are a number of counters in ``/proc/vmstat`` that may be used to
>  monitor how successfully the system is providing huge pages for use.
>
> +.. note::
> +   Currently the below counters only record events relating to
> +   PMD-order THPs. Events relating to smaller order THPs are not
> +   included.
> +
>  thp_fault_alloc
>  	is incremented every time a huge page is successfully
>  	allocated to handle a page fault.
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2e7c338229a6..c4860476a1f5 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
>  #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>
>  /*
> - * Mask of all large folio orders supported for anonymous THP.
> + * Mask of all large folio orders supported for anonymous THP; all orders up to
> + * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
> + * (which is a limitation of the THP implementation).
>   */
> -#define THP_ORDERS_ALL_ANON	BIT(PMD_ORDER)
> +#define THP_ORDERS_ALL_ANON	((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
>
>  /*
>   * Mask of all large folio orders supported for file THP.
> diff --git a/mm/memory.c b/mm/memory.c
> index b5b82fc8e164..92ed9c782dc9 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4059,6 +4059,87 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	return ret;
>  }
>
> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
> +{
> +	int i;
> +
> +	if (nr_pages == 1)
> +		return vmf_pte_changed(vmf);
> +
> +	for (i = 0; i < nr_pages; i++) {
> +		if (!pte_none(ptep_get_lockless(vmf->pte + i)))
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> +{
> +	gfp_t gfp;
> +	pte_t *pte;
> +	unsigned long addr;
> +	struct folio *folio;
> +	struct vm_area_struct *vma = vmf->vma;
> +	unsigned int orders;
> +	int order;
> +
> +	/*
> +	 * If uffd is active for the vma we need per-page fault fidelity to
> +	 * maintain the uffd semantics.
> +	 */
> +	if (userfaultfd_armed(vma))
> +		goto fallback;
> +
> +	/*
> +	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
> +	 * for this vma. Then filter out the orders that can't be allocated over
> +	 * the faulting address and still be fully contained in the vma.
> +	 */
> +	orders = hugepage_vma_check(vma, vma->vm_flags, false, true, true,
> +				    BIT(PMD_ORDER) - 1);
> +	orders = transhuge_vma_suitable(vma, vmf->address, orders);
> +
> +	if (!orders)
> +		goto fallback;
> +
> +	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> +	if (!pte)
> +		return ERR_PTR(-EAGAIN);
> +
> +	order = first_order(orders);
> +	while (orders) {
> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> +		vmf->pte = pte + pte_index(addr);
> +		if (!vmf_pte_range_changed(vmf, 1 << order))
> +			break;
> +		order = next_order(&orders, order);
> +	}
> +
> +	vmf->pte = NULL;
> +	pte_unmap(pte);
> +
> +	gfp = vma_thp_gfp_mask(vma);
> +
> +	while (orders) {
> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> +		folio = vma_alloc_folio(gfp, order, vma, addr, true);

I was checking your series and noticed the hugepage flag. I think
you've changed it from false to true between v1 and v2 for orders >= 2,
but I'm not sure about the reasoning. Is this because of your statement
in the cover letter [1]?

[1] cover letter snippet:

"to implement variable order, large folios for anonymous memory.
(previously called ..., but now exposed as an extension to THP;
"small-order THP")"

Thanks,
Daniel

> +		if (folio) {
> +			clear_huge_page(&folio->page, addr, 1 << order);
> +			return folio;
> +		}
> +		order = next_order(&orders, order);
> +	}
> +
> +fallback:
> +	return vma_alloc_zeroed_movable_folio(vma, vmf->address);
> +}
> +#else
> +#define alloc_anon_folio(vmf) \
> +		vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address)
> +#endif
> +
>  /*
>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>   * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -4066,6 +4147,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   */
>  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  {
> +	int i;
> +	int nr_pages = 1;
> +	unsigned long addr = vmf->address;
>  	bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct folio *folio;
> @@ -4110,10 +4194,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  	/* Allocate our own private page. */
>  	if (unlikely(anon_vma_prepare(vma)))
>  		goto oom;
> -	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> +	folio = alloc_anon_folio(vmf);
> +	if (IS_ERR(folio))
> +		return 0;
>  	if (!folio)
>  		goto oom;
>
> +	nr_pages = folio_nr_pages(folio);
> +	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
> +
>  	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>  		goto oom_free_page;
>  	folio_throttle_swaprate(folio, GFP_KERNEL);
> @@ -4130,12 +4219,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  	if (vma->vm_flags & VM_WRITE)
>  		entry = pte_mkwrite(pte_mkdirty(entry), vma);
>
> -	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> -			&vmf->ptl);
> +	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>  	if (!vmf->pte)
>  		goto release;
> -	if (vmf_pte_changed(vmf)) {
> -		update_mmu_tlb(vma, vmf->address, vmf->pte);
> +	if (vmf_pte_range_changed(vmf, nr_pages)) {
> +		for (i = 0; i < nr_pages; i++)
> +			update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
>  		goto release;
>  	}
>
> @@ -4150,16 +4239,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  		return handle_userfault(vmf, VM_UFFD_MISSING);
>  	}
>
> -	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> -	folio_add_new_anon_rmap(folio, vma, vmf->address);
> +	folio_ref_add(folio, nr_pages - 1);
> +	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> +	folio_add_new_anon_rmap(folio, vma, addr);
>  	folio_add_lru_vma(folio, vma);
>  setpte:
>  	if (uffd_wp)
>  		entry = pte_mkuffd_wp(entry);
> -	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> +	set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
>
>  	/* No need to invalidate - it was non-present before */
> -	update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> +	update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
>  unlock:
>  	if (vmf->pte)
>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 5/9] mm: thp: Extend THP to allocate anonymous large folios
  2023-10-05 12:05       ` Daniel Gomez
@ 2023-10-05 12:49         ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-10-05 12:49 UTC (permalink / raw)
  To: Daniel Gomez, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Anshuman Khandual,
	Yang Shi, Huang, Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: linux-mm, linux-kernel, linux-arm-kernel

On 05/10/2023 13:05, Daniel Gomez wrote:
> On Fri, Sep 29, 2023 at 12:44:16PM +0100, Ryan Roberts wrote:
> 
> Hi Ryan,
>> Introduce the logic to allow THP to be configured (through the new
>> anon_orders interface we just added) to allocate large folios to back
>> anonymous memory, which are smaller than PMD-size (for example order-2,
>> order-3, order-4, etc).
>>
>> These THPs continue to be PTE-mapped, but in many cases can still
>> provide similar benefits to traditional PMD-sized THP: Page faults are
>> significantly reduced (by a factor of e.g. 4, 8, 16, etc. depending on
>> the configured order), but latency spikes are much less prominent
>> because the size of each page isn't as huge as the PMD-sized variant and
>> there is less memory to clear in each page fault. The number of per-page
>> operations (e.g. ref counting, rmap management, lru list management) are
>> also significantly reduced since those ops now become per-folio.
>>
>> Some architectures also employ TLB compression mechanisms to squeeze
>> more entries in when a set of PTEs are virtually and physically
>> contiguous and approporiately aligned. In this case, TLB misses will
>> occur less often.
>>
>> The new behaviour is disabled by default because the anon_orders
>> defaults to only enabling PMD-order, but can be enabled at runtime by
>> writing to anon_orders (see documentation in previous commit). The long
>> term aim is to default anon_orders to include suitable lower orders, but
>> there are some risks around internal fragmentation that need to be
>> better understood first.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  Documentation/admin-guide/mm/transhuge.rst |   9 +-
>>  include/linux/huge_mm.h                    |   6 +-
>>  mm/memory.c                                | 108 +++++++++++++++++++--
>>  3 files changed, 111 insertions(+), 12 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
>> index 9f954e73a4ca..732c3b2f4ba8 100644
>> --- a/Documentation/admin-guide/mm/transhuge.rst
>> +++ b/Documentation/admin-guide/mm/transhuge.rst
>> @@ -353,7 +353,9 @@ anonymous transparent huge pages, it is necessary to read
>>  ``/proc/PID/smaps`` and count the AnonHugePages and AnonHugePteMap
>>  fields for each mapping. Note that in both cases, AnonHugePages refers
>>  only to PMD-mapped THPs. AnonHugePteMap refers to THPs that are mapped
>> -using PTEs.
>> +using PTEs. This includes all THPs whose order is smaller than
>> +PMD-order, as well as any PMD-order THPs that happen to be PTE-mapped
>> +for other reasons.
>>
>>  The number of file transparent huge pages mapped to userspace is available
>>  by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
>> @@ -367,6 +369,11 @@ frequently will incur overhead.
>>  There are a number of counters in ``/proc/vmstat`` that may be used to
>>  monitor how successfully the system is providing huge pages for use.
>>
>> +.. note::
>> +   Currently the below counters only record events relating to
>> +   PMD-order THPs. Events relating to smaller order THPs are not
>> +   included.
>> +
>>  thp_fault_alloc
>>  	is incremented every time a huge page is successfully
>>  	allocated to handle a page fault.
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 2e7c338229a6..c4860476a1f5 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
>>  #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>>
>>  /*
>> - * Mask of all large folio orders supported for anonymous THP.
>> + * Mask of all large folio orders supported for anonymous THP; all orders up to
>> + * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
>> + * (which is a limitation of the THP implementation).
>>   */
>> -#define THP_ORDERS_ALL_ANON	BIT(PMD_ORDER)
>> +#define THP_ORDERS_ALL_ANON	((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
>>
>>  /*
>>   * Mask of all large folio orders supported for file THP.
>> diff --git a/mm/memory.c b/mm/memory.c
>> index b5b82fc8e164..92ed9c782dc9 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4059,6 +4059,87 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  	return ret;
>>  }
>>
>> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
>> +{
>> +	int i;
>> +
>> +	if (nr_pages == 1)
>> +		return vmf_pte_changed(vmf);
>> +
>> +	for (i = 0; i < nr_pages; i++) {
>> +		if (!pte_none(ptep_get_lockless(vmf->pte + i)))
>> +			return true;
>> +	}
>> +
>> +	return false;
>> +}
>> +
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>> +{
>> +	gfp_t gfp;
>> +	pte_t *pte;
>> +	unsigned long addr;
>> +	struct folio *folio;
>> +	struct vm_area_struct *vma = vmf->vma;
>> +	unsigned int orders;
>> +	int order;
>> +
>> +	/*
>> +	 * If uffd is active for the vma we need per-page fault fidelity to
>> +	 * maintain the uffd semantics.
>> +	 */
>> +	if (userfaultfd_armed(vma))
>> +		goto fallback;
>> +
>> +	/*
>> +	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
>> +	 * for this vma. Then filter out the orders that can't be allocated over
>> +	 * the faulting address and still be fully contained in the vma.
>> +	 */
>> +	orders = hugepage_vma_check(vma, vma->vm_flags, false, true, true,
>> +				    BIT(PMD_ORDER) - 1);
>> +	orders = transhuge_vma_suitable(vma, vmf->address, orders);
>> +
>> +	if (!orders)
>> +		goto fallback;
>> +
>> +	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>> +	if (!pte)
>> +		return ERR_PTR(-EAGAIN);
>> +
>> +	order = first_order(orders);
>> +	while (orders) {
>> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>> +		vmf->pte = pte + pte_index(addr);
>> +		if (!vmf_pte_range_changed(vmf, 1 << order))
>> +			break;
>> +		order = next_order(&orders, order);
>> +	}
>> +
>> +	vmf->pte = NULL;
>> +	pte_unmap(pte);
>> +
>> +	gfp = vma_thp_gfp_mask(vma);
>> +
>> +	while (orders) {
>> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>> +		folio = vma_alloc_folio(gfp, order, vma, addr, true);
> 
> I was checking your series and noticed the hugepage flag. I think
> you've changed it from false to true between v1 and v2 for orders >= 2,
> but I'm not sure about the reasoning. Is this because of your statement
> in the cover letter [1]?

That hugepage flag is spec'ed as follows:

 * @hugepage: For hugepages try only the preferred node if possible.

The intent of passing true for orders higher than 0 is that we would prefer to
allocate a smaller-order folio on the preferred node rather than a higher-order
folio on a non-preferred node. The assumption is that the ongoing cost of
accessing memory on the non-preferred node will outweigh the benefit of
allocating it as a high-order folio.
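
To spell that out against the loop in the patch (the code below is the same as
the hunk above; only the comments are mine, annotating the intended behaviour):

	while (orders) {
		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
		/*
		 * hugepage == true: try only the preferred NUMA node (if
		 * possible) for this order, rather than spilling the
		 * high-order allocation to a remote node.
		 */
		folio = vma_alloc_folio(gfp, order, vma, addr, true);
		if (folio) {
			clear_huge_page(&folio->page, addr, 1 << order);
			return folio;
		}
		/*
		 * Couldn't get this order locally; retry with the next
		 * smaller enabled order.
		 */
		order = next_order(&orders, order);
	}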

Thanks,
Ryan


> 
> [1] cover letter snippet:
> 
> "to implement variable order, large folios for anonymous memory.
> (previously called ..., but now exposed as an extension to THP;
> "small-order THP")"
> 
> Thanks,
> Daniel
> 
>> +		if (folio) {
>> +			clear_huge_page(&folio->page, addr, 1 << order);
>> +			return folio;
>> +		}
>> +		order = next_order(&orders, order);
>> +	}
>> +
>> +fallback:
>> +	return vma_alloc_zeroed_movable_folio(vma, vmf->address);
>> +}
>> +#else
>> +#define alloc_anon_folio(vmf) \
>> +		vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address)
>> +#endif
>> +
>>  /*
>>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>>   * but allow concurrent faults), and pte mapped but not yet locked.
>> @@ -4066,6 +4147,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>   */
>>  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  {
>> +	int i;
>> +	int nr_pages = 1;
>> +	unsigned long addr = vmf->address;
>>  	bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
>>  	struct vm_area_struct *vma = vmf->vma;
>>  	struct folio *folio;
>> @@ -4110,10 +4194,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  	/* Allocate our own private page. */
>>  	if (unlikely(anon_vma_prepare(vma)))
>>  		goto oom;
>> -	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
>> +	folio = alloc_anon_folio(vmf);
>> +	if (IS_ERR(folio))
>> +		return 0;
>>  	if (!folio)
>>  		goto oom;
>>
>> +	nr_pages = folio_nr_pages(folio);
>> +	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
>> +
>>  	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>>  		goto oom_free_page;
>>  	folio_throttle_swaprate(folio, GFP_KERNEL);
>> @@ -4130,12 +4219,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  	if (vma->vm_flags & VM_WRITE)
>>  		entry = pte_mkwrite(pte_mkdirty(entry), vma);
>>
>> -	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>> -			&vmf->ptl);
>> +	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>>  	if (!vmf->pte)
>>  		goto release;
>> -	if (vmf_pte_changed(vmf)) {
>> -		update_mmu_tlb(vma, vmf->address, vmf->pte);
>> +	if (vmf_pte_range_changed(vmf, nr_pages)) {
>> +		for (i = 0; i < nr_pages; i++)
>> +			update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
>>  		goto release;
>>  	}
>>
>> @@ -4150,16 +4239,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  		return handle_userfault(vmf, VM_UFFD_MISSING);
>>  	}
>>
>> -	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>> -	folio_add_new_anon_rmap(folio, vma, vmf->address);
>> +	folio_ref_add(folio, nr_pages - 1);
>> +	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>> +	folio_add_new_anon_rmap(folio, vma, addr);
>>  	folio_add_lru_vma(folio, vma);
>>  setpte:
>>  	if (uffd_wp)
>>  		entry = pte_mkuffd_wp(entry);
>> -	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>> +	set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
>>
>>  	/* No need to invalidate - it was non-present before */
>> -	update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
>> +	update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
>>  unlock:
>>  	if (vmf->pte)
>>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
>> --
>> 2.25.1
>>


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 5/9] mm: thp: Extend THP to allocate anonymous large folios
@ 2023-10-05 12:49         ` Ryan Roberts
  0 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-10-05 12:49 UTC (permalink / raw)
  To: Daniel Gomez, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Anshuman Khandual,
	Yang Shi, Huang, Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: linux-mm, linux-kernel, linux-arm-kernel

On 05/10/2023 13:05, Daniel Gomez wrote:
> On Fri, Sep 29, 2023 at 12:44:16PM +0100, Ryan Roberts wrote:
> 
> Hi Ryan,
>> Introduce the logic to allow THP to be configured (through the new
>> anon_orders interface we just added) to allocate large folios to back
>> anonymous memory, which are smaller than PMD-size (for example order-2,
>> order-3, order-4, etc).
>>
>> These THPs continue to be PTE-mapped, but in many cases can still
>> provide similar benefits to traditional PMD-sized THP: Page faults are
>> significantly reduced (by a factor of e.g. 4, 8, 16, etc. depending on
>> the configured order), but latency spikes are much less prominent
>> because the size of each page isn't as huge as the PMD-sized variant and
>> there is less memory to clear in each page fault. The number of per-page
>> operations (e.g. ref counting, rmap management, lru list management) are
>> also significantly reduced since those ops now become per-folio.
>>
>> Some architectures also employ TLB compression mechanisms to squeeze
>> more entries in when a set of PTEs are virtually and physically
>> contiguous and approporiately aligned. In this case, TLB misses will
>> occur less often.
>>
>> The new behaviour is disabled by default because the anon_orders
>> defaults to only enabling PMD-order, but can be enabled at runtime by
>> writing to anon_orders (see documentation in previous commit). The long
>> term aim is to default anon_orders to include suitable lower orders, but
>> there are some risks around internal fragmentation that need to be
>> better understood first.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  Documentation/admin-guide/mm/transhuge.rst |   9 +-
>>  include/linux/huge_mm.h                    |   6 +-
>>  mm/memory.c                                | 108 +++++++++++++++++++--
>>  3 files changed, 111 insertions(+), 12 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
>> index 9f954e73a4ca..732c3b2f4ba8 100644
>> --- a/Documentation/admin-guide/mm/transhuge.rst
>> +++ b/Documentation/admin-guide/mm/transhuge.rst
>> @@ -353,7 +353,9 @@ anonymous transparent huge pages, it is necessary to read
>>  ``/proc/PID/smaps`` and count the AnonHugePages and AnonHugePteMap
>>  fields for each mapping. Note that in both cases, AnonHugePages refers
>>  only to PMD-mapped THPs. AnonHugePteMap refers to THPs that are mapped
>> -using PTEs.
>> +using PTEs. This includes all THPs whose order is smaller than
>> +PMD-order, as well as any PMD-order THPs that happen to be PTE-mapped
>> +for other reasons.
>>
>>  The number of file transparent huge pages mapped to userspace is available
>>  by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
>> @@ -367,6 +369,11 @@ frequently will incur overhead.
>>  There are a number of counters in ``/proc/vmstat`` that may be used to
>>  monitor how successfully the system is providing huge pages for use.
>>
>> +.. note::
>> +   Currently the below counters only record events relating to
>> +   PMD-order THPs. Events relating to smaller order THPs are not
>> +   included.
>> +
>>  thp_fault_alloc
>>  	is incremented every time a huge page is successfully
>>  	allocated to handle a page fault.
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 2e7c338229a6..c4860476a1f5 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
>>  #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>>
>>  /*
>> - * Mask of all large folio orders supported for anonymous THP.
>> + * Mask of all large folio orders supported for anonymous THP; all orders up to
>> + * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
>> + * (which is a limitation of the THP implementation).
>>   */
>> -#define THP_ORDERS_ALL_ANON	BIT(PMD_ORDER)
>> +#define THP_ORDERS_ALL_ANON	((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
>>
>>  /*
>>   * Mask of all large folio orders supported for file THP.
>> diff --git a/mm/memory.c b/mm/memory.c
>> index b5b82fc8e164..92ed9c782dc9 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4059,6 +4059,87 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  	return ret;
>>  }
>>
>> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
>> +{
>> +	int i;
>> +
>> +	if (nr_pages == 1)
>> +		return vmf_pte_changed(vmf);
>> +
>> +	for (i = 0; i < nr_pages; i++) {
>> +		if (!pte_none(ptep_get_lockless(vmf->pte + i)))
>> +			return true;
>> +	}
>> +
>> +	return false;
>> +}
>> +
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>> +{
>> +	gfp_t gfp;
>> +	pte_t *pte;
>> +	unsigned long addr;
>> +	struct folio *folio;
>> +	struct vm_area_struct *vma = vmf->vma;
>> +	unsigned int orders;
>> +	int order;
>> +
>> +	/*
>> +	 * If uffd is active for the vma we need per-page fault fidelity to
>> +	 * maintain the uffd semantics.
>> +	 */
>> +	if (userfaultfd_armed(vma))
>> +		goto fallback;
>> +
>> +	/*
>> +	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
>> +	 * for this vma. Then filter out the orders that can't be allocated over
>> +	 * the faulting address and still be fully contained in the vma.
>> +	 */
>> +	orders = hugepage_vma_check(vma, vma->vm_flags, false, true, true,
>> +				    BIT(PMD_ORDER) - 1);
>> +	orders = transhuge_vma_suitable(vma, vmf->address, orders);
>> +
>> +	if (!orders)
>> +		goto fallback;
>> +
>> +	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>> +	if (!pte)
>> +		return ERR_PTR(-EAGAIN);
>> +
>> +	order = first_order(orders);
>> +	while (orders) {
>> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>> +		vmf->pte = pte + pte_index(addr);
>> +		if (!vmf_pte_range_changed(vmf, 1 << order))
>> +			break;
>> +		order = next_order(&orders, order);
>> +	}
>> +
>> +	vmf->pte = NULL;
>> +	pte_unmap(pte);
>> +
>> +	gfp = vma_thp_gfp_mask(vma);
>> +
>> +	while (orders) {
>> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>> +		folio = vma_alloc_folio(gfp, order, vma, addr, true);
> 
> I was checking your series and noticed about the hugepage flag. I think
> you've changed it from v1 -> v2 from being false to true when orders >=2
> but I'm not sure about the reasoning. Is this because of your statement
> in the cover letter [1]?

That hugepage flag is spec'ed as follows:

 * @hugepage: For hugepages try only the preferred node if possible.

The intent of passing true for orders higher than 0 is that we would prefer to
allocate a smaller order folio that is on the preferred node rather than a
higher order folio that is not on the preferred node. The assumption is that
the ongoing cost of accessing the memory on the non-preferred node will
outweigh the benefit of allocating it as a high order folio.

Thanks,
Ryan
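
(For readers following the allocation loop quoted above: first_order() and
next_order() are not defined in this hunk. Below is a minimal illustrative
sketch, inferred from how the loop consumes the orders bitfield -- highest
enabled order first, dropping one order per iteration -- not the posted
implementation:)

	/* Illustrative only; the real helpers live elsewhere in the series. */
	static inline int first_order(unsigned int orders)
	{
		/* Highest set bit == largest enabled order. */
		return fls(orders) - 1;
	}

	static inline int next_order(unsigned int *orders, int prev)
	{
		/* Drop the order we just tried, then pick the next largest. */
		*orders &= ~BIT(prev);
		return fls(*orders) - 1;
	}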


> 
> [1] cover letter snippet:
> 
> "to implement variable order, large folios for anonymous memory.
> (previously called ..., but now exposed as an extension to THP;
> "small-order THP")"
> 
> Thanks,
> Daniel
> 
>> +		if (folio) {
>> +			clear_huge_page(&folio->page, addr, 1 << order);
>> +			return folio;
>> +		}
>> +		order = next_order(&orders, order);
>> +	}
>> +
>> +fallback:
>> +	return vma_alloc_zeroed_movable_folio(vma, vmf->address);
>> +}
>> +#else
>> +#define alloc_anon_folio(vmf) \
>> +		vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address)
>> +#endif
>> +
>>  /*
>>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>>   * but allow concurrent faults), and pte mapped but not yet locked.
>> @@ -4066,6 +4147,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>   */
>>  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  {
>> +	int i;
>> +	int nr_pages = 1;
>> +	unsigned long addr = vmf->address;
>>  	bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
>>  	struct vm_area_struct *vma = vmf->vma;
>>  	struct folio *folio;
>> @@ -4110,10 +4194,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  	/* Allocate our own private page. */
>>  	if (unlikely(anon_vma_prepare(vma)))
>>  		goto oom;
>> -	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
>> +	folio = alloc_anon_folio(vmf);
>> +	if (IS_ERR(folio))
>> +		return 0;
>>  	if (!folio)
>>  		goto oom;
>>
>> +	nr_pages = folio_nr_pages(folio);
>> +	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
>> +
>>  	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>>  		goto oom_free_page;
>>  	folio_throttle_swaprate(folio, GFP_KERNEL);
>> @@ -4130,12 +4219,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  	if (vma->vm_flags & VM_WRITE)
>>  		entry = pte_mkwrite(pte_mkdirty(entry), vma);
>>
>> -	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>> -			&vmf->ptl);
>> +	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>>  	if (!vmf->pte)
>>  		goto release;
>> -	if (vmf_pte_changed(vmf)) {
>> -		update_mmu_tlb(vma, vmf->address, vmf->pte);
>> +	if (vmf_pte_range_changed(vmf, nr_pages)) {
>> +		for (i = 0; i < nr_pages; i++)
>> +			update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
>>  		goto release;
>>  	}
>>
>> @@ -4150,16 +4239,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  		return handle_userfault(vmf, VM_UFFD_MISSING);
>>  	}
>>
>> -	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>> -	folio_add_new_anon_rmap(folio, vma, vmf->address);
>> +	folio_ref_add(folio, nr_pages - 1);
>> +	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>> +	folio_add_new_anon_rmap(folio, vma, addr);
>>  	folio_add_lru_vma(folio, vma);
>>  setpte:
>>  	if (uffd_wp)
>>  		entry = pte_mkuffd_wp(entry);
>> -	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>> +	set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
>>
>>  	/* No need to invalidate - it was non-present before */
>> -	update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
>> +	update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
>>  unlock:
>>  	if (vmf->pte)
>>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
>> --
>> 2.25.1
>>
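
(A worked example of the THP_ORDERS_ALL_ANON change quoted above, assuming a
4K base page size so that PMD_ORDER == 9:)

	/* old: BIT(PMD_ORDER)             = 0x200 -> order-9 only              */
	/* new: (BIT(PMD_ORDER + 1) - 1)   = 0x3ff -> orders 0..9               */
	/*      & ~(BIT(0) | BIT(1))       = 0x3fc -> orders 2..9,              */
	/*                                   i.e. 16K..2M with 4K base pages    */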



^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 5/9] mm: thp: Extend THP to allocate anonymous large folios
  2023-10-05 12:49         ` Ryan Roberts
@ 2023-10-05 14:59           ` Daniel Gomez
  -1 siblings, 0 replies; 140+ messages in thread
From: Daniel Gomez @ 2023-10-05 14:59 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Anshuman Khandual,
	Yang Shi, Huang, Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: linux-mm, linux-kernel, linux-arm-kernel

On Thu, Oct 05, 2023 at 01:49:30PM +0100, Ryan Roberts wrote:
> On 05/10/2023 13:05, Daniel Gomez wrote:
> > On Fri, Sep 29, 2023 at 12:44:16PM +0100, Ryan Roberts wrote:
> >
> > Hi Ryan,
> >> Introduce the logic to allow THP to be configured (through the new
> >> anon_orders interface we just added) to allocate large folios to back
> >> anonymous memory, which are smaller than PMD-size (for example order-2,
> >> order-3, order-4, etc).
> >>
> >> These THPs continue to be PTE-mapped, but in many cases can still
> >> provide similar benefits to traditional PMD-sized THP: Page faults are
> >> significantly reduced (by a factor of e.g. 4, 8, 16, etc. depending on
> >> the configured order), but latency spikes are much less prominent
> >> because the size of each page isn't as huge as the PMD-sized variant and
> >> there is less memory to clear in each page fault. The number of per-page
> >> operations (e.g. ref counting, rmap management, lru list management) are
> >> also significantly reduced since those ops now become per-folio.
> >>
> >> Some architectures also employ TLB compression mechanisms to squeeze
> >> more entries in when a set of PTEs are virtually and physically
> >> contiguous and appropriately aligned. In this case, TLB misses will
> >> occur less often.
> >>
> >> The new behaviour is disabled by default because the anon_orders
> >> defaults to only enabling PMD-order, but can be enabled at runtime by
> >> writing to anon_orders (see documentation in previous commit). The long
> >> term aim is to default anon_orders to include suitable lower orders, but
> >> there are some risks around internal fragmentation that need to be
> >> better understood first.
> >>
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >> ---
> >>  Documentation/admin-guide/mm/transhuge.rst |   9 +-
> >>  include/linux/huge_mm.h                    |   6 +-
> >>  mm/memory.c                                | 108 +++++++++++++++++++--
> >>  3 files changed, 111 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> >> index 9f954e73a4ca..732c3b2f4ba8 100644
> >> --- a/Documentation/admin-guide/mm/transhuge.rst
> >> +++ b/Documentation/admin-guide/mm/transhuge.rst
> >> @@ -353,7 +353,9 @@ anonymous transparent huge pages, it is necessary to read
> >>  ``/proc/PID/smaps`` and count the AnonHugePages and AnonHugePteMap
> >>  fields for each mapping. Note that in both cases, AnonHugePages refers
> >>  only to PMD-mapped THPs. AnonHugePteMap refers to THPs that are mapped
> >> -using PTEs.
> >> +using PTEs. This includes all THPs whose order is smaller than
> >> +PMD-order, as well as any PMD-order THPs that happen to be PTE-mapped
> >> +for other reasons.
> >>
> >>  The number of file transparent huge pages mapped to userspace is available
> >>  by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
> >> @@ -367,6 +369,11 @@ frequently will incur overhead.
> >>  There are a number of counters in ``/proc/vmstat`` that may be used to
> >>  monitor how successfully the system is providing huge pages for use.
> >>
> >> +.. note::
> >> +   Currently the below counters only record events relating to
> >> +   PMD-order THPs. Events relating to smaller order THPs are not
> >> +   included.
> >> +
> >>  thp_fault_alloc
> >>  	is incremented every time a huge page is successfully
> >>  	allocated to handle a page fault.
> >> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >> index 2e7c338229a6..c4860476a1f5 100644
> >> --- a/include/linux/huge_mm.h
> >> +++ b/include/linux/huge_mm.h
> >> @@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
> >>  #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
> >>
> >>  /*
> >> - * Mask of all large folio orders supported for anonymous THP.
> >> + * Mask of all large folio orders supported for anonymous THP; all orders up to
> >> + * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
> >> + * (which is a limitation of the THP implementation).
> >>   */
> >> -#define THP_ORDERS_ALL_ANON	BIT(PMD_ORDER)
> >> +#define THP_ORDERS_ALL_ANON	((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
> >>
> >>  /*
> >>   * Mask of all large folio orders supported for file THP.
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index b5b82fc8e164..92ed9c782dc9 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -4059,6 +4059,87 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>  	return ret;
> >>  }
> >>
> >> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
> >> +{
> >> +	int i;
> >> +
> >> +	if (nr_pages == 1)
> >> +		return vmf_pte_changed(vmf);
> >> +
> >> +	for (i = 0; i < nr_pages; i++) {
> >> +		if (!pte_none(ptep_get_lockless(vmf->pte + i)))
> >> +			return true;
> >> +	}
> >> +
> >> +	return false;
> >> +}
> >> +
> >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >> +static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> >> +{
> >> +	gfp_t gfp;
> >> +	pte_t *pte;
> >> +	unsigned long addr;
> >> +	struct folio *folio;
> >> +	struct vm_area_struct *vma = vmf->vma;
> >> +	unsigned int orders;
> >> +	int order;
> >> +
> >> +	/*
> >> +	 * If uffd is active for the vma we need per-page fault fidelity to
> >> +	 * maintain the uffd semantics.
> >> +	 */
> >> +	if (userfaultfd_armed(vma))
> >> +		goto fallback;
> >> +
> >> +	/*
> >> +	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
> >> +	 * for this vma. Then filter out the orders that can't be allocated over
> >> +	 * the faulting address and still be fully contained in the vma.
> >> +	 */
> >> +	orders = hugepage_vma_check(vma, vma->vm_flags, false, true, true,
> >> +				    BIT(PMD_ORDER) - 1);
> >> +	orders = transhuge_vma_suitable(vma, vmf->address, orders);
> >> +
> >> +	if (!orders)
> >> +		goto fallback;
> >> +
> >> +	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> >> +	if (!pte)
> >> +		return ERR_PTR(-EAGAIN);
> >> +
> >> +	order = first_order(orders);
> >> +	while (orders) {
> >> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> >> +		vmf->pte = pte + pte_index(addr);
> >> +		if (!vmf_pte_range_changed(vmf, 1 << order))
> >> +			break;
> >> +		order = next_order(&orders, order);
> >> +	}
> >> +
> >> +	vmf->pte = NULL;
> >> +	pte_unmap(pte);
> >> +
> >> +	gfp = vma_thp_gfp_mask(vma);
> >> +
> >> +	while (orders) {
> >> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> >> +		folio = vma_alloc_folio(gfp, order, vma, addr, true);
> >
> > I was checking your series and noticed about the hugepage flag. I think
> > you've changed it from v1 -> v2 from being false to true when orders >=2
> > but I'm not sure about the reasoning. Is this because of your statement
> > in the cover letter [1]?
>
> That hugepage flags is spec'ed as follows:
>
>  * @hugepage: For hugepages try only the preferred node if possible.
>
> The intent of passing true for orders higher than 0, is that we would prefer to
> allocate a smaller order folio that is on the preferred node than a higher order
> folio that is not on the preferred node. The assumption is that the on-going
> cost of accessing the memory on the non-preferred node will outweigh the benefit
> of allocating it as a high order folio.
>
> Thanks,
> Ryan

I think I'm confused about the @hugepage name. I guess activating that
for any order >= 2 doesn't imply you are in fact allocating a huge page,
does it? I can see order is passed from vma_alloc_folio -> __folio_alloc
-> __alloc_pages, but I assumed (before reading your patch) you always
want this disabled except for HPAGE_PMD_ORDER allocation cases. But
order is not a limitation for the preferred node here, regardless of
whether this is a huge page or not.

I see the motivation, thanks for sharing.
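
(For reference: order and the hugepage hint are independent parameters of
vma_alloc_folio(), which is why passing true for non-PMD orders is legal --
it only biases node selection, it does not change the allocation size. The
sketch below paraphrases the kernel-doc line quoted above; check your tree
for the exact signature:)

	/* Call site in this patch: vma_alloc_folio(gfp, order, vma, addr, true)
	 *
	 * struct folio *vma_alloc_folio(gfp_t gfp, int order,
	 *                               struct vm_area_struct *vma,
	 *                               unsigned long addr, bool hugepage);
	 *
	 * @order:    size of the allocation (any order, not only PMD_ORDER)
	 * @hugepage: "For hugepages try only the preferred node if possible"
	 *            (a node-selection hint; it does not imply PMD size)
	 */
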
>
>
> >
> > [1] cover letter snippet:
> >
> > "to implement variable order, large folios for anonymous memory.
> > (previously called ..., but now exposed as an extension to THP;
> > "small-order THP")"
> >
> > Thanks,
> > Daniel
> >
> >> +		if (folio) {
> >> +			clear_huge_page(&folio->page, addr, 1 << order);
> >> +			return folio;
> >> +		}
> >> +		order = next_order(&orders, order);
> >> +	}
> >> +
> >> +fallback:
> >> +	return vma_alloc_zeroed_movable_folio(vma, vmf->address);
> >> +}
> >> +#else
> >> +#define alloc_anon_folio(vmf) \
> >> +		vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address)
> >> +#endif
> >> +
> >>  /*
> >>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
> >>   * but allow concurrent faults), and pte mapped but not yet locked.
> >> @@ -4066,6 +4147,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>   */
> >>  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >>  {
> >> +	int i;
> >> +	int nr_pages = 1;
> >> +	unsigned long addr = vmf->address;
> >>  	bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
> >>  	struct vm_area_struct *vma = vmf->vma;
> >>  	struct folio *folio;
> >> @@ -4110,10 +4194,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >>  	/* Allocate our own private page. */
> >>  	if (unlikely(anon_vma_prepare(vma)))
> >>  		goto oom;
> >> -	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> >> +	folio = alloc_anon_folio(vmf);
> >> +	if (IS_ERR(folio))
> >> +		return 0;
> >>  	if (!folio)
> >>  		goto oom;
> >>
> >> +	nr_pages = folio_nr_pages(folio);
> >> +	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
> >> +
> >>  	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
> >>  		goto oom_free_page;
> >>  	folio_throttle_swaprate(folio, GFP_KERNEL);
> >> @@ -4130,12 +4219,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >>  	if (vma->vm_flags & VM_WRITE)
> >>  		entry = pte_mkwrite(pte_mkdirty(entry), vma);
> >>
> >> -	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> >> -			&vmf->ptl);
> >> +	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
> >>  	if (!vmf->pte)
> >>  		goto release;
> >> -	if (vmf_pte_changed(vmf)) {
> >> -		update_mmu_tlb(vma, vmf->address, vmf->pte);
> >> +	if (vmf_pte_range_changed(vmf, nr_pages)) {
> >> +		for (i = 0; i < nr_pages; i++)
> >> +			update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
> >>  		goto release;
> >>  	}
> >>
> >> @@ -4150,16 +4239,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >>  		return handle_userfault(vmf, VM_UFFD_MISSING);
> >>  	}
> >>
> >> -	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> >> -	folio_add_new_anon_rmap(folio, vma, vmf->address);
> >> +	folio_ref_add(folio, nr_pages - 1);
> >> +	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> >> +	folio_add_new_anon_rmap(folio, vma, addr);
> >>  	folio_add_lru_vma(folio, vma);
> >>  setpte:
> >>  	if (uffd_wp)
> >>  		entry = pte_mkuffd_wp(entry);
> >> -	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> >> +	set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
> >>
> >>  	/* No need to invalidate - it was non-present before */
> >> -	update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> >> +	update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
> >>  unlock:
> >>  	if (vmf->pte)
> >>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
> >> --
> >> 2.25.1
> >>
>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-09-29 11:44 ` Ryan Roberts
@ 2023-10-06 20:06   ` David Hildenbrand
  -1 siblings, 0 replies; 140+ messages in thread
From: David Hildenbrand @ 2023-10-06 20:06 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: linux-mm, linux-kernel, linux-arm-kernel

On 29.09.23 13:44, Ryan Roberts wrote:
> Hi All,

Let me highlight some core decisions on the things discussed in the 
previous alignment meetings, and comment on them.

> 
> This is v6 of a series to implement variable order, large folios for anonymous
> memory. (previously called "ANON_LARGE_FOLIO", "LARGE_ANON_FOLIO",
> "FLEXIBLE_THP", but now exposed as an extension to THP; "small-order THP"). The
> objective of this is to improve performance by allocating larger chunks of
> memory during anonymous page faults:

Change number 1: Let's call these things THP.

Fine with me; I previously rooted for that but was told that end users 
could be confused. I think the important bit is that we don't mess up 
the stats, and when we talk about THP we default to "PMD-sized THP", 
unless we explicitly include the other ones.


I dislike exposing "orders" to the users, I'm happy to be convinced why 
I am wrong and it is a good idea.

So maybe "Small THP"/"Small-sized THP" is better. Or "Medium-sized THP" 
-- as said, I think FreeBSD tends to call it "Medium-sized superpages". 
But what's small/medium/large is debatable. "Small" implies at least 
that it's smaller than what we used to know, which is a fact.

Can we also now use the terminology consistently? (e.g., 
"variable-order, large folios for anonymous memory" -> "Small-sized 
anonymous THP", you can just point at the previous patch set name in the 
cover letter)

> 
> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
>     pages, there are efficiency savings to be had; fewer page faults, batched PTE
>     and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
>     overhead. This should benefit all architectures.
> 2) Since we are now mapping physically contiguous chunks of memory, we can take
>     advantage of HW TLB compression techniques. A reduction in TLB pressure
>     speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>     TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
> 
> The major change in this revision is the addition of sysfs controls to allow
> this "small-order THP" to be enabled/disabled/configured independently of
> PMD-order THP. The approach I've taken differs a bit from previous discussions;
> instead of creating a whole new interface ("large_folio"), I'm extending THP. I
> personally think this makes things clearer and more extensible. See [6] for
> detailed rationale.

Change 2: sysfs interface.

If we call it THP, it shall go under 
"/sys/kernel/mm/transparent_hugepage/", I agree.

What we expose there and how, is TBD. Again, not a friend of "orders" 
and bitmaps at all. We can do better if we want to go down that path.
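
(For concreteness, an illustration of the bitfield encoding being debated,
using the configurations from the cover letter and assuming a 4K base page
size:)

	/* One bit per enabled order; with 4K base pages:
	 * 16K = order-2, 32K = order-3, 64K = order-4, 2M (PMD) = order-9.
	 */
	unsigned int orders = BIT(2) | BIT(3) | BIT(4) | BIT(9);	/* 0x21c */
	/* e.g.: echo 0x21c >/sys/kernel/mm/transparent_hugepage/anon_orders */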

Maybe we should take a look at hugetlb, and how they added support for 
multiple sizes. What *might* make sense could be (depending on which 
values we actually support!)


/sys/kernel/mm/transparent_hugepage/hugepages-64kB/
/sys/kernel/mm/transparent_hugepage/hugepages-128kB/
/sys/kernel/mm/transparent_hugepage/hugepages-256kB/
/sys/kernel/mm/transparent_hugepage/hugepages-512kB/
/sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/

Each one would contain an "enabled" and "defrag" file. We want something 
minimal first? Start with the "enabled" option.


enabled: always [global] madvise never

Initially, we would set it for PMD-sized THP to "global" and for 
everything else to "never".
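
(A sketch of how that proposal might be exercised; the paths, defaults and
output shown are hypothetical, since this interface does not exist at this
point in the discussion:)

	cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
	always global madvise [never]
	echo always >/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
	cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
	always [global] madvise never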



That sounds reasonable at least to me, and we would be using what we 
learned from THP (as John suggested).  That still gives reasonable 
flexibility without going too wild, and a better IMHO interface.

I understand Yu's point about ABI discussions and "0 knobs". I'm happy 
as long as we can have something that won't hurt us later and still be 
able to use this in distributions within a reasonable timeframe. 
Enabling/disabling individual sizes does not sound too restrictive to 
me. And we could always add an "auto" setting later and default to that 
with a new kconfig knob.

If someone wants to configure it, why not. Let's just prepare a way to
handle this "better" automatically in the future (if ever ...).


Change 3: Stats

 > /proc/meminfo:
 >   Introduce new "AnonHugePteMap" field, which reports the amount of
 >   memory (in KiB) mapped from large folios globally (similar to
 >   AnonHugePages field).

AnonHugePages is and remains "PMD-sized THP that is mapped using a PMD", 
I think we all agree on that. It should have been named "AnonPmdMapped" 
or "AnonHugePmdMapped", too bad, we can't change that.

"AnonHugePteMap" better be "AnonHugePteMapped".

But, I wonder if we want to expose this "PteMapped" to user space *at 
all*. Why should they care if it's PTE mapped? For PMD-sized THP it 
makes a bit of sense, because !PMD implied !performance, and one might 
have been able to troubleshoot that somehow. For PTE-mapped, it doesn't 
make much sense really, they are always PTE-mapped.

That also raises the question of how you would account a PTE-mapped THP. 
The whole thing? Only the parts that are mapped? Let's better not go down 
that path.

That leaves the question why we would want to include them here at all 
in a special PTE-mapped way?


Again, let's look at hugetlb: I prepared 1 GiB and one 2 MiB page.

HugePages_Total:       1
HugePages_Free:        1
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:         1050624 kB

-> Only the last one gives the sum, the other stats don't even mention 
the other ones. [how do we get their stats, if at all?]
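
(Per-size hugetlb counters do exist, just not in /proc/meminfo -- they are
exposed per size under sysfs, for example:)

	/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
	/sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages
	/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages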

So maybe, we only want a summary of how many anon huge pages of any size 
are allocated (independent of the PTE vs. PMD mapping), and some other 
source to eventually inspect how the different sizes behave.

But note that for non-PMD-sized file THP we don't even have special 
counters! ... so maybe we should also defer any such stats and come up 
with something uniform for all types of non-PMD-sized THP.


Same discussion applies to all other stats.


> 
> Because we now have runtime enable/disable control, I've removed the compile
> time Kconfig switch. It still defaults to runtime-disabled.
> 
> NOTE: These changes should not be merged until the prerequisites are complete.
> These are in progress and tracked at [7].

We should probably list them here, and classify which ones we see as a
strict requirement and which ones might be an optimization.


Now, these are just my thoughts, and I'm happy about other thoughts.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 6/9] mm: thp: Add "recommend" option for anon_orders
  2023-09-29 11:44   ` Ryan Roberts
@ 2023-10-06 20:08     ` David Hildenbrand
  -1 siblings, 0 replies; 140+ messages in thread
From: David Hildenbrand @ 2023-10-06 20:08 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: linux-mm, linux-kernel, linux-arm-kernel

On 29.09.23 13:44, Ryan Roberts wrote:
> In addition to passing a bitfield of folio orders to enable for THP,
> allow the string "recommend" to be written, which has the effect of
> causing the system to enable the orders preferred by the architecture
> and by the mm. The user can see what these orders are by subsequently
> reading back the file.
> 
> Note that these recommended orders are expected to be static for a given
> boot of the system, and so the keyword "auto" was deliberately not used,
> as I want to reserve it for a possible future use where the "best" order
> is chosen more dynamically at runtime.
> 
> Recommended orders are determined as follows:
>    - PMD_ORDER: The traditional THP size
>    - arch_wants_pte_order() if implemented by the arch
>    - PAGE_ALLOC_COSTLY_ORDER: The largest order kept on per-cpu free list
> 
> arch_wants_pte_order() can be overridden by the architecture if desired.
> Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
> set of ptes map physically contiguous, naturally aligned memory, so this
> mechanism allows the architecture to optimize as required.
> 
> Here we add the default implementation of arch_wants_pte_order(), used
> when the architecture does not define it, which returns -1, implying
> that the HW has no preference.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>   Documentation/admin-guide/mm/transhuge.rst |  4 ++++
>   include/linux/pgtable.h                    | 13 +++++++++++++
>   mm/huge_memory.c                           | 14 +++++++++++---
>   3 files changed, 28 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index 732c3b2f4ba8..d6363d4efa3a 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -187,6 +187,10 @@ pages (=16K if the page size is 4K). The example above enables order-9
>   By enabling multiple orders, allocation of each order will be
>   attempted, highest to lowest, until a successful allocation is made.
>   If the PMD-order is unset, then no PMD-sized THPs will be allocated.
> +It is also possible to enable the recommended set of orders, which
> +will be optimized for the architecture and mm::
> +
> +	echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders
>   
>   The kernel will ignore any orders that it does not support so read the
>   file back to determine which orders are enabled::
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index af7639c3b0a3..0e110ce57cc3 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -393,6 +393,19 @@ static inline void arch_check_zapped_pmd(struct vm_area_struct *vma,
>   }
>   #endif
>   
> +#ifndef arch_wants_pte_order
> +/*
> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
> + * PMD_ORDER) and must not be order-1 since THP requires large folios to be at
> + * least order-2. Negative value implies that the HW has no preference and mm
> + * will choose it's own default order.
> + */
> +static inline int arch_wants_pte_order(void)
> +{
> +	return -1;
> +}
> +#endif
> +
>   #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>   static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>   				       unsigned long address,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index bcecce769017..e2e2d3906a21 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -464,10 +464,18 @@ static ssize_t anon_orders_store(struct kobject *kobj,
>   	int err;
>   	int ret = count;
>   	unsigned int orders;
> +	int arch;
>   
> -	err = kstrtouint(buf, 0, &orders);
> -	if (err)
> -		ret = -EINVAL;
> +	if (sysfs_streq(buf, "recommend")) {
> +		arch = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
> +		orders = BIT(arch);
> +		orders |= BIT(PAGE_ALLOC_COSTLY_ORDER);
> +		orders |= BIT(PMD_ORDER);
> +	} else {
> +		err = kstrtouint(buf, 0, &orders);
> +		if (err)
> +			ret = -EINVAL;
> +	}
>   
>   	if (ret > 0) {
>   		orders &= THP_ORDERS_ALL_ANON;

:/ don't really like that. Regarding my proposal, one could have 
something like that in an "auto" setting for the "enabled" value, or a 
"recommended" setting [not sure].

-- 
Cheers,

David / dhildenb
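
(For reference, a worked example of what "recommend" resolves to with the
defaults visible in the hunk quoted above -- no arch override, so
arch_wants_pte_order() == -1 -- and assuming 4K base pages, where
PAGE_ALLOC_COSTLY_ORDER == 3 and PMD_ORDER == 9:)

	/* arch   = max(-1, 3)               = 3
	 * orders = BIT(3) | BIT(3) | BIT(9) = 0x208  -> 32K and 2M enabled
	 */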


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 6/9] mm: thp: Add "recommend" option for anon_orders
  2023-10-06 20:08     ` David Hildenbrand
@ 2023-10-06 22:28       ` Yu Zhao
  -1 siblings, 0 replies; 140+ messages in thread
From: Yu Zhao @ 2023-10-06 22:28 UTC (permalink / raw)
  To: David Hildenbrand, Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Yin Fengwei, Catalin Marinas,
	Anshuman Khandual, Yang Shi, Huang, Ying, Zi Yan,
	Luis Chamberlain, Itaru Kitayama, Kirill A. Shutemov,
	John Hubbard, David Rientjes, Vlastimil Babka, Hugh Dickins,
	linux-mm, linux-kernel, linux-arm-kernel

On Fri, Oct 6, 2023 at 2:08 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 29.09.23 13:44, Ryan Roberts wrote:
> > In addition to passing a bitfield of folio orders to enable for THP,
> > allow the string "recommend" to be written, which has the effect of
> > causing the system to enable the orders preferred by the architecture
> > and by the mm. The user can see what these orders are by subsequently
> > reading back the file.
> >
> > Note that these recommended orders are expected to be static for a given
> > boot of the system, and so the keyword "auto" was deliberately not used,
> > as I want to reserve it for a possible future use where the "best" order
> > is chosen more dynamically at runtime.
> >
> > Recommended orders are determined as follows:
> >    - PMD_ORDER: The traditional THP size
> >    - arch_wants_pte_order() if implemented by the arch
> >    - PAGE_ALLOC_COSTLY_ORDER: The largest order kept on per-cpu free list
> >
> > arch_wants_pte_order() can be overridden by the architecture if desired.
> > Some architectures (e.g. arm64) can coalsece TLB entries if a contiguous
> > set of ptes map physically contigious, naturally aligned memory, so this
> > mechanism allows the architecture to optimize as required.
> >
> > Here we add the default implementation of arch_wants_pte_order(), used
> > when the architecture does not define it, which returns -1, implying
> > that the HW has no preference.
> >
> > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> > ---
> >   Documentation/admin-guide/mm/transhuge.rst |  4 ++++
> >   include/linux/pgtable.h                    | 13 +++++++++++++
> >   mm/huge_memory.c                           | 14 +++++++++++---
> >   3 files changed, 28 insertions(+), 3 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> > index 732c3b2f4ba8..d6363d4efa3a 100644
> > --- a/Documentation/admin-guide/mm/transhuge.rst
> > +++ b/Documentation/admin-guide/mm/transhuge.rst
> > @@ -187,6 +187,10 @@ pages (=16K if the page size is 4K). The example above enables order-9
> >   By enabling multiple orders, allocation of each order will be
> >   attempted, highest to lowest, until a successful allocation is made.
> >   If the PMD-order is unset, then no PMD-sized THPs will be allocated.
> > +It is also possible to enable the recommended set of orders, which
> > +will be optimized for the architecture and mm::
> > +
> > +     echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders
> >
> >   The kernel will ignore any orders that it does not support so read the
> >   file back to determine which orders are enabled::
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index af7639c3b0a3..0e110ce57cc3 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -393,6 +393,19 @@ static inline void arch_check_zapped_pmd(struct vm_area_struct *vma,
> >   }
> >   #endif
> >
> > +#ifndef arch_wants_pte_order
> > +/*
> > + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
> > + * PMD_ORDER) and must not be order-1 since THP requires large folios to be at
> > + * least order-2. Negative value implies that the HW has no preference and mm
> > + * will choose its own default order.
> > + */
> > +static inline int arch_wants_pte_order(void)
> > +{
> > +     return -1;
> > +}
> > +#endif
> > +
> >   #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
> >   static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
> >                                      unsigned long address,
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index bcecce769017..e2e2d3906a21 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -464,10 +464,18 @@ static ssize_t anon_orders_store(struct kobject *kobj,
> >       int err;
> >       int ret = count;
> >       unsigned int orders;
> > +     int arch;
> >
> > -     err = kstrtouint(buf, 0, &orders);
> > -     if (err)
> > -             ret = -EINVAL;
> > +     if (sysfs_streq(buf, "recommend")) {
> > +             arch = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
> > +             orders = BIT(arch);
> > +             orders |= BIT(PAGE_ALLOC_COSTLY_ORDER);
> > +             orders |= BIT(PMD_ORDER);
> > +     } else {
> > +             err = kstrtouint(buf, 0, &orders);
> > +             if (err)
> > +                     ret = -EINVAL;
> > +     }
> >
> >       if (ret > 0) {
> >               orders &= THP_ORDERS_ALL_ANON;
>
> :/ don't really like that. Regarding my proposal, one could have
> something like that in an "auto" setting for the "enabled" value, or a
> "recommended" setting [not sure].

Me neither.

Again, this is something I call random -- we only discussed "auto",
and yes, the commit message above explains why "recommend" is used here,
but it has never surfaced in previous discussions, has it?

If so, this reinforces what I said here [1].

[1] https://lore.kernel.org/mm-commits/CAOUHufYEKx5_zxRJkeqrmnStFjR+pVQdpZ40ATSTaxLA_iRPGw@mail.gmail.com/

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 4/9] mm: thp: Introduce anon_orders and anon_always_mask sysfs files
  2023-09-29 22:55     ` Andrew Morton
  (?)
@ 2023-10-07 22:54       ` Michael Ellerman
  -1 siblings, 0 replies; 140+ messages in thread
From: Michael Ellerman @ 2023-10-07 22:54 UTC (permalink / raw)
  To: Andrew Morton, Ryan Roberts, Aneesh Kumar K.V
  Cc: Matthew Wilcox, Yin Fengwei, David Hildenbrand, Yu Zhao,
	Catalin Marinas, Anshuman Khandual, Yang Shi, Huang, Ying,
	Zi Yan, Luis Chamberlain, Itaru Kitayama, Kirill A. Shutemov,
	John Hubbard, David Rientjes, Vlastimil Babka, Hugh Dickins,
	linux-mm, linux-kernel, linux-arm-kernel, linuxppc-dev

Andrew Morton <akpm@linux-foundation.org> writes:
> On Fri, 29 Sep 2023 12:44:15 +0100 Ryan Roberts <ryan.roberts@arm.com> wrote:
>
>> In preparation for adding support for anonymous large folios that are
>> smaller than the PMD-size, introduce 2 new sysfs files that will be used
>> to control the new behaviours via the transparent_hugepage interface.
>> For now, the kernel still only supports PMD-order anonymous THP, so when
>> reading back anon_orders, it will reflect that. Therefore there are no
>> behavioural changes intended here.
>
> powerpc strikes again.  ARCH=powerpc allmodconfig:
>
>
> In file included from ./include/linux/bits.h:6,
>                  from ./include/linux/ratelimit_types.h:5,
>                  from ./include/linux/printk.h:9,
>                  from ./include/asm-generic/bug.h:22,
>                  from ./arch/powerpc/include/asm/bug.h:116,
>                  from ./include/linux/bug.h:5,
>                  from ./include/linux/mmdebug.h:5,
>                  from ./include/linux/mm.h:6,
>                  from mm/huge_memory.c:8:
> ./include/vdso/bits.h:7:33: error: initializer element is not constant
>     7 | #define BIT(nr)                 (UL(1) << (nr))
>       |                                 ^
> mm/huge_memory.c:77:47: note: in expansion of macro 'BIT'
>    77 | unsigned int huge_anon_orders __read_mostly = BIT(PMD_ORDER);
>       |                                               ^~~
>
> We keep tripping over this.  I wish there was a way to fix it.

I can't think of any solution, other than ripping the code out.

To catch it earlier we'd need a generic compile-time test that all values
derived from the page table geometry are only used in places that don't
require a constant. I can't think of a way to write a test for that.

Or submitters could compile-test for powerpc - one can dream :D

> Style whine: an all-caps identifier is supposed to be a constant,
> dammit.
>
> 	#define PTE_INDEX_SIZE  __pte_index_size
>
> Nope.

I agree it's ugly. It was done that way because PTE_INDEX_SIZE used to
be constant, and still is for 32-bit PPC and 64-bit Book3E PPC.

We could rename PTE_INDEX_SIZE itself, but we'd still have e.g.
PTE_TABLE_SIZE which is used in generic code, and which would be
sometimes constant and sometimes not for different powerpc subarches.

> I did this:
>
> --- a/mm/huge_memory.c~mm-thp-introduce-anon_orders-and-anon_always_mask-sysfs-files-fix
> +++ a/mm/huge_memory.c
> @@ -74,7 +74,7 @@ static unsigned long deferred_split_scan
>  static atomic_t huge_zero_refcount;
>  struct page *huge_zero_page __read_mostly;
>  unsigned long huge_zero_pfn __read_mostly = ~0UL;
> -unsigned int huge_anon_orders __read_mostly = BIT(PMD_ORDER);
> +unsigned int huge_anon_orders __read_mostly;
>  static unsigned int huge_anon_always_mask __read_mostly;
>  
>  /**
> @@ -528,6 +528,9 @@ static int __init hugepage_init_sysfs(st
>  {
>  	int err;
>  
> +	/* powerpc's PMD_ORDER isn't a compile-time constant */
> +	huge_anon_orders = BIT(PMD_ORDER);
> +
>  	*hugepage_kobj = kobject_create_and_add("transparent_hugepage", mm_kobj);
>  	if (unlikely(!*hugepage_kobj)) {
>  		pr_err("failed to create transparent hugepage kobject\n");
> _
>
>
> I assume this is set up early enough.

Yes it should be.

> I don't know why powerpc's PTE_INDEX_SIZE is variable.

To allow a single vmlinux to boot using either the Hashed Page Table
MMU, or Radix Tree MMU, which have different page table geometry.

That's a pretty crucial feature for distros, so that they can build a
single kernel to boot on Power8/9/10.
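
For illustration, a minimal user-space sketch of the same pattern (not from
the thread; all names are stand-ins, and pte_index_size plays the role of
powerpc's boot-time __pte_index_size). The point is simply that an all-caps
"constant" which is really resolved at boot cannot appear in a file-scope
initializer, so the BIT() computation has to move into an init function, as
in the fix quoted above:

#include <stdio.h>

static unsigned int pte_index_size;	/* resolved at "boot", not at compile time */
#define PMD_ORDER (pte_index_size)	/* all-caps, but not a constant expression */

/*
 * unsigned int huge_anon_orders = 1u << PMD_ORDER;
 *	^ this is the line the compiler rejects:
 *	  "initializer element is not constant"
 */
static unsigned int huge_anon_orders;	/* so leave it zero-initialized ... */

static void hugepage_init_sysfs_like(void)
{
	/* ... and compute BIT(PMD_ORDER) at init time instead */
	huge_anon_orders = 1u << PMD_ORDER;
}

int main(void)
{
	pte_index_size = 9;	/* pretend the boot-time MMU probe chose this geometry */
	hugepage_init_sysfs_like();
	printf("huge_anon_orders = 0x%x\n", huge_anon_orders);
	return 0;
}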

cheers

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-10-06 20:06   ` David Hildenbrand
@ 2023-10-09 11:28     ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-10-09 11:28 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: linux-mm, linux-kernel, linux-arm-kernel

On 06/10/2023 21:06, David Hildenbrand wrote:
> On 29.09.23 13:44, Ryan Roberts wrote:
>> Hi All,
> 
> Let me highlight some core decisions on the things discussed in the previous
> alignment meetings, and comment on them.
> 
>>
>> This is v6 of a series to implement variable order, large folios for anonymous
>> memory. (previously called "ANON_LARGE_FOLIO", "LARGE_ANON_FOLIO",
>> "FLEXIBLE_THP", but now exposed as an extension to THP; "small-order THP"). The
>> objective of this is to improve performance by allocating larger chunks of
>> memory during anonymous page faults:
> 
> Change number 1: Let's call these things THP.
> 
> Fine with me; I previously rooted for that but was told that end users could be
> confused. I think the important bit is that we don't mess up the stats, and when
> we talk about THP we default to "PMD-sized THP", unless we explicitly include
> the other ones.
> 
> 
> I dislike exposing "orders" to the users, I'm happy to be convinced why I am
> wrong and it is a good idea.
> 
> So maybe "Small THP"/"Small-sized THP" is better. Or "Medium-sized THP" -- as
> said, I think FreeBSD tends to call it "Medium-sized superpages". But what's
> small/medium/large is debatable. "Small" implies at least that it's smaller than
> what we used to know, which is a fact.
> 
> Can we also now use the terminology consistently? (e.g., "variable-order, large
> folios for anonymous memory" -> "Small-sized anonymous THP", you can just point
> at the previous patch set name in the cover letter)

Yes absolutely. FWIW, I was deliberately not changing the title of the patchset
so people could easily see it was an evolution of something posted before. But
if it's the norm to change the title as the patchset evolves, I'm very happy to
do that. And there are other places too, in commit logs that I can tidy up. I
will assume "PMD-sized THP", "small-sized THP" and "anonymous small-sized THP"
(that last one slightly different from what David suggested above - it means
"small-sized THP" can still be grepped) unless others object.

> 
>>
>> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
>>     pages, there are efficiency savings to be had; fewer page faults, batched PTE
>>     and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
>>     overhead. This should benefit all architectures.
>> 2) Since we are now mapping physically contiguous chunks of memory, we can take
>>     advantage of HW TLB compression techniques. A reduction in TLB pressure
>>     speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>>     TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>>
>> The major change in this revision is the addition of sysfs controls to allow
>> this "small-order THP" to be enabled/disabled/configured independently of
>> PMD-order THP. The approach I've taken differs a bit from previous discussions;
>> instead of creating a whole new interface ("large_folio"), I'm extending THP. I
>> personally think this makes things clearer and more extensible. See [6] for
>> detailed rationale.
> 
> Change 2: sysfs interface.
> 
> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
> agree.
> 
> What we expose there and how, is TBD. Again, not a friend of "orders" and
> bitmaps at all. We can do better if we want to go down that path.
> 
> Maybe we should take a look at hugetlb, and how they added support for multiple
> sizes. What *might* make sense could be (depending on which values we actually
> support!)
> 
> 
> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
> 
> Each one would contain an "enabled" and "defrag" file. We want something minimal
> first? Start with the "enabled" option.
> 
> 
> enabled: always [global] madvise never
> 
> Initially, we would set it for PMD-sized THP to "global" and for everything else
> to "never".

My only reservation about this approach is the potential for a future need for a
"one setting applied across all sizes" class of control (e.g. "auto"). I think
we agreed in the previous meetings that chasing a solution for "auto" was a good
aspiration to have, so it would be good to have a place where we can insert that
in the future. The main reason why I chose to expose the "anon_orders" control is
because it is possible to both enable/disable the various sizes as well as
specify (e.g.) "auto", without creating redundancy. But I agree that ideally
we wouldn't expose orders to the user; I was attempting a compromise to simplify
the "auto" case.

A potential (though it feels quite complex) solution to make auto work with your
proposal: add "auto" as an option to the existing global enabled file, and to
all of your proposed new enabled files. But it's only possible to *set* auto
through the global file. And when it is set, all of the size-specific enabled
files read back "auto" too. And any writes to the size-specific enabled files
are ignored (or remembered but not enacted) until the global enabled file is
changed away from auto.

But I'm not sure if adding a new option to the global enabled file might break
compat?

> 
> 
> 
> That sounds reasonable at least to me, and we would be using what we learned
> from THP (as John suggested).  That still gives reasonable flexibility without
> going too wild, and a better IMHO interface.
> 
> I understand Yu's point about ABI discussions and "0 knobs". I'm happy as long
> as we can have something that won't hurt us later and still be able to use this
> in distributions within a reasonable timeframe. Enabling/disabling individual
> sizes does not sound too restrictive to me. And we could always add an "auto"
> setting later and default to that with a new kconfig knob.
> 
> If someone wants to configure it, why not. Let's just prepare a way to handle
> this "better" automatically in the future (if ever ...).
> 
> 
> Change 3: Stats
> 
>> /proc/meminfo:
>>   Introduce new "AnonHugePteMap" field, which reports the amount of
>>   memory (in KiB) mapped from large folios globally (similar to
>>   AnonHugePages field).
> 
> AnonHugePages is and remains "PMD-sized THP that is mapped using a PMD", I think
> we all agree on that. It should have been named "AnonPmdMapped" or
> "AnonHugePmdMapped", too bad, we can't change that.

Yes agreed. I did consider redefining "AnonHugePages" to cover PMD- and
PTE-mapped memory, then introducing both an "AnonHugePmdMapped" and
"AnonHugePteMapped", but I think that would likely break things. It's further
complicated because vmstats prints it in PMD-size units, so it can't represent
PTE-mapped memory in that counter.
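
As a quick worked illustration of that unit problem (illustrative numbers
only, assuming 4K base pages and 2M PMDs): a PTE-mapped 64K folio is 16 base
pages, which is a fraction of one PMD-sized unit and so truncates to zero.

#include <stdio.h>

int main(void)
{
	long pages_per_pmd = 512;	/* 2M / 4K */
	long folio_pages = 16;		/* one 64K folio */

	/* prints "0 PMD units" - the folio is invisible in PMD-sized units */
	printf("%ld PMD units\n", folio_pages / pages_per_pmd);
	return 0;
}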

> 
> "AnonHugePteMap" better be "AnonHugePteMapped".

I agree, but I went with the shorter one because anything longer would unalign
the value, e.g.:

    AnonHugePages:         0 kB
    AnonHugePteMapped:        0 kB
    ShmemPmdMapped:        0 kB
    Shared_Hugetlb:        0 kB

So would need to decide which is preferable, or come up with a shorter name.

> 
> But, I wonder if we want to expose this "PteMapped" to user space *at all*. Why
> should they care if it's PTE mapped? For PMD-sized THP it makes a bit of sense,
> because !PMD implied !performance, and one might have been able to troubleshoot
> that somehow. For PTE-mapped, it doesn't make much sense really, they are always
> PTE-mapped.

I disagree; I've been using it a lot to debug performance issues. It tells you
how much of your anon memory is allocated with large folios. And making that
percentage bigger improves performance; fewer page faults, and with a separate
contpte series on arm64, better use of the TLB. Reasons might include: poorly
aligned/too small VMAs, memory fragmentation preventing allocation, CoW, etc.

I would actually argue for adding similar counters for file-backed memory too
for the same reasons. (I actually posted an independent patch a while back that
did this for file- and anon-memory, bucketed by size. But I think the idea of
the bucketing was NAKed.)

> 
> That also raises the question of how you would account a PTE-mapped THP. The whole
> thing? Only the parts that are mapped? Let's better not go down that path.

The approach I've taken in this series is the simple one - account every page
that belongs to a large folio from when it is first mapped to last unmapped.
Yes, in this case, you might not actually be mapping the full thing
contiguously. But it gives a good indication.

I also considered accounting the whole folio only when all of its pages become
mapped (although not worrying about them all being contiguous). That's still
simple to implement for all counters except smaps. So I went with the simplest
approach, with the view that it's "good enough".
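
As a rough sketch of the first scheme (account the whole folio from first map
to last unmap), something like the helper below could sit on the rmap map/unmap
paths. This is hypothetical, not the actual patch: the helper, the first/last
flags and the NR_ANON_THPS_PTEMAPPED counter name are all made up here.

/* Hypothetical sketch: account the whole large folio when its first page is
 * mapped, and un-account it when its last page is unmapped. */
static void anon_huge_pte_account(struct folio *folio, bool first, bool last)
{
	long nr = folio_nr_pages(folio);

	if (!folio_test_large(folio))
		return;		/* base pages are not counted here */

	if (first)
		mod_node_page_state(folio_pgdat(folio), NR_ANON_THPS_PTEMAPPED, nr);
	else if (last)
		mod_node_page_state(folio_pgdat(folio), NR_ANON_THPS_PTEMAPPED, -nr);
}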

> 
> That leaves the question why we would want to include them here at all in a
> special PTE-mapped way?
> 
> 
> Again, let's look at hugetlb: I prepared 1 GiB and one 2 MiB page.
> 
> HugePages_Total:       1
> HugePages_Free:        1
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:       2048 kB
> Hugetlb:         1050624 kB
> 
> -> Only the last one gives the sum, the other stats don't even mention the other
> ones. [how do we get their stats, if at all?]

There are some files in /sys/kernel/mm/hugepages/hugepages-XXkB and
/sys/devices/system/node/node*/hugepages/; nr_hugepages, free_hugepages,
surplus_hugepages. But this interface also constitutes the allocator, not just
stats, I think.

> 
> So maybe, we only want a summary of how many anon huge pages of any size are
> allocated (independent of the PTE vs. PMD mapping), 

Are you proposing (AnonHugePages + AnonHugePteMapped) here or something else? If
the former, then I don't really see the difference. We have to continue to
expose PMD-size (AnonHugePages). So either add a PTE-only counter and derive the
total, or add a total counter and derive PTE-only. I suspect I've misunderstood
your point.

> and some other source to
> eventually inspect how the different sizes behave.
> 
> But note that for non-PMD-sized file THP we don't even have special counters!
> ... so maybe we should also defer any such stats and come up with something
> uniform for all types of non-PMD-sized THP.

Indeed, I can see benefit in adding these for file THP - in fact I have a patch
that does exactly that to help my development work. I had envisaged that we
could add something like FileHugePteMapped, ShmemHugePteMapped that would follow
the same semantics as AnonHugePteMapped.

> 
> 
> Same discussion applies to all other stats.
> 
> 
>>
>> Because we now have runtime enable/disable control, I've removed the compile
>> time Kconfig switch. It still defaults to runtime-disabled.
>>
>> NOTE: These changes should not be merged until the prerequisites are complete.
>> These are in progress and tracked at [7].
> 
> We should probably list them here, and classify which ones we see as a strict
> requirement, and which ones might be an optimization.


I'll need some help with classifying them, so I'm showing my working. The final
list that I would propose as strict requirements is at the bottom.

This is my list with status, as per my response to Yu in the other thread:

  - David is working on "shared vs exclusive mappings"
  - Zi Yan has posted an RFC for compaction
  - Yin Fengwei's mlock series is now in mm-stable
  - Yin Fengwei's madvise series is in 6.6
  - I've reworked and posted a series for deferred_split_folio; although I've
    deprioritised it because Yu said it wasn't really a pre-requisite.
  - numa balancing depends on David's "shared vs exclusive mappings" work
  - I've started looking at "large folios in swap cache" in the background,
    because I'm seeing some slow down with large folios, but we also agreed that
    wasn't a prerequisite

Although, since sending that, I've determined that when running kernel
compilation across a high number of cores on arm64, the cost of splitting the
folios gets large due to needing to broadcast the extra TLBIs. So I think the
last point on that list may be a prerequisite after all. (I've been able to fix
this by adding support for allocating large folios in the swap file, and
avoiding the split - planning to send RFC this week).

There is also this set of things that you mentioned against "shared vs exclusive
mappings", which I'm not sure if you are planning to cover as part of your work
or if they are follow on things that will need to be done:

(1) Detecting shared folios, to not mess with them while they are shared.
    MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ...
    replace cases where folio_estimated_sharers() == 1 would currently be the
    best we can do (and in some cases, page_mapcount() == 1).

And I recently discovered that khugepaged doesn't collapse file-backed pages to
a PMD-size THP if they belong to a large folio, so I'm guessing it may also
suffer the same behaviour for anon memory. I'm not sure if that's what your
"khugepaged ..." comment refers to?

So taking all that and trying to put together a complete outstanding list for
strict requirements:

  - Shared vs Exclusive Mappings (DavidH)
      - user-triggered page migration
      - NUMA hinting/balancing
      - Enhance khugepaged to collapse to PMD-size from PTE-mapped large folios
  - Compaction of Large Folios (Zi Yan)
  - Swap out small-size THP without Split (Ryan Roberts)


> 
> 
> Now, these are just my thoughts, and I'm happy about other thoughts.

As always, thanks for taking the time - I really appreciate it.

Thanks,
Ryan



^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 6/9] mm: thp: Add "recommend" option for anon_orders
  2023-10-06 22:28       ` Yu Zhao
@ 2023-10-09 11:45         ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-10-09 11:45 UTC (permalink / raw)
  To: Yu Zhao, David Hildenbrand
  Cc: Andrew Morton, Matthew Wilcox, Yin Fengwei, Catalin Marinas,
	Anshuman Khandual, Yang Shi, Huang, Ying, Zi Yan,
	Luis Chamberlain, Itaru Kitayama, Kirill A. Shutemov,
	John Hubbard, David Rientjes, Vlastimil Babka, Hugh Dickins,
	linux-mm, linux-kernel, linux-arm-kernel

On 06/10/2023 23:28, Yu Zhao wrote:
> On Fri, Oct 6, 2023 at 2:08 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 29.09.23 13:44, Ryan Roberts wrote:
>>> In addition to passing a bitfield of folio orders to enable for THP,
>>> allow the string "recommend" to be written, which has the effect of
>>> causing the system to enable the orders preferred by the architecture
>>> and by the mm. The user can see what these orders are by subsequently
>>> reading back the file.
>>>
>>> Note that these recommended orders are expected to be static for a given
>>> boot of the system, and so the keyword "auto" was deliberately not used,
>>> as I want to reserve it for a possible future use where the "best" order
>>> is chosen more dynamically at runtime.
>>>
>>> Recommended orders are determined as follows:
>>>    - PMD_ORDER: The traditional THP size
>>>    - arch_wants_pte_order() if implemented by the arch
>>>    - PAGE_ALLOC_COSTLY_ORDER: The largest order kept on per-cpu free list
>>>
>>> arch_wants_pte_order() can be overridden by the architecture if desired.
>>> Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
>>> set of ptes map physically contiguous, naturally aligned memory, so this
>>> mechanism allows the architecture to optimize as required.
>>>
>>> Here we add the default implementation of arch_wants_pte_order(), used
>>> when the architecture does not define it, which returns -1, implying
>>> that the HW has no preference.
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>   Documentation/admin-guide/mm/transhuge.rst |  4 ++++
>>>   include/linux/pgtable.h                    | 13 +++++++++++++
>>>   mm/huge_memory.c                           | 14 +++++++++++---
>>>   3 files changed, 28 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
>>> index 732c3b2f4ba8..d6363d4efa3a 100644
>>> --- a/Documentation/admin-guide/mm/transhuge.rst
>>> +++ b/Documentation/admin-guide/mm/transhuge.rst
>>> @@ -187,6 +187,10 @@ pages (=16K if the page size is 4K). The example above enables order-9
>>>   By enabling multiple orders, allocation of each order will be
>>>   attempted, highest to lowest, until a successful allocation is made.
>>>   If the PMD-order is unset, then no PMD-sized THPs will be allocated.
>>> +It is also possible to enable the recommended set of orders, which
>>> +will be optimized for the architecture and mm::
>>> +
>>> +     echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders
>>>
>>>   The kernel will ignore any orders that it does not support so read the
>>>   file back to determine which orders are enabled::
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index af7639c3b0a3..0e110ce57cc3 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -393,6 +393,19 @@ static inline void arch_check_zapped_pmd(struct vm_area_struct *vma,
>>>   }
>>>   #endif
>>>
>>> +#ifndef arch_wants_pte_order
>>> +/*
>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
>>> + * PMD_ORDER) and must not be order-1 since THP requires large folios to be at
>>> + * least order-2. Negative value implies that the HW has no preference and mm
>>> + * will choose its own default order.
>>> + */
>>> +static inline int arch_wants_pte_order(void)
>>> +{
>>> +     return -1;
>>> +}
>>> +#endif
>>> +
>>>   #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>>   static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>>                                      unsigned long address,
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index bcecce769017..e2e2d3906a21 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -464,10 +464,18 @@ static ssize_t anon_orders_store(struct kobject *kobj,
>>>       int err;
>>>       int ret = count;
>>>       unsigned int orders;
>>> +     int arch;
>>>
>>> -     err = kstrtouint(buf, 0, &orders);
>>> -     if (err)
>>> -             ret = -EINVAL;
>>> +     if (sysfs_streq(buf, "recommend")) {
>>> +             arch = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>>> +             orders = BIT(arch);
>>> +             orders |= BIT(PAGE_ALLOC_COSTLY_ORDER);
>>> +             orders |= BIT(PMD_ORDER);
>>> +     } else {
>>> +             err = kstrtouint(buf, 0, &orders);
>>> +             if (err)
>>> +                     ret = -EINVAL;
>>> +     }
>>>
>>>       if (ret > 0) {
>>>               orders &= THP_ORDERS_ALL_ANON;
>>
>> :/ don't really like that. Regarding my proposal, one could have
>> something like that in an "auto" setting for the "enabled" value, or a
>> "recommended" setting [not sure].
> 
> Me neither.
> 
> Again, this is something I call random -- we only discussed "auto",
> and yes, the commit message above explains why "recommend" is used here,
> but it has never surfaced in previous discussions, has it?

The context in which we discussed "auto" was for a future aspiration to
automatically determine the order that should be used for a given allocation to
balance perf vs internal fragmentation.

The case we are talking about here is completely different; I had a pre-existing
feature from previous versions of the series, which would allow the arch to
specify its preferred order (originally proposed by Yu, IIRC). In moving the
allocation size decision to user space, I felt that we still needed a mechanism
whereby the arch could express its preference. And "recommend" is what I came up
with.
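
For the record, a worked example of what the anon_orders_store() hunk quoted
above computes for "recommend" (illustrative values: assuming arm64 with 4K
pages where arch_wants_pte_order() returns 4 for the contpte size,
PAGE_ALLOC_COSTLY_ORDER is 3 and PMD_ORDER is 9):

#include <stdio.h>

#define BIT(nr) (1u << (nr))

int main(void)
{
	int arch_wants_pte_order = 4;	/* assumption: arm64 contpte order, 4K pages */
	int page_alloc_costly_order = 3;
	int pmd_order = 9;
	int arch = arch_wants_pte_order > page_alloc_costly_order ?
		   arch_wants_pte_order : page_alloc_costly_order;
	unsigned int orders = BIT(arch) | BIT(page_alloc_costly_order) | BIT(pmd_order);

	/* prints "orders = 0x218": order-3 (32K), order-4 (64K) and order-9 (2M) */
	printf("orders = 0x%x\n", orders);
	return 0;
}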

All of the friction we are currently having is around this feature, I think?
Certainly the links you provided in the other thread all point to
conversations skirting around it. How about I just drop it for this initial
patch set? Just let user space decide what sizes it wants (per David's interface
proposal)? I can see I'm trying to get a square peg into a round hole.

> 
> If so, this reinforces what I said here [1].
> 
> [1] https://lore.kernel.org/mm-commits/CAOUHufYEKx5_zxRJkeqrmnStFjR+pVQdpZ40ATSTaxLA_iRPGw@mail.gmail.com/


^ permalink raw reply	[flat|nested] 140+ messages in thread


* Re: [PATCH v6 6/9] mm: thp: Add "recommend" option for anon_orders
  2023-10-09 11:45         ` Ryan Roberts
@ 2023-10-09 14:43           ` David Hildenbrand
  -1 siblings, 0 replies; 140+ messages in thread
From: David Hildenbrand @ 2023-10-09 14:43 UTC (permalink / raw)
  To: Ryan Roberts, Yu Zhao
  Cc: Andrew Morton, Matthew Wilcox, Yin Fengwei, Catalin Marinas,
	Anshuman Khandual, Yang Shi, Huang, Ying, Zi Yan,
	Luis Chamberlain, Itaru Kitayama, Kirill A. Shutemov,
	John Hubbard, David Rientjes, Vlastimil Babka, Hugh Dickins,
	linux-mm, linux-kernel, linux-arm-kernel

On 09.10.23 13:45, Ryan Roberts wrote:
> On 06/10/2023 23:28, Yu Zhao wrote:
>> On Fri, Oct 6, 2023 at 2:08 PM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> On 29.09.23 13:44, Ryan Roberts wrote:
>>>> In addition to passing a bitfield of folio orders to enable for THP,
>>>> allow the string "recommend" to be written, which has the effect of
>>>> causing the system to enable the orders preferred by the architecture
>>>> and by the mm. The user can see what these orders are by subsequently
>>>> reading back the file.
>>>>
>>>> Note that these recommended orders are expected to be static for a given
>>>> boot of the system, and so the keyword "auto" was deliberately not used,
>>>> as I want to reserve it for a possible future use where the "best" order
>>>> is chosen more dynamically at runtime.
>>>>
>>>> Recommended orders are determined as follows:
>>>>     - PMD_ORDER: The traditional THP size
>>>>     - arch_wants_pte_order() if implemented by the arch
>>>>     - PAGE_ALLOC_COSTLY_ORDER: The largest order kept on per-cpu free list
>>>>
>>>> arch_wants_pte_order() can be overridden by the architecture if desired.
>>>> Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
>>>> set of ptes map physically contiguous, naturally aligned memory, so this
>>>> mechanism allows the architecture to optimize as required.
>>>>
>>>> Here we add the default implementation of arch_wants_pte_order(), used
>>>> when the architecture does not define it, which returns -1, implying
>>>> that the HW has no preference.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>>    Documentation/admin-guide/mm/transhuge.rst |  4 ++++
>>>>    include/linux/pgtable.h                    | 13 +++++++++++++
>>>>    mm/huge_memory.c                           | 14 +++++++++++---
>>>>    3 files changed, 28 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
>>>> index 732c3b2f4ba8..d6363d4efa3a 100644
>>>> --- a/Documentation/admin-guide/mm/transhuge.rst
>>>> +++ b/Documentation/admin-guide/mm/transhuge.rst
>>>> @@ -187,6 +187,10 @@ pages (=16K if the page size is 4K). The example above enables order-9
>>>>    By enabling multiple orders, allocation of each order will be
>>>>    attempted, highest to lowest, until a successful allocation is made.
>>>>    If the PMD-order is unset, then no PMD-sized THPs will be allocated.
>>>> +It is also possible to enable the recommended set of orders, which
>>>> +will be optimized for the architecture and mm::
>>>> +
>>>> +     echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders
>>>>
>>>>    The kernel will ignore any orders that it does not support so read the
>>>>    file back to determine which orders are enabled::
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index af7639c3b0a3..0e110ce57cc3 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -393,6 +393,19 @@ static inline void arch_check_zapped_pmd(struct vm_area_struct *vma,
>>>>    }
>>>>    #endif
>>>>
>>>> +#ifndef arch_wants_pte_order
>>>> +/*
>>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
>>>> + * PMD_ORDER) and must not be order-1 since THP requires large folios to be at
>>>> + * least order-2. Negative value implies that the HW has no preference and mm
>>>> + * will choose its own default order.
>>>> + */
>>>> +static inline int arch_wants_pte_order(void)
>>>> +{
>>>> +     return -1;
>>>> +}
>>>> +#endif
>>>> +
>>>>    #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>>>    static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>>>                                       unsigned long address,
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index bcecce769017..e2e2d3906a21 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -464,10 +464,18 @@ static ssize_t anon_orders_store(struct kobject *kobj,
>>>>        int err;
>>>>        int ret = count;
>>>>        unsigned int orders;
>>>> +     int arch;
>>>>
>>>> -     err = kstrtouint(buf, 0, &orders);
>>>> -     if (err)
>>>> -             ret = -EINVAL;
>>>> +     if (sysfs_streq(buf, "recommend")) {
>>>> +             arch = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>>>> +             orders = BIT(arch);
>>>> +             orders |= BIT(PAGE_ALLOC_COSTLY_ORDER);
>>>> +             orders |= BIT(PMD_ORDER);
>>>> +     } else {
>>>> +             err = kstrtouint(buf, 0, &orders);
>>>> +             if (err)
>>>> +                     ret = -EINVAL;
>>>> +     }
>>>>
>>>>        if (ret > 0) {
>>>>                orders &= THP_ORDERS_ALL_ANON;
>>>
>>> :/ don't really like that. Regarding my proposal, one could have
>>> something like that in an "auto" setting for the "enabled" value, or a
>>> "recommended" setting [not sure].
>>
>> Me either.
>>
>> Again this is something I call random --  we only discussed "auto",
>> and yes, the commit message above explained why "recommended" here but
>> it has never surfaced in previous discussions, has it?
> 
> The context in which we discussed "auto" was for a future aspiration to
> automatically determine the order that should be used for a given allocation to
> balance perf vs internal fragmentation.
> 
> The case we are talking about here is completely different; I had a pre-existing
> feature from previous versions of the series, which would allow the arch to
> specify its preferred order (originally proposed by Yu, IIRC). In moving the
> allocation size decision to user space, I felt that we still needed a mechanism
> whereby the arch could express its preference. And "recommend" is what I came up
> with.
> 
> All of the friction we are currently having is around this feature, I think?
> Certainly all the links you provided in the other thread all point to
> conversations skirting around it. How about I just drop it for this initial
> patch set? Just let user space decide what sizes it wants (per David's interface
> proposal)? I can see I'm trying to get a square peg into a round hole.

Dropping it for the initial patch set sounds like a very good idea. 
Telling people what to enable initially when they want to play with it 
will work out just fine.
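
(E.g., with the per-size files I proposed earlier, something like the below --
a hypothetical path until any of that is actually merged:

    echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
)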

[Ideally, we plan ahead to have such "auto" settings in the future, as I 
expressed.]

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 140+ messages in thread


* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-10-09 11:28     ` Ryan Roberts
@ 2023-10-09 16:22       ` David Hildenbrand
  -1 siblings, 0 replies; 140+ messages in thread
From: David Hildenbrand @ 2023-10-09 16:22 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: linux-mm, linux-kernel, linux-arm-kernel

[...]

>>
>> I dislike exposing "orders" to the users, I'm happy to be convinced why I am
>> wrong and it is a good idea.
>>
>> So maybe "Small THP"/"Small-sized THP" is better. Or "Medium-sized THP" -- as
>> said, I think FreeBSD tends to call it "Medium-sized superpages". But what's
>> small/medium/large is debatable. "Small" implies at least that it's smaller than
>> what we used to know, which is a fact.
>>
>> Can we also now use the terminology consistently? (e.g., "variable-order, large
>> folios for anonymous memory" -> "Small-sized anonymous THP", you can just point
>> at the previous patch set name in the cover letter)
> 
> Yes absolutely. FWIW, I was deliberately not changing the title of the patchset
> so people could easily see it was an evolution of something posted before. But
> if it's the norm to change the title as the patchset evolves, I'm very happy to
> do that. And there are other places too, in commit logs that I can tidy up. I
> will assume "PMD-sized THP", "small-sized THP" and "anonymous small-sized THP"
> (that last one slightly different from what David suggested above - it means
> "small-sized THP" can still be grepped) unless others object.

Absolutely fine with me. Hoping other people will object when I talk 
nonsense or my suggestions don't make any sense.

Or even better, propose something better :)

> 
>>
>>>
>>> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
>>>      pages, there are efficiency savings to be had; fewer page faults, batched PTE
>>>      and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
>>>      overhead. This should benefit all architectures.
>>> 2) Since we are now mapping physically contiguous chunks of memory, we can take
>>>      advantage of HW TLB compression techniques. A reduction in TLB pressure
>>>      speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>>>      TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>>>
>>> The major change in this revision is the addition of sysfs controls to allow
>>> this "small-order THP" to be enabled/disabled/configured independently of
>>> PMD-order THP. The approach I've taken differs a bit from previous discussions;
>>> instead of creating a whole new interface ("large_folio"), I'm extending THP. I
>>> personally think this makes things clearer and more extensible. See [6] for
>>> detailed rationale.
>>
>> Change 2: sysfs interface.
>>
>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
>> agree.
>>
>> What we expose there and how, is TBD. Again, not a friend of "orders" and
>> bitmaps at all. We can do better if we want to go down that path.
>>
>> Maybe we should take a look at hugetlb, and how they added support for multiple
>> sizes. What *might* make sense could be (depending on which values we actually
>> support!)
>>
>>
>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>>
>> Each one would contain an "enabled" and "defrag" file. We want something minimal
>> first? Start with the "enabled" option.
>>
>>
>> enabled: always [global] madvise never
>>
>> Initially, we would set it for PMD-sized THP to "global" and for everything else
>> to "never".
> 
> My only reservation about this approach is the potential for a future need for a
> "one setting applied across all sizes" class of control (e.g. "auto"). I think
> we agreed in the previous meetings that chasing a solution for "auto" was a good
> aspiration to have, so it would be good to have a place where we can insert that in
> future. The main reason why I chose to expose the "anon_orders" control is
> because it is possible to both enable/disable the various sizes as well as
> specify (e.g.) "auto", without creating redundancy. But I agree that ideally
> we wouldn't expose orders to the user; I was attempting a compromise to simplify
> the "auto" case.
> 
> A potential (though feels quite complex) solution to make auto work with your
> proposal: Add "auto" as an option to the existing global enabled file, and to
> all of your proposed new enabled files. But its only possible to *set* auto
> through the global file. And when it is set, all of the size-specific enabled
> files read back "auto" too. And any writes to the size-specific enabled files
> are ignored (or remembered but not enacted) until the global enabled file is
> changed away from auto.

Yes, I think there are various ways forward regarding that. Or to enable 
"auto" mode only once all are "auto", and as soon as one is not "auto", 
just disable it. A simple

echo "auto" > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled

Would do to enable it. Or, have them all be "global" and have a global 
"auto" mode as you raised.

echo "global" > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled
echo "auto" > /sys/kernel/mm/transparent_hugepage/enabled

> 
> But I'm not sure if adding a new option to the global enabled file might break
> compat?

I think we used to extend the "defrag" option, see

commit 21440d7eb9044001b7fdb71d0163689f60a0f2a1
Author: David Rientjes <rientjes@google.com>
Date:   Wed Feb 22 15:45:49 2017 -0800

     mm, thp: add new defer+madvise defrag option


So I suspect we could extend that one in a similar way.
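
For reference, that file already takes a multi-value enum today (output
illustrative, the selected value depends on the system):

    cat /sys/kernel/mm/transparent_hugepage/defrag
    always defer defer+madvise [madvise] never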

But again, this is just the thing that came to mind when thinking about 
how to:
a) avoid orders
b) make it configurable and future-proof
c) make it look a bit consistent with other interfaces (hugetlb and
    existing thp)
d) still prepare for an auto mode that we want in the future

I'm happy to hear other ideas.

> 
>>
>>
>>
>> That sounds reasonable at least to me, and we would be using what we learned
>> from THP (as John suggested).  That still gives reasonable flexibility without
>> going too wild, and a better IMHO interface.
>>
>> I understand Yu's point about ABI discussions and "0 knobs". I'm happy as long
>> as we can have something that won't hurt us later and still be able to use this
>> in distributions within a reasonable timeframe. Enabling/disabling individual
>> sizes does not sound too restrictive to me. And we could always add an "auto"
>> setting later and default to that with a new kconfig knob.
>>
>> If someone wants to configure it, why not. Let's just prepare a way to handle
>> this "better" automatically in the future (if ever ...).
>>
>>
>> Change 3: Stats
>>
>>> /proc/meminfo:
>>>     Introduce new "AnonHugePteMap" field, which reports the amount of
>>>     memory (in KiB) mapped from large folios globally (similar to
>>>     AnonHugePages field).
>>
>> AnonHugePages is and remains "PMD-sized THP that is mapped using a PMD", I think
>> we all agree on that. It should have been named "AnonPmdMapped" or
>> "AnonHugePmdMapped", too bad, we can't change that.
> 
> Yes agreed. I did consider redefining "AnonHugePages" to cover PMD- and
> PTE-mapped memory, then introduce both an "AnonHugePmdMapped" and
> "AnonHugePteMapped", but I think that would likely break things. Its further
> complicated because vmstats prints it in PMD-size units, so can't represent
> PTE-mapped memory in that counter.

:/

> 
>>
>> "AnonHugePteMap" better be "AnonHugePteMapped".
> 
> I agree, but I went with the shorter one because anything longer would misalign
> the value, e.g.:
> 
>      AnonHugePages:         0 kB
>      AnonHugePteMapped:        0 kB
>      ShmemPmdMapped:        0 kB
>      Shared_Hugetlb:        0 kB
> 

Can't that be handled? We surely have long stuff in there:

HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:  1081344 kB
ShmemPmdMapped:        0 kB

HardwareCorrupted has same length as AnonHugePteMapped

But I'm not convinced about "AnonHugePteMapped" yet :)

> So would need to decide which is preferable, or come up with a shorter name.
> 
>>
>> But, I wonder if we want to expose this "PteMapped" to user space *at all*. Why
>> should they care if it's PTE mapped? For PMD-sized THP it makes a bit of sense,
>> because !PMD implied !performance, and one might have been able to troubleshoot
>> that somehow. For PTE-mapped, it doesn't make much sense really, they are always
>> PTE-mapped.
> 
> I disagree; I've been using it a lot to debug performance issues. It tells you
> how much of your anon memory is allocated with large folios. And making that
> percentage bigger improves performance; fewer page faults, and with a separate
> contpte series on arm64, better use of the TLB. Reasons might include; poorly
> aligned/too small VMAs, memory fragmentation preventing allocation, CoW, etc.

Just because a small-sized THP is PTE-mapped doesn't tell you anything, 
really. What you want to know is if it is "completely" and 
"consecutively" mapped such that the HW can actually benefit from it -- 
if HW even supports it. So "PTE-mapped THP" is just part of the story. 
And that's where it gets tricky I think.

I agree that it's good for debugging, but then maybe it should a) live 
somewhere else (debugfs, bucketing below) and b) be consistent with 
other THPs, meaning we also want similar stats somewhere.

One idea would be to expose such stats in a R/O fashion like 
"nr_allocated" or "nr_hugepages" in 
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/ and friends. Of 
course, maybe tagging them with "anon" prefix.
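
Roughly like this (hypothetical paths and file names, nothing of this exists
today -- just to make the idea concrete):

    cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/anon_nr_allocated
    cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/anon_nr_allocated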

> 
> I would actually argue for adding similar counters for file-backed memory too
> for the same reasons. (I actually posted an independent patch a while back that
> did this for file- and anon- memory, bucketed by size. But I think the idea of
> the bucketing was NAKed.)

For debugging, I *think* it might be valuable to see how many THP of 
each size are allocated. Tracking exactly "how is it mapped" is not easy 
to achieve as we learned. PMD-mapped was easy, but also requires us to 
keep doing that tracking for all eternity ...

Do you have a pointer to the patch set? Did it try to squeeze it into 
/proc/meminfo?

> 
>>
>> That also raises the question how you would account a PTE-mapped THP. The hole
>> thing? Only the parts that are mapped? Let's better not go down that path.
> 
> The approach I've taken in this series is the simple one - account every page
> that belongs to a large folio from when it is first mapped to last unmapped.
> Yes, in this case, you might not actually be mapping the full thing
> contiguously. But it gives a good indication.
> 
> I also considered accounting the whole folio only when all of its pages become
> mapped (although not worrying about them all being contiguous). That's still
> simple to implement for all counters except smaps. So went with the simplest
> approach with the view that its "good enough".

If you take a look at "ShmemHugePages" and "FileHugePages", there we 
actually track them when they get allocated+freed, which is much easier 
than tracking when/how they are (un)mapped. But it's only done for 
PMD-sized THP for now.
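
E.g. (these fields exist in /proc/meminfo today; the values are illustrative):

    grep -E 'ShmemHugePages|FileHugePages' /proc/meminfo
    ShmemHugePages:  1081344 kB
    FileHugePages:         0 kB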

> 
>>
>> That leaves the question why we would want to include them here at all in a
>> special PTE-mapped way?
>>
>>
>> Again, let's look at hugetlb: I prepared 1 GiB and one 2 MiB page.
>>
>> HugePages_Total:       1
>> HugePages_Free:        1
>> HugePages_Rsvd:        0
>> HugePages_Surp:        0
>> Hugepagesize:       2048 kB
>> Hugetlb:         1050624 kB
>>
>> -> Only the last one gives the sum, the other stats don't even mention the other
>> ones. [how do we get their stats, if at all?]
> 
> There are some files in /sys/kernel/mm/hugepages/hugepages-XXkB and
> /sys/devices/system/node/node*/hugepages/; nr_hugepages, free_hugepages,
> surplus_hugepages. But this interface also constitutes the allocator, not just
> stats, I think.

Ah, I missed that we expose free vs. reserved vs. surplus ... there as
well; I thought we would only have "nr_hugepages".
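
E.g., per size and per node (existing hugetlb sysfs paths, shown here only for
reference):

    cat /sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages
    cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages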

> 
>>
>> So maybe, we only want a summary of how many anon huge pages of any size are
>> allocated (independent of the PTE vs. PMD mapping),
> 
> Are you proposing (AnonHugePages + AnonHugePteMapped) here or something else? If
> the former, then I don't really see the difference. We have to continue to
> expose PMD-size (AnonHugePages). So either add PTE-only counter, and derive the
> total, or add a total counter and derive PTE-only. I suspect I've misunderstood
> your point.

I don't think we should go down the "PteMapped" path. Probably we want 
"bucketing" stats as you said, and maybe a global one that just combines 
everything (any THP). But naming will be difficult.

> 
>> and some other source to
>> eventually inspect how the different sizes behave.
>>
>> But note that for non-PMD-sized file THP we don't even have special counters!
>> ... so maybe we should also defer any such stats and come up with something
>> uniform for all types of non-PMD-sized THP.
> 
> Indeed, I can see benefit in adding these for file THP - in fact I have a patch
> that does exactly that to help my development work. I had envisaged that we
> could add something like FileHugePteMapped, ShmemHugePteMapped that would follow
> the same semantics as AnonHugePteMapped.

Again, maybe we can find something that does not involve the "PteMapped" 
terminology and just gives us a big total of "allocated" THP. For 
detailed stats for debugging, maybe we can just use a different 
interface then.

> 
>>
>>
>> Same discussion applies to all other stats.
>>
>>
>>>
>>> Because we now have runtime enable/disable control, I've removed the compile
>>> time Kconfig switch. It still defaults to runtime-disabled.
>>>
>>> NOTE: These changes should not be merged until the prerequisites are complete.
>>> These are in progress and tracked at [7].
>>
>> We should probably list them here, and classify which one we see as strict a
>> requirement, which ones might be an optimization.
> 
> 
> I'll need some help with classifying them, so showing my working. Final list that
> I would propose as strict requirements at bottom.
> 
> This is my list with status, as per response to Yu in other thread:
> 
>    - David is working on "shared vs exclusive mappings"

Probably "COW reuse support" is a separate item, although my approach 
would cover that.

The question is, if the estimate we're using in most code for now would 
at least be sufficient to merge it. The estimate is easily wrong, but we 
do have that issue with PTE-mapped THP already.

But that argument probably applies to most things here: the difference 
is that PTE-mapped THP are not the default, that's why nobody really cared.

[I'm playing with an almost-lockless scheme right now and hope I have 
something running soonish -- as you know, I got distracted]

>    - Zi Yan has posted an RFC for compaction
>    - Yin Fengwei's mlock series is now in mm-stable
>    - Yin Fengwei's madvise series is in 6.6
>    - I've reworked and posted a series for deferred_split_folio; although I've
>      deprioritized it because Yu said it wasn't really a pre-requisite.
>    - numa balancing depends on David's "shared vs exclusive mappings" work
>    - I've started looking at "large folios in swap cache" in the background,
>      because I'm seeing some slow down with large folios, but we also agreed that
>      wasn't a prerequisite
> 

Probably it would be good to talk about the items and how we would 
classify them in a meeting.


> Although, since sending that, I've determined that when running kernel
> compilation across a high number of cores on arm64, the cost of splitting the
> folios gets large due to needing to broadcast the extra TLBIs. So I think the
> last point on that list may be a prerequisite after all. (I've been able to fix
> this by adding support for allocating large folios in the swap file, and
> avoiding the split - planning to send RFC this week).
> 
> There is also this set of things that you mentioned against "shared vs exclusive
> mappings", which I'm not sure if you are planning to cover as part of your work
> or if they are follow on things that will need to be done:
> 
> (1) Detecting shared folios, to not mess with them while they are shared.
>      MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ...
>      replace cases where folio_estimated_sharers() == 1 would currently be the
>      best we can do (and in some cases, page_mapcount() == 1).
> 
> And I recently discovered that khugepaged doesn't collapse file-backed pages to
> a PMD-size THP if they belong to a large folio, so I'm guessing it may also
> suffer the same behaviour for anon memory. I'm not sure if that's what your
> "khugepaged ..." comment refers to?

Yes. But I did not look into all the details yet.

"kuhepaged" collapse support to small-sized THP is probably also a very 
imporant item, although it might be less relevant than for PMD -- and I 
consider it future work. See below.

> 
> So taking all that and trying to put together a complete outstanding list for
> strict requirements:
> 
>    - Shared vs Exclusive Mappings (DavidH)
>        - user-triggered page migration
>        - NUMA hinting/balancing
>        - Enhance khugepaged to collapse to PMD-size from PTE-mapped large folios
>    - Compaction of Large Folios (Zi Yan)
>    - Swap out small-size THP without Split (Ryan Roberts)

^ that's going to be tough, I can promise. And the only way to live 
without that would be khugepaged support. (because that's how it's all 
working for PMD-sized THP after all!)

Once a PMD-sized THP was swapped out and evicted, it will always come 
back in order-0 folios. khugepaged will re-collapse into PMD-sized chunks.
If we could do that for PTE-sized THP as well ...

> 
> 
>>
>>
>> Now, these are just my thoughts, and I'm happy about other thoughts.
> 
> As always, thanks for taking the time - I really appreciate it.

Sure. Hoping others can comment.

My gut feeling is that it's best to focus on getting the sysfs interface 
right+future proof and handling the stats independently. While being a 
good debug mechanism, I wouldn't consider these stats a requirement: we 
don't have them for file/shmem small-sized THP so far either.

So maybe really better to handle the stats in meminfo and friends 
separately.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
@ 2023-10-09 16:22       ` David Hildenbrand
  0 siblings, 0 replies; 140+ messages in thread
From: David Hildenbrand @ 2023-10-09 16:22 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: linux-mm, linux-kernel, linux-arm-kernel

[...]

>>
>> I dislike exposing "orders" to the users, I'm happy to be convinced why I am
>> wrong and it is a good idea.
>>
>> So maybe "Small THP"/"Small-sized THP" is better. Or "Medium-sized THP" -- as
>> said, I think FreeBSD tends to call it "Medium-sized superpages". But what's
>> small/medium/large is debatable. "Small" implies at least that it's smaller than
>> what we used to know, which is a fact.
>>
>> Can we also now use the terminology consistently? (e.g., "variable-order, large
>> folios for anonymous memory" -> "Small-sized anonymous THP", you can just point
>> at the previous patch set name in the cover letter)
> 
> Yes absolutely. FWIW, I was deliberately not changing the title of the patchset
> so people could easily see it was an evolution of something posted before. But
> if it's the norm to change the title as the patchset evolves, I'm very happy to
> do that. And there are other places too, in commit logs that I can tidy up. I
> will assume "PMD-sized THP", "small-sized THP" and "anonymous small-sized THP"
> (that last one slightly different from what David suggested above - it means
> "small-sized THP" can still be grepped) unless others object.

Absolutely fine with me. Hoping other people will object when I talk 
nonsense or my suggestions don't make any sense.

Or even better, propose something better :)

> 
>>
>>>
>>> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
>>>      pages, there are efficiency savings to be had; fewer page faults, batched PTE
>>>      and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
>>>      overhead. This should benefit all architectures.
>>> 2) Since we are now mapping physically contiguous chunks of memory, we can take
>>>      advantage of HW TLB compression techniques. A reduction in TLB pressure
>>>      speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>>>      TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>>>
>>> The major change in this revision is the addition of sysfs controls to allow
>>> this "small-order THP" to be enabled/disabled/configured independently of
>>> PMD-order THP. The approach I've taken differs a bit from previous discussions;
>>> instead of creating a whole new interface ("large_folio"), I'm extending THP. I
>>> personally think this makes things clearer and more extensible. See [6] for
>>> detailed rationale.
>>
>> Change 2: sysfs interface.
>>
>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
>> agree.
>>
>> What we expose there and how, is TBD. Again, not a friend of "orders" and
>> bitmaps at all. We can do better if we want to go down that path.
>>
>> Maybe we should take a look at hugetlb, and how they added support for multiple
>> sizes. What *might* make sense could be (depending on which values we actually
>> support!)
>>
>>
>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>>
>> Each one would contain an "enabled" and "defrag" file. We want something minimal
>> first? Start with the "enabled" option.
>>
>>
>> enabled: always [global] madvise never
>>
>> Initially, we would set it for PMD-sized THP to "global" and for everything else
>> to "never".
> 
> My only reservation about this approach is the potential for a future need for a
> "one setting applied across all sizes" class of control (e.g. "auto"). I think
> we agreed in the previous meetings that chasing a solution for "auto" was a good
> aspiration to have, so it would be good to have a place we we can insert that in
> future. The main reason why I chose to expose the "anon_orders" control is
> because it is possible to both enable/disable the various sizes as well as
> specificy (e.g.) "auto", without creating redundancy. But I agree that ideally
> we wouldn't expose orders to the user; I was attempting a compromise to simplify
> the "auto" case.
> 
> A potential (though feels quite complex) solution to make auto work with your
> proposal: Add "auto" as an option to the existing global enabled file, and to
> all of your proposed new enabled files. But its only possible to *set* auto
> through the global file. And when it is set, all of the size-specific enabled
> files read-back "auto" too. Any any writes to the size-specific enabled files
> are ignored (or remembered but not enacted) until the global enabled file is
> changed away from auto.

Yes, I think there are various ways forward regarding that. Or to enable 
"auto" mode only once all are "auto", and as soon as one is not "auto", 
just disable it. A simple

echo "auto" > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled

Would do to enable it. Or, have them all be "global" and have a global 
"auto" mode as you raised.

echo "global" > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled
echo "auto" > /sys/kernel/mm/transparent_hugepage/enabled

> 
> But I'm not sure if adding a new option to the global enabled file might break
> compat?

I think we used to extend the "defrag" option, see

commit 21440d7eb9044001b7fdb71d0163689f60a0f2a1
Author: David Rientjes <rientjes@google.com>
Date:   Wed Feb 22 15:45:49 2017 -0800

     mm, thp: add new defer+madvise defrag option


So I suspect we could extend that one in a similar way.

But again, this is just the thing that came to mind when thinking about 
how to:
a) avoid orders
b) make it configurable and future-proof
c) make it look a bit consistent with other interfaces (hugetlb and
    existing thp)
d) still prepare for an auto mode that we want in the future

I'm happy to hear other ideas.

> 
>>
>>
>>
>> That sounds reasonable at least to me, and we would be using what we learned
>> from THP (as John suggested).  That still gives reasonable flexibility without
>> going too wild, and a better IMHO interface.
>>
>> I understand Yu's point about ABI discussions and "0 knobs". I'm happy as long
>> as we can have something that won't hurt us later and still be able to use this
>> in distributions within a reasonable timeframe. Enabling/disabling individual
>> sizes does not sound too restrictive to me. And we could always add an "auto"
>> setting later and default to that with a new kconfig knob.
>>
>> If someone wants to configure it, why not. Let's just prepare a way to to handle
>> this "better" automatically in the future (if ever ...).
>>
>>
>> Change 3: Stats
>>
>>> /proc/meminfo:
>>>     Introduce new "AnonHugePteMap" field, which reports the amount of
>>>     memory (in KiB) mapped from large folios globally (similar to
>>>     AnonHugePages field).
>>
>> AnonHugePages is and remains "PMD-sized THP that is mapped using a PMD", I think
>> we all agree on that. It should have been named "AnonPmdMapped" or
>> "AnonHugePmdMapped", too bad, we can't change that.
> 
> Yes agreed. I did consider redefining "AnonHugePages" to cover PMD- and
> PTE-mapped memory, then introduce both an "AnonHugePmdMapped" and
> "AnonHugePteMapped", but I think that would likely break things. Its further
> complicated because vmstats prints it in PMD-size units, so can't represent
> PTE-mapped memory in that counter.

:/

> 
>>
>> "AnonHugePteMap" better be "AnonHugePteMapped".
> 
> I agree, but I went with the shorter one because any longer and it would unalign
> the value e.g:
> 
>      AnonHugePages:         0 kB
>      AnonHugePteMapped:        0 kB
>      ShmemPmdMapped:        0 kB
>      Shared_Hugetlb:        0 kB
> 

Can't that be handled? We surely have long stuff in there:

HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:  1081344 kB
ShmemPmdMapped:        0 kB

HardwareCorrupted has same length as AnonHugePteMapped

But I'm not convinced about "AnonHugePteMapped" yet :)

> So would need to decide which is preferable, or come up with a shorter name.
> 
>>
>> But, I wonder if we want to expose this "PteMapped" to user space *at all*. Why
>> should they care if it's PTE mapped? For PMD-sized THP it makes a bit of sense,
>> because !PMD implied !performance, and one might have been able to troubleshoot
>> that somehow. For PTE-mapped, it doesn't make much sense really, they are always
>> PTE-mapped.
> 
> I disagree; I've been using it a lot to debug performance issues. It tells you
> how much of your anon memory is allocated with large folios. And making that
> percentage bigger improves performance; fewer page faults, and with a separate
> contpte series on arm64, better use of the TLB. Reasons might include; poorly
> aligned/too small VMAs, memory fragmentation preventing allocation, CoW, etc.

Just because a small-sized THP is PTE-mapped doesn't tell you anything, 
really. What you want to know is if it is "completely" and 
"consecutively" mapped such that the HW can actually benefit from it -- 
if HW even supports it. So "PTE-mapped THP" is just part of the story. 
And that's where it gets tricky I think.

I agree that it's good for debugging, but then maybe it should a) live 
somewhere else (debugfs, bucketing below) and b) be consistent with 
other THPs, meaning we also want similar stats somewhere.

One idea would be to expose such stats in a R/O fashion like 
"nr_allocated" or "nr_hugepages" in 
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/ and friends. Of 
course, maybe tagging them with "anon" prefix.

> 
> I would actually argue for adding similar counters for file-backed memory too
> for the same reasons. (I actually posted an independent patch a while back that
> did this for file- and anon- memory, bucketted by size. But I think the idea of
> the bucketting was NAKed.
For debugging, I *think* it might be valuable to see how many THP of 
each size are allocated. Tracking exactly "how is it mapped" is not easy 
to achieve as we learned. PMD-mapped was easy, but also requires us to 
keep doing that tracking for all eternity ...

Do you have a pointer to the patch set? Did it try to squeeze it into 
/proc/meminfo?

> 
>>
>> That also raises the question how you would account a PTE-mapped THP. The hole
>> thing? Only the parts that are mapped? Let's better not go down that path.
> 
> The approach I've taken in this series is the simple one - account every page
> that belongs to a large folio from when it is first mapped to last unmapped.
> Yes, in this case, you might not actually be mapping the full thing
> contigiously. But it gives a good indication.
> 
> I also considered accounting the whole folio only when all of its pages become
> mapped (although not worrying about them all being contiguous). That's still
> simple to implement for all counters except smaps. So went with the simplest
> approach with the view that its "good enough".

If you take a look at "ShmemHugePages" and "FileHugePages", there we 
actually track them when they get allocated+freed, which is much easier 
than tracking when/how they are (un)mapped. But it's only done for 
PMD-sized THP for now.

> 
>>
>> That leaves the question why we would want to include them here at all in a
>> special PTE-mapped way?
>>
>>
>> Again, let's look at hugetlb: I prepared 1 GiB and one 2 MiB page.
>>
>> HugePages_Total:       1
>> HugePages_Free:        1
>> HugePages_Rsvd:        0
>> HugePages_Surp:        0
>> Hugepagesize:       2048 kB
>> Hugetlb:         1050624 kB
>>
>> -> Only the last one gives the sum, the other stats don't even mention the other
>> ones. [how do we get their stats, if at all?]
> 
> There are some files in /sys/kernel/mm/hugepages/hugepages-XXkB and
> /sys/devices/system/node/node*/hugepages/; nr_hugepages, free_hugepages,
> surplus_hugepages. But this interface also constitutes the allocator, not just
> stats, I think.

Ah, I missed that we expose free vs. reserved vs. surpluse ... there as 
well; I thought we would only have "nr_hugepages".

> 
>>
>> So maybe, we only want a summary of how many anon huge pages of any size are
>> allocated (independent of the PTE vs. PMD mapping),
> 
> Are you proposing (AnonHugePages + AnonHugePteMapped) here or something else? If
> the former, then I don't really see the difference. We have to continue to
> expose PMD-size (AnonHugePages). So either add PTE-only counter, and derive the
> total, or add a total counter and derive PTE-only. I suspect I've misunderstood
> your point.

I don't think we should go down the "PteMapped" path. Probably we want 
"bucketing" stats as you said, and maybe a global one that just combines 
everything (any THP). But naming will be difficult.

> 
>> and some other source to
>> eventually inspect how the different sizes behave.
>>
>> But note that for non-PMD-sized file THP we don't even have special counters!
>> ... so maybe we should also defer any such stats and come up with something
>> uniform for all types of non-PMD-sized THP.
> 
> Indeed, I can see benefit in adding these for file THP - in fact I have a patch
> that does exactly that to help my development work. I had envisaged that we
> could add something like FileHugePteMapped, ShmemHugePteMapped that would follow
> the same semantics as AnonHugePteMapped.

Again, maybe we can find something that does not involve the "PteMapped" 
terminology and just gives us a big total of "allocated" THP. For 
detailed stats for debugging, maybe we can just use a different 
interface then.

> 
>>
>>
>> Sane discussion applies to all other stats.
>>
>>
>>>
>>> Because we now have runtime enable/disable control, I've removed the compile
>>> time Kconfig switch. It still defaults to runtime-disabled.
>>>
>>> NOTE: These changes should not be merged until the prerequisites are complete.
>>> These are in progress and tracked at [7].
>>
>> We should probably list them here, and classify which one we see as strict a
>> requirement, which ones might be an optimization.
> 
> 
> I'll need some help with clasifying them, so showing my working. Final list that
> I would propose as strict requirements at bottom.
> 
> This is my list with status, as per response to Yu in other thread:
> 
>    - David is working on "shared vs exclusive mappings"

Probably "COW reuse support" is a separate item, although my approach 
would cover that.

The question is, if the estimate we're using in most code for now would 
at least be sufficient to merge it. The estimate is easily wrong, but we 
do have that issue with PTE-mapped THP already.

But that argument probably applies to most things here: the difference 
is that PTE-mapped THP are not the default, that's why nobody really cared.

[I'm playing with an almost-lockless scheme right now and hope I have 
something running soonish -- as you know, I got distracted]

>    - Zi Yan has posted an RFC for compaction
>    - Yin Fengwei's mlock series is now in mm-stable
>    - Yin Fengwei's madvise series is in 6.6
>    - I've reworked and posted a series for deferred_split_folio; although I've
>      deprioritied it because Yu said it wasn't really a pre-requisite.
>    - numa balancing depends on David's "shared vs exclusive mappings" work
>    - I've started looking at "large folios in swap cache" in the background,
>      because I'm seeing some slow down with large folios, but we also agreed that
>      wasn't a prerequisite
> 

Probably it would be good to talk about the items and how we would 
classify them in a meeting.


> Although, since sending that, I've determined that when running kernel
> compilation across high number of cores on arm64, the cost of splitting the
> folios gets large due to needing to broadcast the extra TLBIs. So I think the
> last point on that list may be a prerequisite after all. (I've been able to fix
> this by adding support for allocating large folios in the swap file, and
> avoiding the split - planning to send RFC this week).
> 
> There is also this set of things that you mentioned against "shared vs exclusive
> mappings", which I'm not sure if you are planning to cover as part of your work
> or if they are follow on things that will need to be done:
> 
> (1) Detecting shared folios, to not mess with them while they are shared.
>      MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ...
>      replace cases where folio_estimated_sharers() == 1 would currently be the
>      best we can do (and in some cases, page_mapcount() == 1).
> 
> And I recently discovered that khugepaged doesn't collapse file-backed pages to
> a PMD-size THP if they belong to a large folio, so I'm guessing it may also
> suffer the same behaviour for anon memory. I'm not sure if that's what your
> "khugepaged ..." comment refers to?

Yes. But I did not look into all the details yet.

"kuhepaged" collapse support to small-sized THP is probably also a very 
imporant item, although it might be less relevant than for PMD -- and I 
consider it future work. See below.

> 
> So taking all that and trying to put together a complete outstanding list for
> strict requirements:
> 
>    - Shared vs Exclusive Mappings (DavidH)
>        - user-triggered page migration
>        - NUMA hinting/balancing
>        - Enhance khugepaged to collapse to PMD-size from PTE-mapped large folios
>    - Compaction of Large Folios (Zi Yan)
>    - Swap out small-size THP without Split (Ryan Roberts)

^ that's going to be tough, I can promise. And the only way to live 
without that would be khugepaged support. (because that's how it's all 
working for PMD-sized THP after all!)

Once a PMD-sized THP was swapped out and evicted, it will always come 
back in order-0 folios. khugepaged will re-collapse into PMD-sized chunks. 
If we could do that for PTE-sized THP as well ...

> 
> 
>>
>>
>> Now, these are just my thoughts, and I'm happy about other thoughts.
> 
> As always, thanks for taking the time - I really appreciate it.

Sure. Hoping others can comment.

My gut feeling is that it's best to focus on getting the sysfs interface 
right and future-proof, and on handling the stats independently. While they 
are a good debug mechanism, I wouldn't consider these stats a requirement: we 
don't have them for file/shmem small-sized THP so far either.

So maybe really better to handle the stats in meminfo and friends 
separately.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 6/9] mm: thp: Add "recommend" option for anon_orders
  2023-10-09 11:45         ` Ryan Roberts
@ 2023-10-09 20:04           ` Yu Zhao
  -1 siblings, 0 replies; 140+ messages in thread
From: Yu Zhao @ 2023-10-09 20:04 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: David Hildenbrand, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	Catalin Marinas, Anshuman Khandual, Yang Shi, Huang, Ying,
	Zi Yan, Luis Chamberlain, Itaru Kitayama, Kirill A. Shutemov,
	John Hubbard, David Rientjes, Vlastimil Babka, Hugh Dickins,
	linux-mm, linux-kernel, linux-arm-kernel, Suren Baghdasaryan

On Mon, Oct 9, 2023 at 5:45 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 06/10/2023 23:28, Yu Zhao wrote:
> > On Fri, Oct 6, 2023 at 2:08 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 29.09.23 13:44, Ryan Roberts wrote:
> >>> In addition to passing a bitfield of folio orders to enable for THP,
> >>> allow the string "recommend" to be written, which has the effect of
> >>> causing the system to enable the orders preferred by the architecture
> >>> and by the mm. The user can see what these orders are by subsequently
> >>> reading back the file.
> >>>
> >>> Note that these recommended orders are expected to be static for a given
> >>> boot of the system, and so the keyword "auto" was deliberately not used,
> >>> as I want to reserve it for a possible future use where the "best" order
> >>> is chosen more dynamically at runtime.
> >>>
> >>> Recommended orders are determined as follows:
> >>>    - PMD_ORDER: The traditional THP size
> >>>    - arch_wants_pte_order() if implemented by the arch
> >>>    - PAGE_ALLOC_COSTLY_ORDER: The largest order kept on per-cpu free list
> >>>
> >>> arch_wants_pte_order() can be overridden by the architecture if desired.
> >>> Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
> >>> set of ptes map physically contiguous, naturally aligned memory, so this
> >>> mechanism allows the architecture to optimize as required.
> >>>
> >>> Here we add the default implementation of arch_wants_pte_order(), used
> >>> when the architecture does not define it, which returns -1, implying
> >>> that the HW has no preference.
> >>>
> >>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >>> ---
> >>>   Documentation/admin-guide/mm/transhuge.rst |  4 ++++
> >>>   include/linux/pgtable.h                    | 13 +++++++++++++
> >>>   mm/huge_memory.c                           | 14 +++++++++++---
> >>>   3 files changed, 28 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> >>> index 732c3b2f4ba8..d6363d4efa3a 100644
> >>> --- a/Documentation/admin-guide/mm/transhuge.rst
> >>> +++ b/Documentation/admin-guide/mm/transhuge.rst
> >>> @@ -187,6 +187,10 @@ pages (=16K if the page size is 4K). The example above enables order-9
> >>>   By enabling multiple orders, allocation of each order will be
> >>>   attempted, highest to lowest, until a successful allocation is made.
> >>>   If the PMD-order is unset, then no PMD-sized THPs will be allocated.
> >>> +It is also possible to enable the recommended set of orders, which
> >>> +will be optimized for the architecture and mm::
> >>> +
> >>> +     echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders
> >>>
> >>>   The kernel will ignore any orders that it does not support so read the
> >>>   file back to determine which orders are enabled::
> >>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >>> index af7639c3b0a3..0e110ce57cc3 100644
> >>> --- a/include/linux/pgtable.h
> >>> +++ b/include/linux/pgtable.h
> >>> @@ -393,6 +393,19 @@ static inline void arch_check_zapped_pmd(struct vm_area_struct *vma,
> >>>   }
> >>>   #endif
> >>>
> >>> +#ifndef arch_wants_pte_order
> >>> +/*
> >>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
> >>> + * PMD_ORDER) and must not be order-1 since THP requires large folios to be at
> >>> + * least order-2. Negative value implies that the HW has no preference and mm
> >>> + * will choose its own default order.
> >>> + */
> >>> +static inline int arch_wants_pte_order(void)
> >>> +{
> >>> +     return -1;
> >>> +}
> >>> +#endif
> >>> +
> >>>   #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
> >>>   static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
> >>>                                      unsigned long address,
> >>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >>> index bcecce769017..e2e2d3906a21 100644
> >>> --- a/mm/huge_memory.c
> >>> +++ b/mm/huge_memory.c
> >>> @@ -464,10 +464,18 @@ static ssize_t anon_orders_store(struct kobject *kobj,
> >>>       int err;
> >>>       int ret = count;
> >>>       unsigned int orders;
> >>> +     int arch;
> >>>
> >>> -     err = kstrtouint(buf, 0, &orders);
> >>> -     if (err)
> >>> -             ret = -EINVAL;
> >>> +     if (sysfs_streq(buf, "recommend")) {
> >>> +             arch = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
> >>> +             orders = BIT(arch);
> >>> +             orders |= BIT(PAGE_ALLOC_COSTLY_ORDER);
> >>> +             orders |= BIT(PMD_ORDER);
> >>> +     } else {
> >>> +             err = kstrtouint(buf, 0, &orders);
> >>> +             if (err)
> >>> +                     ret = -EINVAL;
> >>> +     }
> >>>
> >>>       if (ret > 0) {
> >>>               orders &= THP_ORDERS_ALL_ANON;
> >>
> >> :/ don't really like that. Regarding my proposal, one could have
> >> something like that in an "auto" setting for the "enabled" value, or a
> >> "recommended" setting [not sure].
> >
> > Me either.
> >
> > Again this is something I call random --  we only discussed "auto",
> > and yes, the commit message above explained why "recommended" here but
> > it has never surfaced in previous discussions, has it?
>
> The context in which we discussed "auto" was for a future aspiration to
> automatically determine the order that should be used for a given allocation to
> balance perf vs internal fragmentation.
>
> The case we are talking about here is completely different; I had a pre-existing
> feature from previous versions of the series, which would allow the arch to
> specify its preferred order (originally proposed by Yu, IIRC). In moving the
> allocation size decision to user space, I felt that we still needed a mechanism
> whereby the arch could express its preference. And "recommend" is what I came up
> with.
>
> All of the friction we are currently having is around this feature, I think?
> Certainly all the links you provided in the other thread all point to
> conversations skirting around it. How about I just drop it for this initial
> patch set? Just let user space decide what sizes it wants (per David's interface
> proposal)? I can see I'm trying to get a square peg into a round hole.

Yes, and I think I've been fairly clear since the beginning: why can't
the initial patchset only have what we agreed on so that it can get
merged asap?

Since we haven't agreed on any ABI changes (sysfs, stats, etc.),
debugfs (Suren @ Android), boot parameters, etc., Kconfig is the only
mergeable option at the moment. To answer your questions [1][2], i.e.,
why "a compile time option": it's not to make *my testing* easier;
it's for *your series* to make immediate progress.

[1] https://lore.kernel.org/mm-commits/137d2fc4-de8b-4dda-a51d-31ce6b29a3d0@arm.com/
[2] https://lore.kernel.org/mm-commits/316054fd-0acb-4277-b9da-d21f0dae2d29@arm.com/

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 4/9] mm: thp: Introduce anon_orders and anon_always_mask sysfs files
  2023-10-07 22:54       ` Michael Ellerman
  (?)
@ 2023-10-10  0:20         ` Andrew Morton
  -1 siblings, 0 replies; 140+ messages in thread
From: Andrew Morton @ 2023-10-10  0:20 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Ryan Roberts, Aneesh Kumar K.V, Matthew Wilcox, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Anshuman Khandual,
	Yang Shi, Huang, Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins, linux-mm, linux-kernel,
	linux-arm-kernel, linuxppc-dev

On Sun, 08 Oct 2023 09:54:22 +1100 Michael Ellerman <mpe@ellerman.id.au> wrote:

> > I don't know why powerpc's PTE_INDEX_SIZE is variable.
> 
> To allow a single vmlinux to boot using either the Hashed Page Table
> MMU, or Radix Tree MMU, which have different page table geometry.
> 
> That's a pretty crucial feature for distros, so that they can build a
> single kernel to boot on Power8/9/10.

Dumb question: why can't distros ship two kernels and have the boot
loader (or something else) pick the appropriate one?

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 6/9] mm: thp: Add "recommend" option for anon_orders
  2023-10-09 20:04           ` Yu Zhao
@ 2023-10-10 10:16             ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-10-10 10:16 UTC (permalink / raw)
  To: Yu Zhao
  Cc: David Hildenbrand, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	Catalin Marinas, Anshuman Khandual, Yang Shi, Huang, Ying,
	Zi Yan, Luis Chamberlain, Itaru Kitayama, Kirill A. Shutemov,
	John Hubbard, David Rientjes, Vlastimil Babka, Hugh Dickins,
	linux-mm, linux-kernel, linux-arm-kernel, Suren Baghdasaryan

On 09/10/2023 21:04, Yu Zhao wrote:
> On Mon, Oct 9, 2023 at 5:45 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 06/10/2023 23:28, Yu Zhao wrote:
>>> On Fri, Oct 6, 2023 at 2:08 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 29.09.23 13:44, Ryan Roberts wrote:
>>>>> In addition to passing a bitfield of folio orders to enable for THP,
>>>>> allow the string "recommend" to be written, which has the effect of
>>>>> causing the system to enable the orders preferred by the architecture
>>>>> and by the mm. The user can see what these orders are by subsequently
>>>>> reading back the file.
>>>>>
>>>>> Note that these recommended orders are expected to be static for a given
>>>>> boot of the system, and so the keyword "auto" was deliberately not used,
>>>>> as I want to reserve it for a possible future use where the "best" order
>>>>> is chosen more dynamically at runtime.
>>>>>
>>>>> Recommended orders are determined as follows:
>>>>>    - PMD_ORDER: The traditional THP size
>>>>>    - arch_wants_pte_order() if implemented by the arch
>>>>>    - PAGE_ALLOC_COSTLY_ORDER: The largest order kept on per-cpu free list
>>>>>
>>>>> arch_wants_pte_order() can be overridden by the architecture if desired.
> >>>>> Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
> >>>>> set of ptes map physically contiguous, naturally aligned memory, so this
>>>>> mechanism allows the architecture to optimize as required.
>>>>>
>>>>> Here we add the default implementation of arch_wants_pte_order(), used
>>>>> when the architecture does not define it, which returns -1, implying
>>>>> that the HW has no preference.
>>>>>
>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>> ---
>>>>>   Documentation/admin-guide/mm/transhuge.rst |  4 ++++
>>>>>   include/linux/pgtable.h                    | 13 +++++++++++++
>>>>>   mm/huge_memory.c                           | 14 +++++++++++---
>>>>>   3 files changed, 28 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
>>>>> index 732c3b2f4ba8..d6363d4efa3a 100644
>>>>> --- a/Documentation/admin-guide/mm/transhuge.rst
>>>>> +++ b/Documentation/admin-guide/mm/transhuge.rst
>>>>> @@ -187,6 +187,10 @@ pages (=16K if the page size is 4K). The example above enables order-9
>>>>>   By enabling multiple orders, allocation of each order will be
>>>>>   attempted, highest to lowest, until a successful allocation is made.
>>>>>   If the PMD-order is unset, then no PMD-sized THPs will be allocated.
>>>>> +It is also possible to enable the recommended set of orders, which
>>>>> +will be optimized for the architecture and mm::
>>>>> +
>>>>> +     echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders
>>>>>
>>>>>   The kernel will ignore any orders that it does not support so read the
>>>>>   file back to determine which orders are enabled::
>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>> index af7639c3b0a3..0e110ce57cc3 100644
>>>>> --- a/include/linux/pgtable.h
>>>>> +++ b/include/linux/pgtable.h
>>>>> @@ -393,6 +393,19 @@ static inline void arch_check_zapped_pmd(struct vm_area_struct *vma,
>>>>>   }
>>>>>   #endif
>>>>>
>>>>> +#ifndef arch_wants_pte_order
>>>>> +/*
>>>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
>>>>> + * PMD_ORDER) and must not be order-1 since THP requires large folios to be at
>>>>> + * least order-2. Negative value implies that the HW has no preference and mm
> >>>>> + * will choose its own default order.
>>>>> + */
>>>>> +static inline int arch_wants_pte_order(void)
>>>>> +{
>>>>> +     return -1;
>>>>> +}
>>>>> +#endif
>>>>> +
>>>>>   #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>>>>   static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>>>>                                      unsigned long address,
>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>> index bcecce769017..e2e2d3906a21 100644
>>>>> --- a/mm/huge_memory.c
>>>>> +++ b/mm/huge_memory.c
>>>>> @@ -464,10 +464,18 @@ static ssize_t anon_orders_store(struct kobject *kobj,
>>>>>       int err;
>>>>>       int ret = count;
>>>>>       unsigned int orders;
>>>>> +     int arch;
>>>>>
>>>>> -     err = kstrtouint(buf, 0, &orders);
>>>>> -     if (err)
>>>>> -             ret = -EINVAL;
>>>>> +     if (sysfs_streq(buf, "recommend")) {
>>>>> +             arch = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>>>>> +             orders = BIT(arch);
>>>>> +             orders |= BIT(PAGE_ALLOC_COSTLY_ORDER);
>>>>> +             orders |= BIT(PMD_ORDER);
>>>>> +     } else {
>>>>> +             err = kstrtouint(buf, 0, &orders);
>>>>> +             if (err)
>>>>> +                     ret = -EINVAL;
>>>>> +     }
>>>>>
>>>>>       if (ret > 0) {
>>>>>               orders &= THP_ORDERS_ALL_ANON;
>>>>
>>>> :/ don't really like that. Regarding my proposal, one could have
>>>> something like that in an "auto" setting for the "enabled" value, or a
>>>> "recommended" setting [not sure].
>>>
>>> Me either.
>>>
>>> Again this is something I call random --  we only discussed "auto",
>>> and yes, the commit message above explained why "recommended" here but
>>> it has never surfaced in previous discussions, has it?
>>
>> The context in which we discussed "auto" was for a future aspiration to
>> automatically determine the order that should be used for a given allocation to
>> balance perf vs internal fragmentation.
>>
>> The case we are talking about here is completely different; I had a pre-existing
>> feature from previous versions of the series, which would allow the arch to
>> specify its preferred order (originally proposed by Yu, IIRC). In moving the
>> allocation size decision to user space, I felt that we still needed a mechanism
>> whereby the arch could express its preference. And "recommend" is what I came up
>> with.
>>
>> All of the friction we are currently having is around this feature, I think?
>> Certainly all the links you provided in the other thread all point to
>> conversations skirting around it. How about I just drop it for this initial
>> patch set? Just let user space decide what sizes it wants (per David's interface
>> proposal)? I can see I'm trying to get a square peg into a round hole.
> 
> Yes, and I think I've been fairly clear since the beginning: why can't
> the initial patchset only have what we agreed on so that it can get
> merged asap?
> 
> Since we haven't agreed on any ABI changes (sysfs, stats, etc.),
> debugfs (Suren @ Android), boot parameters, etc., Kconfig is the only
> mergeable option at the moment. To answer your questions [1][2], i.e.,
> why "a compile time option": it's not to make *my testing* easier;
> it's for *your series* to make immediate progress.

My problem is that I need a mechanism to conditionally decide whether to
allocate a small-sized THP or just a single page; unconditionally doing it when
compiled in is a problem for the 16K and 64K base page cases, where the arm64
preferred small-sized THP is 2M. I need a way to solve this for the patch set to
be usable. All my attempts to do it without introducing ABI have been rejected
(I'm not complaining about that - I understand the reasons). So I'm now relying
on ABI to solve it - I think we need to sort that in order to submit.
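
For anyone not familiar with the arm64 geometry, a quick back-of-envelope
illustration (user-space C, purely illustrative; the contiguous-PTE entry
counts are my understanding of the architecture, so treat them as assumptions)
of why the contpte-preferred size jumps to 2M on the larger granules:

#include <stdio.h>

int main(void)
{
	struct { unsigned long base_kb; unsigned long cont_ptes; } g[] = {
		{  4,  16 },	/*  4K granule:  16 contiguous PTEs ->   64K */
		{ 16, 128 },	/* 16K granule: 128 contiguous PTEs -> 2048K */
		{ 64,  32 },	/* 64K granule:  32 contiguous PTEs -> 2048K */
	};
	int i;

	for (i = 0; i < 3; i++)
		printf("%2luK pages: contpte folio = %lu KB\n",
		       g[i].base_kb, g[i].base_kb * g[i].cont_ptes);
	return 0;
}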

We've also agreed that there is a list of prerequisite items that need to be
completed before this can be merged (please do chime in if you think that list
is wrong or unnecessary), so we can use that time to discuss the ABI in parallel.

> 
> [1] https://lore.kernel.org/mm-commits/137d2fc4-de8b-4dda-a51d-31ce6b29a3d0@arm.com/
> [2] https://lore.kernel.org/mm-commits/316054fd-0acb-4277-b9da-d21f0dae2d29@arm.com/


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-10-09 16:22       ` David Hildenbrand
@ 2023-10-10 10:47         ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-10-10 10:47 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: linux-mm, linux-kernel, linux-arm-kernel

On 09/10/2023 17:22, David Hildenbrand wrote:
> [...]
> 
>>>
>>> I dislike exposing "orders" to the users, I'm happy to be convinced why I am
>>> wrong and it is a good idea.
>>>
>>> So maybe "Small THP"/"Small-sized THP" is better. Or "Medium-sized THP" -- as
>>> said, I think FreeBSD tends to call it "Medium-sized superpages". But what's
>>> small/medium/large is debatable. "Small" implies at least that it's smaller than
>>> what we used to know, which is a fact.
>>>
>>> Can we also now use the terminology consistently? (e.g., "variable-order, large
>>> folios for anonymous memory" -> "Small-sized anonymous THP", you can just point
>>> at the previous patch set name in the cover letter)
>>
>> Yes absolutely. FWIW, I was deliberately not changing the title of the patchset
>> so people could easily see it was an evolution of something posted before. But
>> if it's the norm to change the title as the patchset evolves, I'm very happy to
>> do that. And there are other places too, in commit logs that I can tidy up. I
>> will assume "PMD-sized THP", "small-sized THP" and "anonymous small-sized THP"
>> (that last one slightly different from what David suggested above - it means
>> "small-sized THP" can still be grepped) unless others object.
> 
> Absolutely fine with me. Hoping other people will object when I talk nonsense or
> my suggestions don't make any sense.
> 
> Or even better, propose something better :)
> 
>>
>>>
>>>>
>>>> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
>>>>      pages, there are efficiency savings to be had; fewer page faults,
>>>> batched PTE
>>>>      and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
>>>>      overhead. This should benefit all architectures.
>>>> 2) Since we are now mapping physically contiguous chunks of memory, we can take
>>>>      advantage of HW TLB compression techniques. A reduction in TLB pressure
>>>>      speeds up kernel and user space. arm64 systems have 2 mechanisms to
>>>> coalesce
>>>>      TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>>>>
>>>> The major change in this revision is the addition of sysfs controls to allow
>>>> this "small-order THP" to be enabled/disabled/configured independently of
>>>> PMD-order THP. The approach I've taken differs a bit from previous discussions;
>>>> instead of creating a whole new interface ("large_folio"), I'm extending THP. I
>>>> personally think this makes things clearer and more extensible. See [6] for
>>>> detailed rationale.
>>>
>>> Change 2: sysfs interface.
>>>
>>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
>>> agree.
>>>
>>> What we expose there and how, is TBD. Again, not a friend of "orders" and
>>> bitmaps at all. We can do better if we want to go down that path.
>>>
>>> Maybe we should take a look at hugetlb, and how they added support for multiple
>>> sizes. What *might* make sense could be (depending on which values we actually
>>> support!)
>>>
>>>
>>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>>>
>>> Each one would contain an "enabled" and "defrag" file. We want something minimal
>>> first? Start with the "enabled" option.
>>>
>>>
>>> enabled: always [global] madvise never
>>>
>>> Initially, we would set it for PMD-sized THP to "global" and for everything else
>>> to "never".
>>
>> My only reservation about this approach is the potential for a future need for a
>> "one setting applied across all sizes" class of control (e.g. "auto"). I think
>> we agreed in the previous meetings that chasing a solution for "auto" was a good
>> aspiration to have, so it would be good to have a place where we can insert that in
>> future. The main reason why I chose to expose the "anon_orders" control is
>> because it is possible to both enable/disable the various sizes as well as
>> specify (e.g.) "auto", without creating redundancy. But I agree that ideally
>> we wouldn't expose orders to the user; I was attempting a compromise to simplify
>> the "auto" case.
>>
>> A potential (though feels quite complex) solution to make auto work with your
>> proposal: Add "auto" as an option to the existing global enabled file, and to
>> all of your proposed new enabled files. But it's only possible to *set* auto
>> through the global file. And when it is set, all of the size-specific enabled
>> files read back "auto" too. And any writes to the size-specific enabled files
>> are ignored (or remembered but not enacted) until the global enabled file is
>> changed away from auto.
> 
> Yes, I think there are various ways forward regarding that. Or to enable "auto"
> mode only once all are "auto", and as soon as one is not "auto", just disable
> it. A simple
> 
> echo "auto" > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled

I'm not really a fan, because this implies that you have a period where "auto"
is reported for a size, but it's not really in "auto" mode yet.

> 
> Would do to enable it. Or, have them all be "global" and have a global "auto"
> mode as you raised.
> 
> echo "global" > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled
> echo "auto" > /sys/kernel/mm/transparent_hugepage/enabled
> 

Again, this isn't atomic either. I tend to prefer my proposal because it
switches atomically - there are no weird intermediate states. Anyway, I guess
the important point is we have demonstrated that your proposed interface could
be extended to support "auto" in future, should we need it.
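
To illustrate the scheme quoted above (a toy model in plain C, not a patch;
all names are invented), the per-size setting is remembered but only takes
effect while the global control is not "auto":

enum thp_policy { THP_NEVER, THP_MADVISE, THP_ALWAYS, THP_GLOBAL, THP_AUTO };

struct thp_controls {
	enum thp_policy global;		/* transparent_hugepage/enabled */
	enum thp_policy per_size[8];	/* hugepages-<size>kB/enabled, remembered */
};

static enum thp_policy effective_policy(const struct thp_controls *c, int idx)
{
	if (c->global == THP_AUTO)
		return THP_AUTO;	/* per-size writes saved but not enacted */
	return c->per_size[idx];
}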

>>
>> But I'm not sure if adding a new option to the global enabled file might break
>> compat?
> 
> I think we used to extend the "defrag" option, see
> 
> commit 21440d7eb9044001b7fdb71d0163689f60a0f2a1
> Author: David Rientjes <rientjes@google.com>
> Date:   Wed Feb 22 15:45:49 2017 -0800
> 
>     mm, thp: add new defer+madvise defrag option
> 
> 
> So I suspect we could extend that one in a similar way.
> 
> But again, this is just the thing that came to mind when thinking about how to:
> a) avoid orders
> b) make it configurable and future-proof
> c) make it look a bit consistent with other interfaces (hugetlb and
>    existing thp)
> d) still prepare for an auto mode that we want in the future
> 
> I'm happy to hear other ideas.
> 
>>
>>>
>>>
>>>
>>> That sounds reasonable at least to me, and we would be using what we learned
>>> from THP (as John suggested).  That still gives reasonable flexibility without
>>> going too wild, and a better IMHO interface.
>>>
>>> I understand Yu's point about ABI discussions and "0 knobs". I'm happy as long
>>> as we can have something that won't hurt us later and still be able to use this
>>> in distributions within a reasonable timeframe. Enabling/disabling individual
>>> sizes does not sound too restrictive to me. And we could always add an "auto"
>>> setting later and default to that with a new kconfig knob.
>>>
>>> If someone wants to configure it, why not. Let's just prepare a way to to handle
>>> this "better" automatically in the future (if ever ...).
>>>
>>>
>>> Change 3: Stats
>>>
>>>> /proc/meminfo:
>>>>     Introduce new "AnonHugePteMap" field, which reports the amount of
>>>>     memory (in KiB) mapped from large folios globally (similar to
>>>>     AnonHugePages field).
>>>
>>> AnonHugePages is and remains "PMD-sized THP that is mapped using a PMD", I think
>>> we all agree on that. It should have been named "AnonPmdMapped" or
>>> "AnonHugePmdMapped", too bad, we can't change that.
>>
>> Yes agreed. I did consider redefining "AnonHugePages" to cover PMD- and
>> PTE-mapped memory, then introduce both an "AnonHugePmdMapped" and
>> "AnonHugePteMapped", but I think that would likely break things. Its further
>> complicated because vmstats prints it in PMD-size units, so can't represent
>> PTE-mapped memory in that counter.
> 
> :/
> 
>>
>>>
>>> "AnonHugePteMap" better be "AnonHugePteMapped".
>>
>> I agree, but I went with the shorter one because any longer and it would unalign
>> the value e.g:
>>
>>      AnonHugePages:         0 kB
>>      AnonHugePteMapped:        0 kB
>>      ShmemPmdMapped:        0 kB
>>      Shared_Hugetlb:        0 kB
>>
> 
> Can't that be handled? We surely have long stuff in there:
> 
> HardwareCorrupted:     0 kB
> AnonHugePages:         0 kB
> ShmemHugePages:  1081344 kB
> ShmemPmdMapped:        0 kB
> 
> HardwareCorrupted has same length as AnonHugePteMapped

HardwareCorrupted is special cased and has a field length of 5, so the largest
value you can represent before it gets pushed out is ~97MB. I imagine that's
plenty for HardwareCorrupted, but likely not enough for AnonHugePteMapped. The
standard field size is 8, which provides for ~95GB before becoming unaligned.
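
A quick sanity check on those numbers (user-space, illustrative only): a
5-character value field caps out at 99999 kB and an 8-character field at
99999999 kB:

#include <stdio.h>

int main(void)
{
	/* largest values that still fit the meminfo field widths above */
	printf("width 5: %5u kB = %.1f MiB\n", 99999u, 99999.0 / 1024);
	printf("width 8: %8u kB = %.1f GiB\n", 99999999u,
	       99999999.0 / (1024.0 * 1024.0));
	return 0;
}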

> 
> But I'm not convinced about "AnonHugePteMapped" yet :)

OK

> 
>> So would need to decide which is preferable, or come up with a shorter name.
>>
>>>
>>> But, I wonder if we want to expose this "PteMapped" to user space *at all*. Why
>>> should they care if it's PTE mapped? For PMD-sized THP it makes a bit of sense,
>>> because !PMD implied !performance, and one might have been able to troubleshoot
>>> that somehow. For PTE-mapped, it doesn't make much sense really, they are always
>>> PTE-mapped.
>>
>> I disagree; I've been using it a lot to debug performance issues. It tells you
>> how much of your anon memory is allocated with large folios. And making that
>> percentage bigger improves performance; fewer page faults, and with a separate
>> contpte series on arm64, better use of the TLB. Reasons might include; poorly
>> aligned/too small VMAs, memory fragmentation preventing allocation, CoW, etc.
> 
> Just because a small-sized THP is PTE-mapped doesn't tell you anything, really.
> What you want to know is if it is "completely" and "consecutively" mapped such
> that the HW can actually benefit from it -- if HW even supports it. So
> "PTE-mapped THP" is just part of the story. And that's where it gets tricky I
> think.
> 
> I agree that it's good for debugging, but then maybe it should a) live somewhere
> else (debugfs, bucketing below) and b) be consistent with other THPs, meaning we
> also want similar stats somewhere.
> 
> One idea would be to expose such stats in a R/O fashion like "nr_allocated" or
> "nr_hugepages" in /sys/kernel/mm/transparent_hugepage/hugepages-64kB/ and
> friends. Of course, maybe tagging them with "anon" prefix.

I see your point, but I don't completely agree with it all. That said, given
your conclusion at the bottom, perhaps we should park the discussion about the
accounting for a separate series in future? Then we can focus on the ABI?

> 
>>
>> I would actually argue for adding similar counters for file-backed memory too
>> for the same reasons. (I actually posted an independent patch a while back that
>> did this for file- and anon- memory, bucketed by size. But I think the idea of
>> the bucketing was NAKed.
> For debugging, I *think* it might be valuable to see how many THP of each size
> are allocated. Tracking exactly "how is it mapped" is not easy to achieve as we
> learned. PMD-mapped was easy, but also requires us to keep doing that tracking
> for all eternity ...
> 
> Do you have a pointer to the patch set? Did it try to squeeze it into
> /proc/meminfo?

I was actually only working on smaps/smaps_rollup, which has been my main tool
for debugging. Patches at [1].

[1] https://lore.kernel.org/linux-mm/20230613160950.3554675-1-ryan.roberts@arm.com/

> 
>>
>>>
>>> That also raises the question how you would account a PTE-mapped THP. The hole
>>> thing? Only the parts that are mapped? Let's better not go down that path.
>>
>> The approach I've taken in this series is the simple one - account every page
>> that belongs to a large folio from when it is first mapped to last unmapped.
>> Yes, in this case, you might not actually be mapping the full thing
>> contiguously. But it gives a good indication.
>>
>> I also considered accounting the whole folio only when all of its pages become
>> mapped (although not worrying about them all being contiguous). That's still
>> simple to implement for all counters except smaps. So went with the simplest
>> approach with the view that its "good enough".
> 
> If you take a look at "ShmemHugePages" and "FileHugePages", there we actually
> track them when they get allocated+freed, which is much easier than tracking
> when/how they are (un)mapped. But it's only done for PMD-sized THP for now.
> 
>>
>>>
>>> That leaves the question why we would want to include them here at all in a
>>> special PTE-mapped way?
>>>
>>>
>>> Again, let's look at hugetlb: I prepared 1 GiB and one 2 MiB page.
>>>
>>> HugePages_Total:       1
>>> HugePages_Free:        1
>>> HugePages_Rsvd:        0
>>> HugePages_Surp:        0
>>> Hugepagesize:       2048 kB
>>> Hugetlb:         1050624 kB
>>>
>>> -> Only the last one gives the sum, the other stats don't even mention the other
>>> ones. [how do we get their stats, if at all?]
>>
>> There are some files in /sys/kernel/mm/hugepages/hugepages-XXkB and
>> /sys/devices/system/node/node*/hugepages/; nr_hugepages, free_hugepages,
>> surplus_hugepages. But this interface also constitutes the allocator, not just
>> stats, I think.
> 
> Ah, I missed that we expose free vs. reserved vs. surpluse ... there as well; I
> thought we would only have "nr_hugepages".
> 
>>
>>>
>>> So maybe, we only want a summary of how many anon huge pages of any size are
>>> allocated (independent of the PTE vs. PMD mapping),
>>
>> Are you proposing (AnonHugePages + AnonHugePteMapped) here or something else? If
>> the former, then I don't really see the difference. We have to continue to
>> expose PMD-size (AnonHugePages). So either add PTE-only counter, and derive the
>> total, or add a total counter and derive PTE-only. I suspect I've misunderstood
>> your point.
> 
> I don't think we should go down the "PteMapped" path. Probably we want
> "bucketing" stats as you said, and maybe a global one that just combines
> everything (any THP). But naming will be difficult.
> 
>>
>>> and some other source to
>>> eventually inspect how the different sizes behave.
>>>
>>> But note that for non-PMD-sized file THP we don't even have special counters!
>>> ... so maybe we should also defer any such stats and come up with something
>>> uniform for all types of non-PMD-sized THP.
>>
>> Indeed, I can see benefit in adding these for file THP - in fact I have a patch
>> that does exactly that to help my development work. I had envisaged that we
>> could add something like FileHugePteMapped, ShmemHugePteMapped that would follow
>> the same semantics as AnonHugePteMapped.
> 
> Again, maybe we can find something that does not involve the "PteMapped"
> terminology and just gives us a big total of "allocated" THP. For detailed stats
> for debugging, maybe we can just use a different interface then.
> 
>>
>>>
>>>
>>> Same discussion applies to all other stats.
>>>
>>>
>>>>
>>>> Because we now have runtime enable/disable control, I've removed the compile
>>>> time Kconfig switch. It still defaults to runtime-disabled.
>>>>
>>>> NOTE: These changes should not be merged until the prerequisites are complete.
>>>> These are in progress and tracked at [7].
>>>
>>> We should probably list them here, and classify which ones we see as a strict
>>> requirement, and which ones might be an optimization.
>>
>>
>> I'll need some help with classifying them, so showing my working. Final list that
>> I would propose as strict requirements at bottom.
>>
>> This is my list with status, as per response to Yu in other thread:
>>
>>    - David is working on "shared vs exclusive mappings"
> 
> Probably "COW reuse support" is a separate item, although my approach would
> cover that.

Yeah, that's in the original thread as (2), but I thought we all agreed
that it is not a prerequisite, so I didn't bring it over here.

> 
> The question is whether the estimate we're using in most code for now would at least
> be sufficient to merge it. The estimate is easily wrong, but we do have that
> issue with PTE-mapped THP already.

Well as I understand it, at least the NUMA balancing code and khugepaged are
ignoring all folios > order-0. That's probably ok for the occasional PTE-mapped
THP, but I assume it becomes untenable if large folios are the norm. Perhaps we
can modify those paths to work with the current estimators in order to remove
your work from the critical path - is that what you are getting at?
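
Something like the below is the kind of conservative check I had in mind for
those paths (illustrative only, not from any posted series), leaning on the
existing estimate rather than exact per-folio tracking:

/*
 * Illustrative helper: decide whether a folio looks exclusively mapped.
 * For large folios this relies on the first-subpage estimate, which can
 * be wrong, so it should only be used where a wrong answer is merely
 * suboptimal rather than incorrect.
 */
static bool folio_probably_exclusive(struct folio *folio)
{
	if (!folio_test_large(folio))
		return folio_mapcount(folio) == 1;

	return folio_estimated_sharers(folio) == 1;
}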

> 
> But that argument probably applies to most things here: the difference is that
> PTE-mapped THP are not the default, that's why nobody really cared.
> 
> [I'm playing with an almost-lockless scheme right now and hope I have something
> running soonish -- as you know, I got distracted]
> 
>>    - Zi Yan has posted an RFC for compaction
>>    - Yin Fengwei's mlock series is now in mm-stable
>>    - Yin Fengwei's madvise series is in 6.6
>>    - I've reworked and posted a series for deferred_split_folio; although I've
>>      deprioritised it because Yu said it wasn't really a pre-requisite.
>>    - numa balancing depends on David's "shared vs exclusive mappings" work
>>    - I've started looking at "large folios in swap cache" in the background,
>>      because I'm seeing some slow down with large folios, but we also agreed that
>>      wasn't a prerequisite
>>
> 
> Probably it would be good to talk about the items and how we would classify them
> in a meeting.

Perhaps we can get a slot in Matthew's meeting tomorrow?

> 
> 
>> Although, since sending that, I've determined that when running kernel
>> compilation across a high number of cores on arm64, the cost of splitting the
>> folios gets large due to needing to broadcast the extra TLBIs. So I think the
>> last point on that list may be a prerequisite after all. (I've been able to fix
>> this by adding support for allocating large folios in the swap file, and
>> avoiding the split - planning to send RFC this week).
>>
>> There is also this set of things that you mentioned against "shared vs exclusive
>> mappings", which I'm not sure if you are planning to cover as part of your work
>> or if they are follow on things that will need to be done:
>>
>> (1) Detecting shared folios, to not mess with them while they are shared.
>>      MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ...
>>      replace cases where folio_estimated_sharers() == 1 would currently be the
>>      best we can do (and in some cases, page_mapcount() == 1).
>>
>> And I recently discovered that khugepaged doesn't collapse file-backed pages to
>> a PMD-size THP if they belong to a large folio, so I'm guessing it may also
>> suffer the same behaviour for anon memory. I'm not sure if that's what your
>> "khugepaged ..." comment refers to?
> 
> Yes. But I did not look into all the details yet.
> 
> "khugepaged" collapse support to small-sized THP is probably also a very important
> item, although it might be less relevant than for PMD -- and I consider it
> future work. See below.

Yes I agree that it's definitely future work. Nothing regresses from today's
performance if you don't have that.

> 
>>
>> So taking all that and trying to put together a complete outstanding list for
>> strict requirements:
>>
>>    - Shared vs Exclusive Mappings (DavidH)
>>        - user-triggered page migration
>>        - NUMA hinting/balancing
>>        - Enhance khugepaged to collapse to PMD-size from PTE-mapped large folios
>>    - Compaction of Large Folios (Zi Yan)
>>    - Swap out small-size THP without Split (Ryan Roberts)
> 
> ^ that's going to be tough, I can promise. And the only way to live without that
> would be khugepaged support. (because that's how it's all working for PMD-sized
> THP after all!)

Are you referring specifically to the "swap out" line there? If so, it wasn't my
plan that we would *swap in* large folios - only swap them *out* as large folios
to avoid the cost of splitting. Then when they come back in, they come in as
single pages, just like PMD-sized THP, if I've understood things correctly. I
have a patch working and showing the perf improvement as a result. I'm planning
to post an RFC today, hopefully.

I don't see the swap-in side as a problem for the initial patch set. OK, they
come in as single pages, so you lost the potential TLB benefits. But that's no
worse than today's performance so not a regression. And the ratio of SW savings
on THP allocation to HW savings from the TLB is very different for small-sized
THP; much more of the benefit comes from the SW and that's still there.

> 
> Once a PMD-sized THP was swapped out and evicted, it will always come back as
> order-0 folios. khugepaged will re-collapse them into PMD-sized chunks. If we could do
> that for PTE-sized THP as well ...

Yes, sure, but that's a future improvement, not a requirement to prevent
regression vs today, right?

> 
>>
>>
>>>
>>>
>>> Now, these are just my thoughts, and I'm happy about other thoughts.
>>
>> As always, thanks for taking the time - I really appreciate it.
> 
> Sure. Hoping others can comment.
> 
> My gut feeling is that it's best to focus on getting the sysfs interface
> right+future proof and handling the stats independently. While being a good
> debug mechanism, I wouldn't consider these stats a requirement: we don't have
> them for file/shmem small-sized thp so far as well.
> 
> So maybe really better to handle the stats in meminfo and friends separately.
> 

I'd be very happy with that approach if others are bought in.

Thanks,
Ryan




^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 4/9] mm: thp: Introduce anon_orders and anon_always_mask sysfs files
  2023-09-29 11:44   ` Ryan Roberts
@ 2023-10-11  6:02     ` kernel test robot
  -1 siblings, 0 replies; 140+ messages in thread
From: kernel test robot @ 2023-10-11  6:02 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Anshuman Khandual,
	Yang Shi, Huang, Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: oe-kbuild-all, Linux Memory Management List, Ryan Roberts,
	linux-kernel, linux-arm-kernel

Hi Ryan,

kernel test robot noticed the following build errors:

[auto build test ERROR on arm64/for-next/core]
[also build test ERROR on linus/master v6.6-rc5]
[cannot apply to akpm-mm/mm-everything tj-cgroup/for-next next-20231010]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Allow-deferred-splitting-of-arbitrary-anon-large-folios/20230929-194541
base:   https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core
patch link:    https://lore.kernel.org/r/20230929114421.3761121-5-ryan.roberts%40arm.com
patch subject: [PATCH v6 4/9] mm: thp: Introduce anon_orders and anon_always_mask sysfs files
config: powerpc-allmodconfig (https://download.01.org/0day-ci/archive/20231011/202310111302.ahYvNKX4-lkp@intel.com/config)
compiler: powerpc64-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231011/202310111302.ahYvNKX4-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202310111302.ahYvNKX4-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from include/linux/bits.h:6,
                    from include/linux/ratelimit_types.h:5,
                    from include/linux/printk.h:9,
                    from include/asm-generic/bug.h:22,
                    from arch/powerpc/include/asm/bug.h:116,
                    from include/linux/bug.h:5,
                    from include/linux/mmdebug.h:5,
                    from include/linux/mm.h:6,
                    from mm/huge_memory.c:8:
>> include/vdso/bits.h:7:33: error: initializer element is not constant
       7 | #define BIT(nr)                 (UL(1) << (nr))
         |                                 ^
   mm/huge_memory.c:73:47: note: in expansion of macro 'BIT'
      73 | unsigned int huge_anon_orders __read_mostly = BIT(PMD_ORDER);
         |                                               ^~~


vim +7 include/vdso/bits.h

3945ff37d2f48d Vincenzo Frascino 2020-03-20  6  
3945ff37d2f48d Vincenzo Frascino 2020-03-20 @7  #define BIT(nr)			(UL(1) << (nr))
cbdb1f163af2bb Andy Shevchenko   2022-11-28  8  #define BIT_ULL(nr)		(ULL(1) << (nr))
3945ff37d2f48d Vincenzo Frascino 2020-03-20  9  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
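
The error itself is the C rule that file-scope initializers must be
compile-time constants: on powerpc the page table geometry is selected at
runtime (see the PTE_INDEX_SIZE discussion further down), so PMD_ORDER -- and
therefore BIT(PMD_ORDER) -- is not a constant expression there. A standalone
illustration of the failure mode and of one common workaround, deferring the
assignment to an init function (names borrowed from the error above purely for
illustration; not necessarily how the series resolved it):

    #include <stdio.h>

    /* Stand-in for a value that is only known at runtime, like powerpc's
     * PMD_ORDER when the MMU type is chosen at boot. */
    static unsigned int runtime_pmd_order;

    /*
     * static unsigned int huge_anon_orders = 1u << runtime_pmd_order;
     * ...would fail at file scope with "initializer element is not constant".
     */
    static unsigned int huge_anon_orders;    /* zero for now, set during init */

    static void hugepage_init_orders(void)
    {
            runtime_pmd_order = 9;                       /* e.g. discovered at boot */
            huge_anon_orders = 1u << runtime_pmd_order;  /* runtime init is fine */
    }

    int main(void)
    {
            hugepage_init_orders();
            printf("huge_anon_orders = %#x\n", huge_anon_orders);
            return 0;
    }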

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 4/9] mm: thp: Introduce anon_orders and anon_always_mask sysfs files
  2023-10-10  0:20         ` Andrew Morton
  (?)
@ 2023-10-12  9:31           ` David Hildenbrand
  -1 siblings, 0 replies; 140+ messages in thread
From: David Hildenbrand @ 2023-10-12  9:31 UTC (permalink / raw)
  To: Andrew Morton, Michael Ellerman
  Cc: Ryan Roberts, Aneesh Kumar K.V, Matthew Wilcox, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins, linux-mm, linux-kernel,
	linux-arm-kernel, linuxppc-dev

On 10.10.23 02:20, Andrew Morton wrote:
> On Sun, 08 Oct 2023 09:54:22 +1100 Michael Ellerman <mpe@ellerman.id.au> wrote:
> 
>>> I don't know why powerpc's PTE_INDEX_SIZE is variable.
>>
>> To allow a single vmlinux to boot using either the Hashed Page Table
>> MMU, or Radix Tree MMU, which have different page table geometry.
>>
>> That's a pretty crucial feature for distros, so that they can build a
>> single kernel to boot on Power8/9/10.
> 
> Dumb question: why can't distros ship two kernels and have the boot
> loader (or something else) pick the appropriate one?

One answer I keep hearing over and over again is "test matrix 
explosion". So distros only do it when unavoidable: for example, when 
differing PAGE_SIZE is required (e.g., 4k vs 64k) or we're dealing with 
RT support (RT vs !RT).

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 4/9] mm: thp: Introduce anon_orders and anon_always_mask sysfs files
  2023-10-10  0:20         ` Andrew Morton
  (?)
@ 2023-10-12 11:07           ` Michael Ellerman
  -1 siblings, 0 replies; 140+ messages in thread
From: Michael Ellerman @ 2023-10-12 11:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ryan Roberts, Aneesh Kumar K.V, Matthew Wilcox, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Anshuman Khandual,
	Yang Shi, Huang, Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins, linux-mm, linux-kernel,
	linux-arm-kernel, linuxppc-dev

Andrew Morton <akpm@linux-foundation.org> writes:
> On Sun, 08 Oct 2023 09:54:22 +1100 Michael Ellerman <mpe@ellerman.id.au> wrote:
>
>> > I don't know why powerpc's PTE_INDEX_SIZE is variable.
>> 
>> To allow a single vmlinux to boot using either the Hashed Page Table
>> MMU, or Radix Tree MMU, which have different page table geometry.
>> 
>> That's a pretty crucial feature for distros, so that they can build a
>> single kernel to boot on Power8/9/10.
>
> Dumb question: why can't distros ship two kernels and have the boot
> loader (or something else) pick the appropriate one?

I'm not a grub expert, but AFAIK it doesn't support loading a different
kernel based on CPU/firmware features. I'm quite sure it can't do that
on powerpc at least.

We also have another bootloader (petitboot) that is still supported by
some distros, and can't do that.

The other problem is like David says, distros are generally reluctant to
add new kernel configurations unless they absolutely have to. It adds
more work for them, more things to track, and can confuse users leading
to spurious bug reports.

cheers

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-10-10 10:47         ` Ryan Roberts
@ 2023-10-13 20:14           ` David Hildenbrand
  -1 siblings, 0 replies; 140+ messages in thread
From: David Hildenbrand @ 2023-10-13 20:14 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: linux-mm, linux-kernel, linux-arm-kernel


>>
>> Yes, I think there are various ways forward regarding that. Or to enable "auto"
>> mode only once all are "auto", and as soon as one is not "auto", just disable
>> it. A simple
>>
>> echo "auto" > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled
> 
> I'm not really a fan, because this implies that you have a period where "auto"
> is reported for a size, but its not really in "auto" mode yet.

I think there are various alternatives that are feasible.

For most systems later, you'd want to just enable "auto" via a compile-time
CONFIG option as the default, or via some cmdline option like
"transparent_hugepage=auto".

> 
>>
>> Would do to enable it. Or, have them all be "global" and have a global "auto"
>> mode as you raised.
>>
>> echo "global" > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled
>> echo "auto" > /sys/kernel/mm/transparent_hugepage/enabled
>>
> 
> Again, this isn't atomic either. I tend to prefer my proposal because it
> switches atomically - there are no weird intermediate states. Anyway, I guess
> the important point is we have demonstrated that your proposed interface could
> be extended to support "auto" in future, should we need it.

I don't think the atomic switch is really relevant. But that's probably 
a separate discussion.

[...]

>>
>> Just because a small-sized THP is PTE-mapped doesn't tell you anything, really.
>> What you want to know is if it is "completely" and "consecutively" mapped such
>> that the HW can actually benefit from it -- if HW even supports it. So
>> "PTE-mapped THP" is just part of the story. And that's where it gets tricky I
>> think.
>>
>> I agree that it's good for debugging, but then maybe it should a) live somewhere
>> else (debugfs, bucketing below) and b) be consistent with other THPs, meaning we
>> also want similar stats somewhere.
>>
>> One idea would be to expose such stats in a R/O fashion like "nr_allocated" or
>> "nr_hugepages" in /sys/kernel/mm/transparent_hugepage/hugepages-64kB/ and
>> friends. Of course, maybe tagging them with "anon" prefix.
> 
> I see your point, but I don't completely agree with it all. That said, given
> your conclusion at the bottom, perhaps we should park the discussion about the
> accounting for a separate series in future? Then we can focus on the ABI?

Yes!

> 
>>
>>>
>>> I would actually argue for adding similar counters for file-backed memory too
>>> for the same reasons. (I actually posted an independent patch a while back that
>>> did this for file- and anon- memory, bucketed by size. But I think the idea of
>>> the bucketing was NAKed.
>> For debugging, I *think* it might be valuable to see how many THP of each size
>> are allocated. Tracking exactly "how is it mapped" is not easy to achieve as we
>> learned. PMD-mapped was easy, but also requires us to keep doing that tracking
>> for all eternity ...
>>
>> Do you have a pointer to the patch set? Did it try to squeeze it into
>> /proc/meminfo?
> 
> I was actually only working on smaps/smaps_rollup, which has been my main tool
> for debugging. patches at [1].
> 
> [1] https://lore.kernel.org/linux-mm/20230613160950.3554675-1-ryan.roberts@arm.com/
> 

Thanks for the pointer!

[...]

>>>
>>> I'll need some help with classifying them, so showing my working. Final list that
>>> I would propose as strict requirements at bottom.
>>>
>>> This is my list with status, as per response to Yu in other thread:
>>>
>>>     - David is working on "shared vs exclusive mappings"
>>
>> Probably "COW reuse support" is a separate item, although my approach would
>> cover that.
> 
> Yeah, that's in the original thread as (2), but I thought we all agreed that
> it is not a prerequisite, so I didn't bring it over here.

Agreed. Having a full list of todo items might be reasonable.

> 
>>
>> The question is, if the estimate we're using in most code for now would at least
>> be sufficient to merge it. The estimate is easily wrong, but we do have that
>> issue with PTE-mapped THP already.
> 
> Well as I understand it, at least the NUMA balancing code and khugepaged are
> ignoring all folios > order-0. That's probably ok for the occasional PTE-mapped
> THP, but I assume it becomes untenable if large folios are the norm. Perhaps we
> can modify those paths to work with the current estimators in order to remove
> your work from the critical path - is that what you are getting at?

IMHO most of the code that now uses the estimate-logic is really
suboptimal, but it's all we have. It's probably interesting to see where
the false negatives/positives are tolerable for now ... I hate to be on
the critical path ;) But I'm getting somewhere slowly but steadily
(slowly, because I'm constantly distracted -- and apparently sick).
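
To make the "ignoring all folios > order-0" point above concrete: the pattern
in those scan paths today is essentially an early bail-out whenever a large
folio is encountered, roughly of this shape (illustrative helper only, not the
exact khugepaged or NUMA-hinting code):

    /*
     * Illustrative only: current scan paths give up as soon as they meet a
     * large (order > 0) folio instead of reasoning about how it is mapped.
     */
    static bool scan_this_folio(struct folio *folio)
    {
            if (folio_test_large(folio))
                    return false;    /* large folio: currently just skipped */
            return true;             /* order-0: handled as before */
    }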

[...]

>>
>>
>>> Although, since sending that, I've determined that when running kernel
>>> compilation across a high number of cores on arm64, the cost of splitting the
>>> folios gets large due to needing to broadcast the extra TLBIs. So I think the
>>> last point on that list may be a prerequisite after all. (I've been able to fix
>>> this by adding support for allocating large folios in the swap file, and
>>> avoiding the split - planning to send RFC this week).
>>>
>>> There is also this set of things that you mentioned against "shared vs exclusive
>>> mappings", which I'm not sure if you are planning to cover as part of your work
>>> or if they are follow on things that will need to be done:
>>>
>>> (1) Detecting shared folios, to not mess with them while they are shared.
>>>       MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ...
>>>       replace cases where folio_estimated_sharers() == 1 would currently be the
>>>       best we can do (and in some cases, page_mapcount() == 1).
>>>
>>> And I recently discovered that khugepaged doesn't collapse file-backed pages to
>>> a PMD-size THP if they belong to a large folio, so I'm guessing it may also
>>> suffer the same behaviour for anon memory. I'm not sure if that's what your
>>> "khugepaged ..." comment refers to?
>>
>> Yes. But I did not look into all the details yet.
>>
>> "khugepaged" collapse support to small-sized THP is probably also a very important
>> item, although it might be less relevant than for PMD -- and I consider it
>> future work. See below.
> 
> Yes I agree that it's definitely future work. Nothing regresses from today's
> performance if you don't have that.
> 
>>
>>>
>>> So taking all that and trying to put together a complete outstanding list for
>>> strict requirements:
>>>
>>>     - Shared vs Exclusive Mappings (DavidH)
>>>         - user-triggered page migration
>>>         - NUMA hinting/balancing
>>>         - Enhance khugepaged to collapse to PMD-size from PTE-mapped large folios
>>>     - Compaction of Large Folios (Zi Yan)
>>>     - Swap out small-size THP without Split (Ryan Roberts)
>>
>> ^ that's going to be tough, I can promise. And the only way to live without that
>> would be khugepaged support. (because that's how it's all working for PMD-sized
>> THP after all!)
> 
> Are you referring specifically to the "swap out" line there? If so, it wasn't my
> plan that we would *swap in* large folios - only swap them *out* as large folios

Ah!

> to avoid the cost of splitting. Then when they come back in, they come in as
> single pages, just like PMD-sized THP, if I've understood things correctly. I
> have a patch working and showing the perf improvement as a result. I'm planning
> to post an RFC today, hopefully.
> 
> I don't see the swap-in side as a problem for the initial patch set. OK, they
> come in as single pages, so you lost the potential TLB benefits. But that's no
> worse than today's performance so not a regression. And the ratio of SW savings
> on THP allocation to HW savings from the TLB is very different for small-sized
> THP; much more of the benefit comes from the SW and that's still there.
> 
>>
>> Once a PMD-sized THP was swapped out and evicted, it will always come back as
>> order-0 folios. khugepaged will re-collapse them into PMD-sized chunks. If we could do
>> that for PTE-sized THP as well ...
> 
> Yes, sure, but that's a future improvement, not a requirement to prevent
> regression vs today, right?

Yes. It's just that, currently, as soon as you swap, the whole
benefit is gone because you end up with all order-0 pages.

These are certainly not limiting "merge" factors IMHO, but it's 
certainly one of the things users of distributions will heavily complain 
about ;)

PMD-sized THP are mostly self-healing in that sense.

> 
>>
>>>
>>>
>>>>
>>>>
>>>> Now, these are just my thoughts, and I'm happy about other thoughts.
>>>
>>> As always, thanks for taking the time - I really appreciate it.
>>
>> Sure. Hoping others can comment.
>>
>> My gut feeling is that it's best to focus on getting the sysfs interface
>> right+future proof and handling the stats independently. While being a good
>> debug mechanism, I wouldn't consider these stats a requirement: we don't have
>> them for file/shmem small-sized thp so far as well.
>>
>> So maybe really better to handle the stats in meminfo and friends separately.
>>
> 
> I'd be very happy with that approach if others are bought in.

Yeah. I'm expecting there still to be discussions, but then we shall
also hear other proposals. memcg controls are IMHO certainly future work.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
@ 2023-10-13 20:14           ` David Hildenbrand
  0 siblings, 0 replies; 140+ messages in thread
From: David Hildenbrand @ 2023-10-13 20:14 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: linux-mm, linux-kernel, linux-arm-kernel


>>
>> Yes, I think there are various ways forward regarding that. Or to enable "auto"
>> mode only once all are "auto", and as soon as one is not "auto", just disable
>> it. A simple
>>
>> echo "auto" > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled
> 
> I'm not really a fan, because this implies that you have a period where "auto"
> is reported for a size, but its not really in "auto" mode yet.

I think there are various alternatives that are feasible.

For most systems later, you'd want to just enable "auto" via a compile-time
CONFIG option as the default, or via some cmdline option like
"transparent_hugepage=auto".

> 
>>
>> Would do to enable it. Or, have them all be "global" and have a global "auto"
>> mode as you raised.
>>
>> echo "global" > /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled
>> echo "auto" > /sys/kernel/mm/transparent_hugepage/enabled
>>
> 
> Again, this isn't atomic either. I tend to prefer my proposal because it
> switches atomically - there are no weird intermediate states. Anyway, I guess
> the important point is we have demonstrated that your proposed interface could
> be extended to support "auto" in future, should we need it.

I don't think the atomic switch is really relevant. But that's probably 
a separate discussion.

[...]

>>
>> Just because a small-sized THP is PTE-mapped doesn't tell you anything, really.
>> What you want to know is if it is "completely" and "consecutively" mapped such
>> that the HW can actually benefit from it -- if HW even supports it. So
>> "PTE-mapped THP" is just part of the story. And that's where it gets tricky I
>> think.
>>
>> I agree that it's good for debugging, but then maybe it should a) live somewhere
>> else (debugfs, bucketing below) and b) be consistent with other THPs, meaning we
>> also want similar stats somewhere.
>>
>> One idea would be to expose such stats in a R/O fashion like "nr_allocated" or
>> "nr_hugepages" in /sys/kernel/mm/transparent_hugepage/hugepages-64kB/ and
>> friends. Of course, maybe tagging them with "anon" prefix.
> 
> I see your point, but I don't completely agree with it all. That said, given
> your conclusion at the bottom, perhaps we should park the discussion about the
> accounting for a separate series in future? Then we can focus on the ABI?

Yes!

> 
>>
>>>
>>> I would actually argue for adding similar counters for file-backed memory too
>>> for the same reasons. (I actually posted an independent patch a while back that
>>> did this for file- and anon- memory, bucketted by size. But I think the idea of
>>> the bucketting was NAKed.
>> For debugging, I *think* it might be valuable to see how many THP of each size
>> are allocated. Tracking exactly "how is it mapped" is not easy to achieve as we
>> learned. PMD-mapped was easy, but also requires us to keep doing that tracking
>> for all eternity ...
>>
>> Do you have a pointer to the patch set? Did it try to squeeze it into
>> /proc/meminfo?
> 
> I was actually only working on smaps/smaps_rollup, which has been my main tool
> for debugging. patches at [1].
> 
> [1] https://lore.kernel.org/linux-mm/20230613160950.3554675-1-ryan.roberts@arm.com/
> 

Thanks for the pointer!

[...]

>>>
>>> I'll need some help with clasifying them, so showing my working. Final list that
>>> I would propose as strict requirements at bottom.
>>>
>>> This is my list with status, as per response to Yu in other thread:
>>>
>>>     - David is working on "shared vs exclusive mappings"
>>
>> Probably "COW reuse support" is a separate item, although my approach would
>> cover that.
> 
> Yeah that's the in the original thread as (2), but I thought we were all agreed
> that is not a prerequisite so didn't bring it over here.

Agreed. Having a full list of todo items might be reasonable.

> 
>>
>> The question is, if the estimate we're using in most code for now would at least
>> be sufficient to merge it. The estimate is easily wrong, but we do have that
>> issue with PTE-mapped THP already.
> 
> Well as I understand it, at least the NUMA balancing code and khugepaged are
> ignoring all folios > order-0. That's probably ok for the occasional PTE-mapped
> THP, but I assume it becomes untenable if large folios are the norm. Perhaps we
> can modify those paths to work with the current estimators in order to remove
> your work from the critical path - is that what you are getting at?

IMHO most of the code that now uses the estimate-logic is really 
suboptimal, but it's all we have. It's probably interesting to see where 
the false negative/positives are tolerable for now ... I hate to be at 
the critical path ;) But I'm getting somewhere slowly but steadily 
(slowly, because I'm constantly distracted -- and apparently sick).

[...]

>>
>>
>>> Although, since sending that, I've determined that when running kernel
>>> compilation across high number of cores on arm64, the cost of splitting the
>>> folios gets large due to needing to broadcast the extra TLBIs. So I think the
>>> last point on that list may be a prerequisite after all. (I've been able to fix
>>> this by adding support for allocating large folios in the swap file, and
>>> avoiding the split - planning to send RFC this week).
>>>
>>> There is also this set of things that you mentioned against "shared vs exclusive
>>> mappings", which I'm not sure if you are planning to cover as part of your work
>>> or if they are follow on things that will need to be done:
>>>
>>> (1) Detecting shared folios, to not mess with them while they are shared.
>>>       MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ...
>>>       replace cases where folio_estimated_sharers() == 1 would currently be the
>>>       best we can do (and in some cases, page_mapcount() == 1).
>>>
>>> And I recently discovered that khugepaged doesn't collapse file-backed pages to
>>> a PMD-size THP if they belong to a large folio, so I'm guessing it may also
>>> suffer the same behaviour for anon memory. I'm not sure if that's what your
>>> "khugepaged ..." comment refers to?
>>
>> Yes. But I did not look into all the details yet.
>>
>> "kuhepaged" collapse support to small-sized THP is probably also a very imporant
>> item, although it might be less relevant than for PMD -- and I consider it
>> future work. See below.
> 
> Yes I agree that it's definitely future work. Nothing regresses from today's
> performance if you don't have that.
> 
>>
>>>
>>> So taking all that and trying to put together a complete outstanding list for
>>> strict requirements:
>>>
>>>     - Shared vs Exclusive Mappings (DavidH)
>>>         - user-triggered page migration
>>>         - NUMA hinting/balancing
>>>         - Enhance khugepaged to collapse to PMD-size from PTE-mapped large folios
>>>     - Compaction of Large Folios (Zi Yan)
>>>     - Swap out small-size THP without Split (Ryan Roberts)
>>
>> ^ that's going to be tough, I can promise. And the only way to live without that
>> would be khugepaged support. (because that's how it's all working for PMD-sized
>> THP after all!)
> 
> Are you referring specifically to the "swap out" line there? If so, it wasn't my
> plan that we would *swap in* large folios - only swap them *out* as large folios

Ah!

> to avoid the cost of splitting. Then when they come back in, the come in as
> single pages, just like PMD-sized THP, if I've understood things correctly. I
> have a patch working and showing the perf improvement as a result. I'm planning
> to post an RFC today, hopefully.
> 
> I don't see the swap-in side as a problem for the initial patch set. OK, they
> come in as single pages, so you lost the potential TLB benefits. But that's no
> worse than today's performance so not a regression. And the ratio of SW savings
> on THP allocation to HW savings from the TLB is very different for small-sized
> THP; much more of the benefit comes from the SW and that's still there.
> 
>>
>> Once a PMD-sized THP was swapped out and evicted, it will always come back in
>> order-0 folios. khugeged will re-collapse into PMD-sized chunks. If we could do
>> that for PTE-sized THP as well ...
> 
> Yes, sure, but that's a future improvement, not a requirement to prevent
> regression vs today, right?

Yes. You can't just currently assume: as soon as you swap, the whole 
benefit is gone because you end up will all order-0 pages.

These are certainly not limiting "merge" factors IMHO, but it's 
certainly one of the things users of distributions will heavily complain 
about ;)

PMD-sized THP are mostly self-healing in that sense.

> 
>>
>>>
>>>
>>>>
>>>>
>>>> Now, these are just my thoughts, and I'm happy about other thoughts.
>>>
>>> As always, thanks for taking the time - I really appreciate it.
>>
>> Sure. Hoping others can comment.
>>
>> My gut feeling is that it's best to focus on getting the sysfs interface
>> right+future proof and handling the stats independently. While being a good
>> debug mechanism, I wouldn't consider these stats a requirement: we don't have
>> them for file/shmem small-sized thp so far as well.
>>
>> So maybe really better to handle the stats in meminfo and friends separately.
>>
> 
> I'd be very happy with that approach if others are bought in.

Yeah. I'm expecting there still to be discussions, but then we shall 
also here other proposals. memcg controls are IMHO certainly future work.

-- 
Cheers,

David / dhildenb


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-10-06 20:06   ` David Hildenbrand
@ 2023-10-20 12:33     ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-10-20 12:33 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: linux-mm, linux-kernel, linux-arm-kernel

On 06/10/2023 21:06, David Hildenbrand wrote:
> On 29.09.23 13:44, Ryan Roberts wrote:
>> Hi All,
> 

[...]

>> NOTE: These changes should not be merged until the prerequisites are complete.
>> These are in progress and tracked at [7].
> 
> We should probably list them here, and classify which one we see as strict a
> requirement, which ones might be an optimization.
> 

Bringing back the discussion of prerequisites to this thread following the
discussion at the mm-alignment meeting on Wednesday.

Slides, updated following discussion to reflect all the agreed items that are
prerequisites and enhancements, are at [1].

I've taken a closer look at the situation with khugepaged, and can confirm that
it does correctly collapse anon small-sized THP into PMD-sized THP. I did notice
though, that one of the khugepaged selftests (collapse_max_ptes_none) fails when
small-sized THP is enabled+always. So I've fixed that test up and will add the
patch to the next version of my series.
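
For reference, roughly the setup needed to hit this (note: the per-size
"enabled" file below is the sysfs ABI proposed in this series, not something
that exists in mainline, and selftest paths/invocation can differ between
kernel versions):

  echo always > /sys/kernel/mm/transparent_hugepage/enabled
  echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
  make -C tools/testing/selftests/mm
  # then run the khugepaged selftest from tools/testing/selftests/mm and
  # watch the collapse_max_ptes_none case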

So I believe the khugepaged prerequisite can be marked as done.

[1]
https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA

Thanks,
Ryan


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-10-20 12:33     ` Ryan Roberts
@ 2023-10-25 16:24       ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-10-25 16:24 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: linux-mm, linux-kernel, linux-arm-kernel

On 20/10/2023 13:33, Ryan Roberts wrote:
> On 06/10/2023 21:06, David Hildenbrand wrote:
>> On 29.09.23 13:44, Ryan Roberts wrote:
>>> Hi All,
>>
> 
> [...]
> 
>>> NOTE: These changes should not be merged until the prerequisites are complete.
>>> These are in progress and tracked at [7].
>>
>> We should probably list them here, and classify which one we see as strict a
>> requirement, which ones might be an optimization.
>>
> 
> Bringing back the discussion of prerequisites to this thread following the
> discussion at the mm-alignment meeting on Wednesday.
> 
> Slides, updated following discussion to reflect all the agreed items that are
> prerequisites and enhancements, are at [1].
> 
> I've taken a closer look at the situation with khugepaged, and can confirm that
> it does correctly collapse anon small-sized THP into PMD-sized THP. I did notice
> though, that one of the khugepaged selftests (collapse_max_ptes_none) fails when
> small-sized THP is enabled+always. So I've fixed that test up and will add the
> patch to the next version of my series.
> 
> So I believe the khugepaged prerequisite can be marked as done.
> 
> [1]
> https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA

Hi All,

It's been a week since the mm alignment meeting discussion we had around
prerequisites and the ABI. I haven't heard any further feedback on the ABI
proposal, so I'm going to be optimistic and assume that nobody has found any
fatal flaws in it :).

Certainly, I think it held up to the potential future policies that Yu Zhao
cited on the call - the possibility of preferring a smaller size over a bigger
one, if the smaller size can be allocated without splitting a contiguous block.
I think the suggestion of adding a per-size priority file would solve it. And in
general because we have a per-size directory, that gives us lots of flexibility
for growth.

Anyway, given the lack of feedback, I'm proposing to spin a new version. I'm
planning to do the following:

  - Drop the accounting patch (#3); we will continue to only account PMD-sized
    THP for now. We can add more counters in future if needed. page cache large
    folios haven't needed any new counters yet.

  - Pivot to the ABI proposed by DavidH; per-size directories in a similar shape
    to that used by hugetlb (rough sketch after this list)

  - Drop the "recommend" keyword patch (#6); For now, users will need to
    understand implicitly which sizes are beneficial to their HW perf

  - Drop patch (#7); arch_wants_pte_order() is no longer needed due to dropping
    patch #6

  - Add patch for khugepaged selftest improvement (described in previous email
    above).

  - Ensure that PMD_ORDER is not assumed to be compile-time constant (current
    code is broken on powerpc)
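
For clarity, a rough sketch of how the per-size ABI would look from userspace;
names and accepted values follow the proposal in this thread and are not final:

  # one directory per supported size, mirroring the hugetlb sysfs layout, e.g.
  #   /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
  #   /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/

  # the global policy keeps its existing semantics
  echo always > /sys/kernel/mm/transparent_hugepage/enabled

  # per-size policy; "global" here means "follow the top-level setting"
  echo global > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
  echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled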

Please shout if you think this is the wrong approach.

On the prerequisites front, we have 2 items still to land:

  - compaction; Zi Yan is working on a v2

  - numa balancing; A developer has signed up and is working on it (I'll leave
    them to reveal themself as preferred).

Thanks,
Ryan


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-10-25 16:24       ` Ryan Roberts
@ 2023-10-25 18:47         ` David Hildenbrand
  -1 siblings, 0 replies; 140+ messages in thread
From: David Hildenbrand @ 2023-10-25 18:47 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: linux-mm, linux-kernel, linux-arm-kernel

On 25.10.23 18:24, Ryan Roberts wrote:
> On 20/10/2023 13:33, Ryan Roberts wrote:
>> On 06/10/2023 21:06, David Hildenbrand wrote:
>>> On 29.09.23 13:44, Ryan Roberts wrote:
>>>> Hi All,
>>>
>>
>> [...]
>>
>>>> NOTE: These changes should not be merged until the prerequisites are complete.
>>>> These are in progress and tracked at [7].
>>>
>>> We should probably list them here, and classify which one we see as strict a
>>> requirement, which ones might be an optimization.
>>>
>>
>> Bringing back the discussion of prerequistes to this thread following the
>> discussion at the mm-alignment meeting on Wednesday.
>>
>> Slides, updated following discussion to reflect all the agreed items that are
>> prerequisites and enhancements, are at [1].
>>
>> I've taken a closer look at the situation with khugepaged, and can confirm that
>> it does correctly collapse anon small-sized THP into PMD-sized THP. I did notice
>> though, that one of the khugepaged selftests (collapse_max_ptes_none) fails when
>> small-sized THP is enabled+always. So I've fixed that test up and will add the
>> patch to the next version of my series.
>>
>> So I believe the khugepaged prerequisite can be marked as done.
>>
>> [1]
>> https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA
> 
> Hi All,

Hi,

I wanted to remind people in the THP cabal meeting, but that either
didn't happen or Zoom decided not to let me join :)

> 
> It's been a week since the mm alignment meeting discussion we had around
> prerequisites and the ABI. I haven't heard any further feedback on the ABI
> proposal, so I'm going to be optimistic and assume that nobody has found any
> fatal flaws in it :).

After saying in the call probably 10 times that people should comment
here if there are reasonable alternatives worth discussing, call me
"optimistic" as well; but it's only been a week and people might still
be thinking about this.

There were two things discussed in the call:

* Yu brought up "lists" so we can have priorities. As briefly discussed
   in the  call, this (a) might not be needed right now in an initial
   version;  (b) the kernel might be able to handle that (or many cases)
   automatically, TBD. Adding lists now would kind-of set the semantics
   of that interface in stone. As you describe below, the approach
   discussed here could easily be extended to cover priorities, if need
   be.

* Hugh raised the point that "bitmap of orders" could be replaced by
   "added THP sizes", which really is "bitmap of orders" shifted to the
   left. To configure 2 MiB + 64 KiB, one would get "2097152 + 65536" =
   "2162688", or in KiB "2112" (quick arithmetic check below). Hm.

Both approaches would require single-option files like "enable_always", 
"enable_madvise" ... which I don't particularly like, but who am I to judge.


> 
> Certainly, I think it held up to the potential future policies that Yu Zhou
> cited on the call - the possibility of preferring a smaller size over a bigger
> one, if the smaller size can be allocated without splitting a contiguous block.
> I think the suggestion of adding a per-size priority file would solve it. And in
> general because we have a per-size directory, that gives us lots of flexibility
> for growth.

Jup, same opinion here. But again, I'm very happy to hear other 
alternatives and why they are better.

> 
> Anyway, given the lack of feedback, I'm proposing to spin a new version. I'm
> planning to do the following:
> 
>    - Drop the accounting patch (#3); we will continue to only account PMD-sized
>      THP for now. We can add more counters in future if needed. page cache large
>      folios haven't needed any new counters yet.
> 
>    - Pivot to the ABI proposed by DavidH; per-size directories in a similar shape
>      to that used by hugetlb
> 
>    - Drop the "recommend" keyword patch (#6); For now, users will need to
>      understand implicitly which sizes are beneficial to their HW perf
> 
>    - Drop patch (#7); arch_wants_pte_order() is no longer needed due to dropping
>      patch #6
> 
>    - Add patch for khugepaged selftest improvement (described in previous email
>      above).
> 
>    - Ensure that PMD_ORDER is not assumed to be compile-time constant (current
>      code is broken on powerpc)
> 
> Please shout if you think this is the wrong approach.

I'll shout that this sounds good to me; still, let's wait a bit more for
other opinions. It probably makes sense to post something after the
(upcoming) merge window, if there are no further discussions here.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-10-25 16:24       ` Ryan Roberts
@ 2023-10-25 19:10         ` John Hubbard
  -1 siblings, 0 replies; 140+ messages in thread
From: John Hubbard @ 2023-10-25 19:10 UTC (permalink / raw)
  To: Ryan Roberts, David Hildenbrand, Andrew Morton, Matthew Wilcox,
	Yin Fengwei, Yu Zhao, Catalin Marinas, Anshuman Khandual,
	Yang Shi, Huang, Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, David Rientjes, Vlastimil Babka,
	Hugh Dickins
  Cc: linux-mm, linux-kernel, linux-arm-kernel

On 10/25/23 09:24, Ryan Roberts wrote:
> On the prerequisites front, we have 2 items still to land:
> 
>    - compaction; Zi Yan is working on a v2
> 
>    - numa balancing; A developer has signed up and is working on it (I'll leave
>      them to reveal themself as preferred).
> 

Oh yes, that's me; thanks for providing an easy place to reply to for that.


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-10-25 18:47         ` David Hildenbrand
@ 2023-10-25 19:11           ` Yu Zhao
  -1 siblings, 0 replies; 140+ messages in thread
From: Yu Zhao @ 2023-10-25 19:11 UTC (permalink / raw)
  To: David Hildenbrand, Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Yin Fengwei, Catalin Marinas,
	Anshuman Khandual, Yang Shi, Huang, Ying, Zi Yan,
	Luis Chamberlain, Itaru Kitayama, Kirill A. Shutemov,
	John Hubbard, David Rientjes, Vlastimil Babka, Hugh Dickins,
	linux-mm, linux-kernel, linux-arm-kernel

On Wed, Oct 25, 2023 at 12:47 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 25.10.23 18:24, Ryan Roberts wrote:
> > On 20/10/2023 13:33, Ryan Roberts wrote:
> >> On 06/10/2023 21:06, David Hildenbrand wrote:
> >>> On 29.09.23 13:44, Ryan Roberts wrote:
> >>>> Hi All,
> >>>
> >>
> >> [...]
> >>
> >>>> NOTE: These changes should not be merged until the prerequisites are complete.
> >>>> These are in progress and tracked at [7].
> >>>
> >>> We should probably list them here, and classify which one we see as strict a
> >>> requirement, which ones might be an optimization.
> >>>
> >>
> >> Bringing back the discussion of prerequistes to this thread following the
> >> discussion at the mm-alignment meeting on Wednesday.
> >>
> >> Slides, updated following discussion to reflect all the agreed items that are
> >> prerequisites and enhancements, are at [1].
> >>
> >> I've taken a closer look at the situation with khugepaged, and can confirm that
> >> it does correctly collapse anon small-sized THP into PMD-sized THP. I did notice
> >> though, that one of the khugepaged selftests (collapse_max_ptes_none) fails when
> >> small-sized THP is enabled+always. So I've fixed that test up and will add the
> >> patch to the next version of my series.
> >>
> >> So I believe the khugepaged prerequisite can be marked as done.
> >>
> >> [1]
> >> https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA
> >
> > Hi All,
>
> Hi,
>
> I wanted to remind people in the THP cabal meeting, but that either
> didn't happen or zoomed decided to not let me join :)
>
> >
> > It's been a week since the mm alignment meeting discussion we had around
> > prerequisites and the ABI. I haven't heard any further feedback on the ABI
> > proposal, so I'm going to be optimistic and assume that nobody has found any
> > fatal flaws in it :).
>
> After saying in the call probably 10 times that people should comment
> here if there are reasonable alternatives worth discussing, call me
> "optimistic" as well; but, it's only been a week and people might still
> be thinking about this/
>
> There were two things discussed in the call:
>
> * Yu brought up "lists" so we can have priorities. As briefly discussed
>    in the  call, this (a) might not be needed right now in an initial
>    version;  (b) the kernel might be able to handle that (or many cases)
>    automatically, TBD. Adding lists now would kind-of set the semantics
>    of that interface in stone. As you describe below, the approach
>    discussed here could easily be extended to cover priorities, if need
>    be.

I want to expand on this: the argument that "if you could allocate a
higher order you should use it" is too simplistic. There are many
reasons in addition to the one above that we want to "fall back" to
higher orders, e.g., those higher orders are not on PCP or from the
local node. When we consider the sequence of orders to try, user
preference is just one of the parameters to the cost function. The
bottom line is that I think we should all agree that there needs to be
a cost function down the road, whatever it looks like. Otherwise I
don't know how we can make "auto" happen.
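
For context, some of the allocator state such a cost function would have to
weigh is already visible from userspace; purely as an illustration of the kind
of signal involved (not a proposed interface):

  # free blocks per order, per node/zone: roughly the "can this order be
  # allocated cheaply, and from the right node?" signal
  cat /proc/buddyinfo
  cat /proc/pagetypeinfo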

> * Hugh raised the point that "bitmap of orders" could be replaced by
>    "added THP sizes", which really is "bitmap of orders" shifted to the
>    left. To configure 2 MiB +  64Kib, one would get "2097152 + 65536" =
>    "2162688" or in KiB "2112". Hm.

I'm not a big fan of the "bitmap of orders" approach, because it
doesn't address my concern above.

> Both approaches would require single-option files like "enable_always",
> "enable_madvise" ... which I don't particularly like, but who am I to judge.
>
>
> >
> > Certainly, I think it held up to the potential future policies that Yu Zhou
> > cited on the call - the possibility of preferring a smaller size over a bigger
> > one, if the smaller size can be allocated without splitting a contiguous block.
> > I think the suggestion of adding a per-size priority file would solve it. And in
> > general because we have a per-size directory, that gives us lots of flexibility
> > for growth.
>
> Jup, same opinion here. But again, I'm very happy to hear other
> alternatives and why they are better.

I'm not against David's proposal but I want to hear a lot more about
"lots of flexibility for growth" before I'm fully convinced. Why do I
need more convincing? When I brought up that we need to consider the
priority of each order and the potential need to fall back to higher
orders during the meeting, I got the impression people were surprised
why we want to fall back to higher orders. TBH, I was surprised too
that this possibility was never considered. I missed today's THP
meeting too but I'll join next time and if anyone has more ideas on
this, we can spend some time discussing it, especially on how LAF
should cooperate with the page allocator to make better decisions.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-10-25 19:11           ` Yu Zhao
@ 2023-10-26  9:53             ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-10-26  9:53 UTC (permalink / raw)
  To: Yu Zhao, David Hildenbrand
  Cc: Andrew Morton, Matthew Wilcox, Yin Fengwei, Catalin Marinas,
	Anshuman Khandual, Yang Shi, Huang, Ying, Zi Yan,
	Luis Chamberlain, Itaru Kitayama, Kirill A. Shutemov,
	John Hubbard, David Rientjes, Vlastimil Babka, Hugh Dickins,
	linux-mm, linux-kernel, linux-arm-kernel

On 25/10/2023 20:11, Yu Zhao wrote:
> On Wed, Oct 25, 2023 at 12:47 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 25.10.23 18:24, Ryan Roberts wrote:
>>> On 20/10/2023 13:33, Ryan Roberts wrote:
>>>> On 06/10/2023 21:06, David Hildenbrand wrote:
>>>>> On 29.09.23 13:44, Ryan Roberts wrote:
>>>>>> Hi All,
>>>>>
>>>>
>>>> [...]
>>>>
>>>>>> NOTE: These changes should not be merged until the prerequisites are complete.
>>>>>> These are in progress and tracked at [7].
>>>>>
>>>>> We should probably list them here, and classify which one we see as strict a
>>>>> requirement, which ones might be an optimization.
>>>>>
>>>>
>>>> Bringing back the discussion of prerequistes to this thread following the
>>>> discussion at the mm-alignment meeting on Wednesday.
>>>>
>>>> Slides, updated following discussion to reflect all the agreed items that are
>>>> prerequisites and enhancements, are at [1].
>>>>
>>>> I've taken a closer look at the situation with khugepaged, and can confirm that
>>>> it does correctly collapse anon small-sized THP into PMD-sized THP. I did notice
>>>> though, that one of the khugepaged selftests (collapse_max_ptes_none) fails when
>>>> small-sized THP is enabled+always. So I've fixed that test up and will add the
>>>> patch to the next version of my series.
>>>>
>>>> So I believe the khugepaged prerequisite can be marked as done.
>>>>
>>>> [1]
>>>> https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA
>>>
>>> Hi All,
>>
>> Hi,
>>
>> I wanted to remind people in the THP cabal meeting, but that either
>> didn't happen or zoomed decided to not let me join :)

I didn't make it yesterday either - was having to juggle child care.

>>
>>>
>>> It's been a week since the mm alignment meeting discussion we had around
>>> prerequisites and the ABI. I haven't heard any further feedback on the ABI
>>> proposal, so I'm going to be optimistic and assume that nobody has found any
>>> fatal flaws in it :).
>>
>> After saying in the call probably 10 times that people should comment
>> here if there are reasonable alternatives worth discussing, call me
>> "optimistic" as well; but, it's only been a week and people might still
>> be thinking about this/
>>
>> There were two things discussed in the call:
>>
>> * Yu brought up "lists" so we can have priorities. As briefly discussed
>>    in the  call, this (a) might not be needed right now in an initial
>>    version;  (b) the kernel might be able to handle that (or many cases)
>>    automatically, TBD. Adding lists now would kind-of set the semantics
>>    of that interface in stone. As you describe below, the approach
>>    discussed here could easily be extended to cover priorities, if need
>>    be.
> 
> I want to expand on this: the argument that "if you could allocate a
> higher order you should use it" is too simplistic. There are many
> reasons in addition to the one above that we want to "fall back" to
> higher orders, e.g., those higher orders are not on PCP or from the
> local node. When we consider the sequence of orders to try, user
> preference is just one of the parameters to the cost function. The
> bottom line is that I think we should all agree that there needs to be
> a cost function down the road, whatever it looks like. Otherwise I
> don't know how we can make "auto" happen.

I don't dispute that this sounds like it could be beneficial, but I see it as
research to happen further down the road (as you say), and we don't know what
that research might conclude. Also, I think the scope of this is bigger than
anonymous memory - you would also likely want to look at the policy for page
cache folio order too, since today that's based solely on readahead. So I see it
as an optimization that is somewhat orthogonal to small-sized THP.

The proposed interface does not imply any preference order - it only states
which sizes the user wants the kernel to select from, so I think there is lots
of freedom to change this down the track if the kernel wants to start using the
buddy allocator's state as a signal to make its decisions.

> 
>> * Hugh raised the point that "bitmap of orders" could be replaced by
>>    "added THP sizes", which really is "bitmap of orders" shifted to the
>>    left. To configure 2 MiB +  64Kib, one would get "2097152 + 65536" =
>>    "2162688" or in KiB "2112". Hm.
> 
> I'm not a big fan of the "bitmap of orders" approach, because it
> doesn't address my concern above.
> 
>> Both approaches would require single-option files like "enable_always",
>> "enable_madvise" ... which I don't particularly like, but who am I to judge.
>>
>>
>>>
>>> Certainly, I think it held up to the potential future policies that Yu Zhou
>>> cited on the call - the possibility of preferring a smaller size over a bigger
>>> one, if the smaller size can be allocated without splitting a contiguous block.
>>> I think the suggestion of adding a per-size priority file would solve it. And in
>>> general because we have a per-size directory, that gives us lots of flexibility
>>> for growth.
>>
>> Jup, same opinion here. But again, I'm very happy to hear other
>> alternatives and why they are better.
> 
> I'm not against David's proposal but I want to hear a lot more about
> "lots of flexibility for growth" before I'm fully convinced. 

My point was that in an abstract sense, there are properties a user may wish to
apply individually to a size, which is catered for by having a per-size
directory into which we can add more files if/when requirements for new per-size
properties arise. There are also properties that may be applied globally, for
which we have the top-level transparent_hugepage directory where properties can
be extended or added.

For your case around tighter integration with the buddy allocator, I could
imagine a per-size file allowing the user to specify if the kernel should allow
splitting a higher order to make a THP of that size (I'm not suggesting that's a
good idea, I'm just pointing out that this sort of thing is possible with the
interface). And we have discussed how the global enabled property could be
extended to support "auto" [1].
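
To make the extensibility point concrete, a few purely hypothetical examples
of files that could later be added without disturbing anything that exists
today (none of these files exist; the names are invented for illustration):

  # per-size priority, as discussed on the call
  echo 1 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/priority
  # per-size control over whether splitting a larger free block is acceptable
  echo 0 > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/allow_block_split
  # and the global "enabled" growing an "auto" policy
  echo auto > /sys/kernel/mm/transparent_hugepage/enabled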

But perhaps what we really need are lots more ideas for future directions for
small-sized THP to allow us to evaluate this interface more widely.

> Why do I
> need more convincing? When I brought up that we need to consider the
> priority of each order and the potential need to fall back to higher
> orders during the meeting, I got the impression people were surprised
> why we want to fall back to higher orders. TBH, I was surprised too
> that this possibility was never considered. I missed today's THP
> meeting too but I'll join next time and if anyone has more ideas on
> this, we can spend some time discussing it, especially on how LAF
> should cooperate with the page allocator to make better decisions.

[1]
https://lore.kernel.org/linux-mm/99f8294b-b4e5-424f-d761-24a70a82cc1a@redhat.com/

Thanks,
Ryan


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
@ 2023-10-26  9:53             ` Ryan Roberts
  0 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-10-26  9:53 UTC (permalink / raw)
  To: Yu Zhao, David Hildenbrand
  Cc: Andrew Morton, Matthew Wilcox, Yin Fengwei, Catalin Marinas,
	Anshuman Khandual, Yang Shi, Huang, Ying, Zi Yan,
	Luis Chamberlain, Itaru Kitayama, Kirill A. Shutemov,
	John Hubbard, David Rientjes, Vlastimil Babka, Hugh Dickins,
	linux-mm, linux-kernel, linux-arm-kernel

On 25/10/2023 20:11, Yu Zhao wrote:
> On Wed, Oct 25, 2023 at 12:47 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 25.10.23 18:24, Ryan Roberts wrote:
>>> On 20/10/2023 13:33, Ryan Roberts wrote:
>>>> On 06/10/2023 21:06, David Hildenbrand wrote:
>>>>> On 29.09.23 13:44, Ryan Roberts wrote:
>>>>>> Hi All,
>>>>>
>>>>
>>>> [...]
>>>>
>>>>>> NOTE: These changes should not be merged until the prerequisites are complete.
>>>>>> These are in progress and tracked at [7].
>>>>>
>>>>> We should probably list them here, and classify which one we see as strict a
>>>>> requirement, which ones might be an optimization.
>>>>>
>>>>
>>>> Bringing back the discussion of prerequistes to this thread following the
>>>> discussion at the mm-alignment meeting on Wednesday.
>>>>
>>>> Slides, updated following discussion to reflect all the agreed items that are
>>>> prerequisites and enhancements, are at [1].
>>>>
>>>> I've taken a closer look at the situation with khugepaged, and can confirm that
>>>> it does correctly collapse anon small-sized THP into PMD-sized THP. I did notice
>>>> though, that one of the khugepaged selftests (collapse_max_ptes_none) fails when
>>>> small-sized THP is enabled+always. So I've fixed that test up and will add the
>>>> patch to the next version of my series.
>>>>
>>>> So I believe the khugepaged prerequisite can be marked as done.
>>>>
>>>> [1]
>>>> https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA
>>>
>>> Hi All,
>>
>> Hi,
>>
>> I wanted to remind people in the THP cabal meeting, but that either
>> didn't happen or zoomed decided to not let me join :)

I didn't make it yesterday either - was having to juggle child care.

>>
>>>
>>> It's been a week since the mm alignment meeting discussion we had around
>>> prerequisites and the ABI. I haven't heard any further feedback on the ABI
>>> proposal, so I'm going to be optimistic and assume that nobody has found any
>>> fatal flaws in it :).
>>
>> After saying in the call probably 10 times that people should comment
>> here if there are reasonable alternatives worth discussing, call me
>> "optimistic" as well; but, it's only been a week and people might still
>> be thinking about this/
>>
>> There were two things discussed in the call:
>>
>> * Yu brought up "lists" so we can have priorities. As briefly discussed
>>    in the  call, this (a) might not be needed right now in an initial
>>    version;  (b) the kernel might be able to handle that (or many cases)
>>    automatically, TBD. Adding lists now would kind-of set the semantics
>>    of that interface in stone. As you describe below, the approach
>>    discussed here could easily be extended to cover priorities, if need
>>    be.
> 
> I want to expand on this: the argument that "if you could allocate a
> higher order you should use it" is too simplistic. There are many
> reasons in addition to the one above that we want to "fall back" to
> higher orders, e.g., those higher orders are not on PCP or from the
> local node. When we consider the sequence of orders to try, user
> preference is just one of the parameters to the cost function. The
> bottom line is that I think we should all agree that there needs to be
> a cost function down the road, whatever it looks like. Otherwise I
> don't know how we can make "auto" happen.

I don't dispute that this sounds like it could be beneficial, but I see it as
research to happen further down the road (as you say), and we don't know what
that research might conclude. Also, I think the scope of this is bigger than
anonymous memory - you would also likely want to look at the policy for page
cache folio order too, since today that's based solely on readahead. So I see it
as an optimization that is somewhat orthogonal to small-sized THP.

The proposed interface does not imply any preference order - it only states
which sizes the user wants the kernel to select from, so I think there is lots
of freedom to change this down the track if the kernel wants to start using the
buddy allocator's state as a signal to make its decisions.

> 
>> * Hugh raised the point that "bitmap of orders" could be replaced by
>>    "added THP sizes", which really is "bitmap of orders" shifted to the
>>    left. To configure 2 MiB +  64Kib, one would get "2097152 + 65536" =
>>    "2162688" or in KiB "2112". Hm.
> 
> I'm not a big fan of the "bitmap of orders" approach, because it
> doesn't address my concern above.
> 
>> Both approaches would require single-option files like "enable_always",
>> "enable_madvise" ... which I don't particularly like, but who am I to judge.
>>
>>
>>>
>>> Certainly, I think it held up to the potential future policies that Yu Zhou
>>> cited on the call - the possibility of preferring a smaller size over a bigger
>>> one, if the smaller size can be allocated without splitting a contiguous block.
>>> I think the suggestion of adding a per-size priority file would solve it. And in
>>> general because we have a per-size directory, that gives us lots of flexibility
>>> for growth.
>>
>> Jup, same opinion here. But again, I'm very happy to hear other
>> alternatives and why they are better.
> 
> I'm not against David's proposal but I want to hear a lot more about
> "lots of flexibility for growth" before I'm fully convinced. 

My point was that in an abstract sense, there are properties a user may wish to
apply individually to a size, which is catered for by having a per-size
directory into which we can add more files if/when requirements for new per-size
properties arise. There are also properties that may be applied globally, for
which we have the top-level transparent_hugepage directory where properties can
be extended or added.

For your case around tighter integration with the buddy allocator, I could
imagine a per-size file allowing the user to specify if the kernel should allow
splitting a higher order to make a THP of that size (I'm not suggesting that's a
good idea, I'm just pointing out that this sort of thing is possible with the
interface). And we have discussed how the global enabled property could be
extended to support "auto" [1].

But perhaps what we really need are lots more ideas for future directions for
small-sized THP to allow us to evaluate this interface more widely.

> Why do I
> need more convincing? When I brought up that we need to consider the
> priority of each order and the potential need to fall back to higher
> orders during the meeting, I got the impression people were surprised
> why we want to fall back to higher orders. TBH, I was surprised too
> that this possibility was never considered. I missed today's THP
> meeting too but I'll join next time and if anyone has more ideas on
> this, we can spend some time discussing it, especially on how LAF
> should cooperate with the page allocator to make better decisions.

[1]
https://lore.kernel.org/linux-mm/99f8294b-b4e5-424f-d761-24a70a82cc1a@redhat.com/

Thanks,
Ryan


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-10-26  9:53             ` Ryan Roberts
@ 2023-10-26 15:19               ` David Hildenbrand
  -1 siblings, 0 replies; 140+ messages in thread
From: David Hildenbrand @ 2023-10-26 15:19 UTC (permalink / raw)
  To: Ryan Roberts, Yu Zhao
  Cc: Andrew Morton, Matthew Wilcox, Yin Fengwei, Catalin Marinas,
	Anshuman Khandual, Yang Shi, Huang, Ying, Zi Yan,
	Luis Chamberlain, Itaru Kitayama, Kirill A. Shutemov,
	John Hubbard, David Rientjes, Vlastimil Babka, Hugh Dickins,
	linux-mm, linux-kernel, linux-arm-kernel

[...]

>>> Hi,
>>>
>>> I wanted to remind people in the THP cabal meeting, but that either
>>> didn't happen or zoomed decided to not let me join :)
> 
> I didn't make it yesterday either - was having to juggle child care.

I think it didn't happen, or started quite late (>20 min).

> 
>>>
>>>>
>>>> It's been a week since the mm alignment meeting discussion we had around
>>>> prerequisites and the ABI. I haven't heard any further feedback on the ABI
>>>> proposal, so I'm going to be optimistic and assume that nobody has found any
>>>> fatal flaws in it :).
>>>
>>> After saying in the call probably 10 times that people should comment
>>> here if there are reasonable alternatives worth discussing, call me
>>> "optimistic" as well; but, it's only been a week and people might still
>>> be thinking about this.
>>>
>>> There were two things discussed in the call:
>>>
>>> * Yu brought up "lists" so we can have priorities. As briefly discussed
>>>     in the  call, this (a) might not be needed right now in an initial
>>>     version;  (b) the kernel might be able to handle that (or many cases)
>>>     automatically, TBD. Adding lists now would kind-of set the semantics
>>>     of that interface in stone. As you describe below, the approach
>>>     discussed here could easily be extended to cover priorities, if need
>>>     be.
>>
>> I want to expand on this: the argument that "if you could allocate a
>> higher order you should use it" is too simplistic. There are many
>> reasons in addition to the one above that we want to "fall back" to
>> higher orders, e.g., those higher orders are not on PCP or from the
>> local node. When we consider the sequence of orders to try, user
>> preference is just one of the parameters to the cost function. The
>> bottom line is that I think we should all agree that there needs to be
>> a cost function down the road, whatever it looks like. Otherwise I
>> don't know how we can make "auto" happen.

I agree that there needs to be a cost function, and as pagecache showed 
that's independent of initial enablement.

> 
> I don't dispute that this sounds like it could be beneficial, but I see it as
> research to happen further down the road (as you say), and we don't know what
> that research might conclude. Also, I think the scope of this is bigger than
> anonymous memory - you would also likely want to look at the policy for page
> cache folio order too, since today that's based solely on readahead. So I see it
> as an optimization that is somewhat orthogonal to small-sized THP.

Exactly my thoughts.

The important thing is that we plan ahead so that we still have the 
option to let the admin configure this, if we cannot make it work 
automatically in the kernel.

What we'll need, nobody knows. Maybe it's a per-size priority, maybe 
it's a single global toggle.

> 
> The proposed interface does not imply any preference order - it only states
> which sizes the user wants the kernel to select from, so I think there is lots
> of freedom to change this down the track if the kernel wants to start using the
> buddy allocator's state as a signal to make its decisions.

Yes.

[..]

>>> Jup, same opinion here. But again, I'm very happy to hear other
>>> alternatives and why they are better.
>>
>> I'm not against David's proposal but I want to hear a lot more about
>> "lots of flexibility for growth" before I'm fully convinced.
> 
> My point was that in an abstract sense, there are properties a user may wish to
> apply individually to a size, which is catered for by having a per-size
> directory into which we can add more files if/when requirements for new per-size
> properties arise. There are also properties that may be applied globally, for
> which we have the top-level transparent_hugepage directory where properties can
> be extended or added.

Exactly, well said.

> 
> For your case around tighter integration with the buddy allocator, I could
> imagine a per-size file allowing the user to specify if the kernel should allow
> splitting a higher order to make a THP of that size (I'm not suggesting that's a
> good idea, I'm just pointing out that this sort of thing is possible with the
> interface). And we have discussed how the global enabled property could be
> extended to support "auto" [1].
> 
> But perhaps what we really need are lots more ideas for future directions for
> small-sized THP to allow us to evaluate this interface more widely.

David R. motivated a future size-aware setting of the defrag option. As 
discussed, we might want something similar to shmem_enabled. What will 
happen with khugepaged, nobody knows yet :)

I could imagine exposing per-size boolean read-only properties like 
"native-hw-size" (PMD, cont-pte). But these things require much more 
thought.
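
Purely to illustrate how little such a read-only property would cost to add
(the attribute name and the helper are made up, this is not a proposal):

/* Hypothetical: report whether this size matches a HW coalescing size. */
static ssize_t native_hw_size_show(struct kobject *kobj,
				   struct kobj_attribute *attr, char *buf)
{
	return sysfs_emit(buf, "%d\n", thpsize_is_native_hw_size(kobj));
}
static struct kobj_attribute native_hw_size_attr = __ATTR_RO(native_hw_size);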

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 5/9] mm: thp: Extend THP to allocate anonymous large folios
  2023-09-29 11:44   ` Ryan Roberts
@ 2023-10-27 23:04     ` John Hubbard
  -1 siblings, 0 replies; 140+ messages in thread
From: John Hubbard @ 2023-10-27 23:04 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Anshuman Khandual,
	Yang Shi, Huang, Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, David Rientjes, Vlastimil Babka,
	Hugh Dickins
  Cc: linux-mm, linux-kernel, linux-arm-kernel

On 9/29/23 04:44, Ryan Roberts wrote:

Hi Ryan,

A few clarifying questions below.

...
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2e7c338229a6..c4860476a1f5 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
>   #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>   
>   /*
> - * Mask of all large folio orders supported for anonymous THP.
> + * Mask of all large folio orders supported for anonymous THP; all orders up to
> + * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
> + * (which is a limitation of the THP implementation).
>    */
> -#define THP_ORDERS_ALL_ANON	BIT(PMD_ORDER)
> +#define THP_ORDERS_ALL_ANON	((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
>   
>   /*
>    * Mask of all large folio orders supported for file THP.
> diff --git a/mm/memory.c b/mm/memory.c
> index b5b82fc8e164..92ed9c782dc9 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4059,6 +4059,87 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   	return ret;
>   }
>   
> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
> +{
> +	int i;
> +
> +	if (nr_pages == 1)
> +		return vmf_pte_changed(vmf);
> +
> +	for (i = 0; i < nr_pages; i++) {
> +		if (!pte_none(ptep_get_lockless(vmf->pte + i)))
> +			return true;

This seems like something different than the function name implies.
It's really confusing: for a single page case, return true if the
pte in the page tables has changed, yes that is very clear.

But then for multiple page cases, which is really the main
focus here--for that, claim that the range has changed if any
pte is present (!pte_none). Can you please help me understand
what this means?

> +	}
> +
> +	return false;
> +}
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> +{
> +	gfp_t gfp;
> +	pte_t *pte;
> +	unsigned long addr;
> +	struct folio *folio;
> +	struct vm_area_struct *vma = vmf->vma;
> +	unsigned int orders;
> +	int order;
> +
> +	/*
> +	 * If uffd is active for the vma we need per-page fault fidelity to
> +	 * maintain the uffd semantics.
> +	 */
> +	if (userfaultfd_armed(vma))
> +		goto fallback;
> +
> +	/*
> +	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
> +	 * for this vma. Then filter out the orders that can't be allocated over
> +	 * the faulting address and still be fully contained in the vma.
> +	 */
> +	orders = hugepage_vma_check(vma, vma->vm_flags, false, true, true,
> +				    BIT(PMD_ORDER) - 1);
> +	orders = transhuge_vma_suitable(vma, vmf->address, orders);
> +
> +	if (!orders)
> +		goto fallback;
> +
> +	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> +	if (!pte)
> +		return ERR_PTR(-EAGAIN);

pte_offset_map() can only fail due to:

     a) Wrong pmd type. These include:
         pmd_none
         pmd_bad
         pmd migration entry
         pmd_trans_huge
         pmd_devmap

     b) __pte_map() failure

For (a), why is it that -EAGAIN is used here? I see that that
will lead to a re-fault, I got that far, but am missing something
still.

For (b), same question, actually. I'm not completely sure why
a retry is going to fix a __pte_map() failure?


> +
> +	order = first_order(orders);
> +	while (orders) {
> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> +		vmf->pte = pte + pte_index(addr);
> +		if (!vmf_pte_range_changed(vmf, 1 << order))
> +			break;
> +		order = next_order(&orders, order);
> +	}
> +
> +	vmf->pte = NULL;
> +	pte_unmap(pte);
> +
> +	gfp = vma_thp_gfp_mask(vma);
> +
> +	while (orders) {
> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> +		folio = vma_alloc_folio(gfp, order, vma, addr, true);
> +		if (folio) {
> +			clear_huge_page(&folio->page, addr, 1 << order);
> +			return folio;
> +		}
> +		order = next_order(&orders, order);
> +	}

And finally: is it accurate to say that there are *no* special
page flags being set, for PTE-mapped THPs? I don't see any here,
but want to confirm.


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 5/9] mm: thp: Extend THP to allocate anonymous large folios
  2023-10-27 23:04     ` John Hubbard
@ 2023-10-30 11:43       ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-10-30 11:43 UTC (permalink / raw)
  To: John Hubbard, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Anshuman Khandual,
	Yang Shi, Huang, Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, David Rientjes, Vlastimil Babka,
	Hugh Dickins
  Cc: linux-mm, linux-kernel, linux-arm-kernel

On 28/10/2023 00:04, John Hubbard wrote:
> On 9/29/23 04:44, Ryan Roberts wrote:
> 
> Hi Ryan,
> 
> A few clarifying questions below.

Excellent - keep them coming!

> 
> ...
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 2e7c338229a6..c4860476a1f5 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
>>   #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>>     /*
>> - * Mask of all large folio orders supported for anonymous THP.
>> + * Mask of all large folio orders supported for anonymous THP; all orders up to
>> + * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
>> + * (which is a limitation of the THP implementation).
>>    */
>> -#define THP_ORDERS_ALL_ANON    BIT(PMD_ORDER)
>> +#define THP_ORDERS_ALL_ANON    ((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
>>     /*
>>    * Mask of all large folio orders supported for file THP.
>> diff --git a/mm/memory.c b/mm/memory.c
>> index b5b82fc8e164..92ed9c782dc9 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4059,6 +4059,87 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>       return ret;
>>   }
>>   +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
>> +{
>> +    int i;
>> +
>> +    if (nr_pages == 1)
>> +        return vmf_pte_changed(vmf);
>> +
>> +    for (i = 0; i < nr_pages; i++) {
>> +        if (!pte_none(ptep_get_lockless(vmf->pte + i)))
>> +            return true;
> 
> This seems like something different than the function name implies.
> It's really confusing: for a single page case, return true if the
> pte in the page tables has changed, yes that is very clear.
> 
> But then for multiple page cases, which is really the main
> focus here--for that, claim that the range has changed if any
> pte is present (!pte_none). Can you please help me understand
> what this means?

Yes, I understand your confusion. Although I'm confident that the code is
correct, it's a bad name - I'll make the excuse that this has evolved through
rebasing to cope with additions to UFFD. Perhaps something like
vmf_is_large_folio_suitable() is a better name.

It used to be that we would only take the do_anonymous_page() path if the pte
was none; i.e. this is the first time we are faulting on an address covered by
an anon VMA and we need to allocate some memory. But more recently we also end
up here if the pte is a uffd_wp marker. So for a single pte, instead of checking
none, we can check if the pte has changed from our original check (where we
determined it was a uffd_wp marker or none). But for multiple ptes, we don't
have storage to store all the original ptes from the first check.

Fortunately, if uffd is in use for a vma, then we don't want to use a large
folio anyway (this would break uffd semantics because we would no longer get a
fault for every page). So we only care about the "same but not none" case for
nr_pages=1.

Would changing the name to vmf_is_large_folio_suitable() help here?
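
For concreteness, here's roughly what I have in mind (sketch only, not the
posted code; note the sense of the return value flips with the rename):

/*
 * Returns true if it is still safe to map a large folio over this pte range.
 * For nr_pages == 1 we may have got here via a uffd_wp marker rather than
 * pte_none(), so compare against the originally observed pte. For
 * nr_pages > 1 we have no record of the original ptes, but uffd-armed vmas
 * never take the large folio path, so requiring pte_none() across the whole
 * range is sufficient.
 */
static bool vmf_is_large_folio_suitable(struct vm_fault *vmf, int nr_pages)
{
	int i;

	if (nr_pages == 1)
		return !vmf_pte_changed(vmf);

	for (i = 0; i < nr_pages; i++) {
		if (!pte_none(ptep_get_lockless(vmf->pte + i)))
			return false;
	}

	return true;
}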


> 
>> +    }
>> +
>> +    return false;
>> +}
>> +
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>> +{
>> +    gfp_t gfp;
>> +    pte_t *pte;
>> +    unsigned long addr;
>> +    struct folio *folio;
>> +    struct vm_area_struct *vma = vmf->vma;
>> +    unsigned int orders;
>> +    int order;
>> +
>> +    /*
>> +     * If uffd is active for the vma we need per-page fault fidelity to
>> +     * maintain the uffd semantics.
>> +     */
>> +    if (userfaultfd_armed(vma))
>> +        goto fallback;
>> +
>> +    /*
>> +     * Get a list of all the (large) orders below PMD_ORDER that are enabled
>> +     * for this vma. Then filter out the orders that can't be allocated over
>> +     * the faulting address and still be fully contained in the vma.
>> +     */
>> +    orders = hugepage_vma_check(vma, vma->vm_flags, false, true, true,
>> +                    BIT(PMD_ORDER) - 1);
>> +    orders = transhuge_vma_suitable(vma, vmf->address, orders);
>> +
>> +    if (!orders)
>> +        goto fallback;
>> +
>> +    pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>> +    if (!pte)
>> +        return ERR_PTR(-EAGAIN);
> 
> pte_offset_map() can only fail due to:
> 
>     a) Wrong pmd type. These include:
>         pmd_none
>         pmd_bad
>         pmd migration entry
>         pmd_trans_huge
>         pmd_devmap
> 
>     b) __pte_map() failure
> 
> For (a), why is it that -EAGAIN is used here? I see that that
> will lead to a re-fault, I got that far, but am missing something
> still.
> 
> For (b), same question, actually. I'm not completely sure why
> a retry is going to fix a __pte_map() failure?

I'm not going to claim to understand all the details of this. But this is due to
a change that Hugh introduced and we concluded at [1] that its always correct to
return EAGAIN here to rerun the fault. In fact, with the current implementation
pte_offset_map() should never fail for anon IIUC, but the view was that EAGAIN
makes it safe for tomorrow, and because this would only fail due to a race,
retrying is correct.

[1] https://lore.kernel.org/linux-mm/8bdfd8d8-5662-4615-86dc-d60259bd16d@google.com/


> 
> 
>> +
>> +    order = first_order(orders);
>> +    while (orders) {
>> +        addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>> +        vmf->pte = pte + pte_index(addr);
>> +        if (!vmf_pte_range_changed(vmf, 1 << order))
>> +            break;
>> +        order = next_order(&orders, order);
>> +    }
>> +
>> +    vmf->pte = NULL;
>> +    pte_unmap(pte);
>> +
>> +    gfp = vma_thp_gfp_mask(vma);
>> +
>> +    while (orders) {
>> +        addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>> +        folio = vma_alloc_folio(gfp, order, vma, addr, true);
>> +        if (folio) {
>> +            clear_huge_page(&folio->page, addr, 1 << order);
>> +            return folio;
>> +        }
>> +        order = next_order(&orders, order);
>> +    }
> 
> And finally: is it accurate to say that there are *no* special
> page flags being set, for PTE-mapped THPs? I don't see any here,
> but want to confirm.

The page flags are coming from 'gfp = vma_thp_gfp_mask(vma)', which pulls in the
correct flags based on transparent_hugepage/defrag file.

> 
> 
> thanks,


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 5/9] mm: thp: Extend THP to allocate anonymous large folios
  2023-10-30 11:43       ` Ryan Roberts
@ 2023-10-30 23:25         ` John Hubbard
  -1 siblings, 0 replies; 140+ messages in thread
From: John Hubbard @ 2023-10-30 23:25 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Anshuman Khandual,
	Yang Shi, Huang, Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, David Rientjes, Vlastimil Babka,
	Hugh Dickins
  Cc: linux-mm, linux-kernel, linux-arm-kernel

On 10/30/23 04:43, Ryan Roberts wrote:
> On 28/10/2023 00:04, John Hubbard wrote:
>> On 9/29/23 04:44, Ryan Roberts wrote:
...
>>>    +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
>>> +{
>>> +    int i;
>>> +
>>> +    if (nr_pages == 1)
>>> +        return vmf_pte_changed(vmf);
>>> +
>>> +    for (i = 0; i < nr_pages; i++) {
>>> +        if (!pte_none(ptep_get_lockless(vmf->pte + i)))
>>> +            return true;
>>
>> This seems like something different than the function name implies.
>> It's really confusing: for a single page case, return true if the
>> pte in the page tables has changed, yes that is very clear.
>>
>> But then for multiple page cases, which is really the main
>> focus here--for that, claim that the range has changed if any
>> pte is present (!pte_none). Can you please help me understand
>> what this means?
> 
> Yes, I understand your confusion. Although I'm confident that the code is
> correct, it's a bad name - I'll make the excuse that this has evolved through
> rebasing to cope with additions to UFFD. Perhaps something like
> vmf_is_large_folio_suitable() is a better name.
> 
> It used to be that we would only take the do_anonymous_page() path if the pte
> was none; i.e. this is the first time we are faulting on an address covered by
> an anon VMA and we need to allocate some memory. But more recently we also end
> up here if the pte is a uffd_wp marker. So for a single pte, instead of checking
> none, we can check if the pte has changed from our original check (where we
> determined it was a uffd_wp marker or none). But for multiple ptes, we don't
> have storage to store all the original ptes from the first check.
> 
> Fortunately, if uffd is in use for a vma, then we don't want to use a large
> folio anyway (this would break uffd semantics because we would no longer get a
> fault for every page). So we only care about the "same but not none" case for
> nr_pages=1.
> 
> Would changing the name to vmf_is_large_folio_suitable() help here?

Yes it would! And adding in a sentence or two from above about the uffd, as
a function-level comment, might be just the right amount of demystification for
the code.

...
>> pte_offset_map() can only fail due to:
>>
>>      a) Wrong pmd type. These include:
>>          pmd_none
>>          pmd_bad
>>          pmd migration entry
>>          pmd_trans_huge
>>          pmd_devmap
>>
>>      b) __pte_map() failure
>>
>> For (a), why is it that -EAGAIN is used here? I see that that
>> will lead to a re-fault, I got that far, but am missing something
>> still.
>>
>> For (b), same question, actually. I'm not completely sure why
>> a retry is going to fix a __pte_map() failure?
> 
> I'm not going to claim to understand all the details of this. But this is due to
> a change that Hugh introduced and we concluded at [1] that it's always correct to
> return EAGAIN here to rerun the fault. In fact, with the current implementation
> pte_offset_map() should never fail for anon IIUC, but the view was that EAGAIN
> makes it safe for tomorrow, and because this would only fail due to a race,
> retrying is correct.
> 
> [1] https://lore.kernel.org/linux-mm/8bdfd8d8-5662-4615-86dc-d60259bd16d@google.com/
> 

OK, got it.

...
>> And finally: is it accurate to say that there are *no* special
>> page flags being set, for PTE-mapped THPs? I don't see any here,
>> but want to confirm.
> 
> The page flags are coming from 'gfp = vma_thp_gfp_mask(vma)', which pulls in the
> correct flags based on transparent_hugepage/defrag file.
> 

OK that all is pretty clear now, thanks for the answers!


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-10-06 20:06   ` David Hildenbrand
@ 2023-10-31 11:50     ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-10-31 11:50 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: linux-mm, linux-kernel, linux-arm-kernel

On 06/10/2023 21:06, David Hildenbrand wrote:
[...]
> 
> Change 2: sysfs interface.
> 
> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
> agree.
> 
> What we expose there and how, is TBD. Again, not a friend of "orders" and
> bitmaps at all. We can do better if we want to go down that path.
> 
> Maybe we should take a look at hugetlb, and how they added support for multiple
> sizes. What *might* make sense could be (depending on which values we actually
> support!)
> 
> 
> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
> 
> Each one would contain an "enabled" and "defrag" file. We want something minimal
> first? Start with the "enabled" option.
> 
> 
> enabled: always [global] madvise never
> 
> Initially, we would set it for PMD-sized THP to "global" and for everything else
> to "never".

Hi David,

I've just started coding this, and it occurs to me that I might need a small
clarification here; the existing global "enabled" control is used to drive
decisions for both anonymous memory and (non-shmem) file-backed memory. But the
proposed new per-size "enabled" is implicitly only controlling anon memory (for
now).

1) Is this potentially confusing for the user? Should we rename the per-size
controls to "anon_enabled"? Or is it preferable to jsut keep it vague for now so
we can reuse the same control for file-backed memory in future?

2) The global control will continue to drive the file-backed memory decision
(for now), even when hugepages-2048kB/enabled != "global"; agreed?
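
To make (2) concrete with an example (values illustrative):

  # cat /sys/kernel/mm/transparent_hugepage/enabled
  always [madvise] never
  # cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
  always global madvise [never]

i.e. even though the per-size 2048kB control says "never", (non-shmem)
file-backed memory would still follow the global "madvise" setting.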

Thanks,
Ryan


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-10-31 11:50     ` Ryan Roberts
@ 2023-10-31 11:55       ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-10-31 11:55 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: linux-mm, linux-kernel, linux-arm-kernel

On 31/10/2023 11:50, Ryan Roberts wrote:
> On 06/10/2023 21:06, David Hildenbrand wrote:
> [...]
>>
>> Change 2: sysfs interface.
>>
>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
>> agree.
>>
>> What we expose there and how, is TBD. Again, not a friend of "orders" and
>> bitmaps at all. We can do better if we want to go down that path.
>>
>> Maybe we should take a look at hugetlb, and how they added support for multiple
>> sizes. What *might* make sense could be (depending on which values we actually
>> support!)
>>
>>
>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>>
>> Each one would contain an "enabled" and "defrag" file. We want something minimal
>> first? Start with the "enabled" option.
>>
>>
>> enabled: always [global] madvise never
>>
>> Initially, we would set it for PMD-sized THP to "global" and for everything else
>> to "never".
> 
> Hi David,
> 
> I've just started coding this, and it occurs to me that I might need a small
> clarification here; the existing global "enabled" control is used to drive
> decisions for both anonymous memory and (non-shmem) file-backed memory. But the
> proposed new per-size "enabled" is implicitly only controlling anon memory (for
> now).
> 
> 1) Is this potentially confusing for the user? Should we rename the per-size
> controls to "anon_enabled"? Or is it preferable to jsut keep it vague for now so
> we can reuse the same control for file-backed memory in future?
> 
> 2) The global control will continue to drive the file-backed memory decision
> (for now), even when hugepages-2048kB/enabled != "global"; agreed?
> 
> Thanks,
> Ryan
> 

Also, an implementation question:

hugepage_vma_check() doesn't currently care whether enabled="never" for DAX VMAs
(although it does honour MADV_NOHUGEPAGE and the prctl); it will return true
regardless. Is that by design? I couldn't fathom any reasoning from the commit log:

bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
			bool smaps, bool in_pf, bool enforce_sysfs)
{
	if (!vma->vm_mm)		/* vdso */
		return false;

	/*
	 * Explicitly disabled through madvise or prctl, or some
	 * architectures may disable THP for some mappings, for
	 * example, s390 kvm.
	 * */
	if ((vm_flags & VM_NOHUGEPAGE) ||
	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
		return false;
	/*
	 * If the hardware/firmware marked hugepage support disabled.
	 */
	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
		return false;

	/* khugepaged doesn't collapse DAX vma, but page fault is fine. */
	if (vma_is_dax(vma))
		return in_pf;  <<<<<<<<

	...
}



^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-10-31 11:50     ` Ryan Roberts
@ 2023-10-31 11:58       ` David Hildenbrand
  -1 siblings, 0 replies; 140+ messages in thread
From: David Hildenbrand @ 2023-10-31 11:58 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: linux-mm, linux-kernel, linux-arm-kernel

On 31.10.23 12:50, Ryan Roberts wrote:
> On 06/10/2023 21:06, David Hildenbrand wrote:
> [...]
>>
>> Change 2: sysfs interface.
>>
>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
>> agree.
>>
>> What we expose there and how, is TBD. Again, not a friend of "orders" and
>> bitmaps at all. We can do better if we want to go down that path.
>>
>> Maybe we should take a look at hugetlb, and how they added support for multiple
>> sizes. What *might* make sense could be (depending on which values we actually
>> support!)
>>
>>
>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>>
>> Each one would contain an "enabled" and "defrag" file. We want something minimal
>> first? Start with the "enabled" option.
>>
>>
>> enabled: always [global] madvise never
>>
>> Initially, we would set it for PMD-sized THP to "global" and for everything else
>> to "never".
> 
> Hi David,

Hi!

> 
> I've just started coding this, and it occurs to me that I might need a small
> clarification here; the existing global "enabled" control is used to drive
> decisions for both anonymous memory and (non-shmem) file-backed memory. But the
> proposed new per-size "enabled" is implicitly only controlling anon memory (for
> now).

Anon was (way) first, and pagecache later decided to reuse that one as 
an indication whether larger folios are desired.

For the pagecache, it's just a way to enable/disable it globally. As 
there is no memory waste, nobody currently really cares about the exact 
sizes the pagecache is allocating (maybe that will change at some point, 
maybe not, who knows).

> 
> 1) Is this potentially confusing for the user? Should we rename the per-size
> controls to "anon_enabled"? Or is it preferable to jsut keep it vague for now so
> we can reuse the same control for file-backed memory in future?

The latter would be my take. Just like we did with the global toggle.

> 
> 2) The global control will continue to drive the file-backed memory decision
> (for now), even when hugepages-2048kB/enabled != "global"; agreed?

That would be my take; it will allocate other sizes already, so just 
glue it to the global toggle and document for the other toggles that 
they only control anonymous THP for now.
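
For illustration only, here is a rough sketch of the semantics I think
we are converging on; the enum, array and helper below are invented for
this mail and are not from any patch:

#include <linux/mm.h>

enum thp_enabled { THP_NEVER, THP_MADVISE, THP_ALWAYS, THP_GLOBAL };

#define ANON_THP_MAX_ORDER	9	/* PMD order with 4K base pages */

/* Per-size toggles; PMD size defaults to "global", all others to "never". */
static enum thp_enabled thp_enabled_per_order[ANON_THP_MAX_ORDER + 1] = {
	[ANON_THP_MAX_ORDER] = THP_GLOBAL,
};
static enum thp_enabled thp_enabled_global = THP_MADVISE;

/*
 * Only the anonymous fault path would consult the per-size toggle;
 * file-backed (and DAX) paths keep looking at the global toggle for now.
 */
static bool anon_thp_order_allowed(struct vm_area_struct *vma, int order)
{
	enum thp_enabled e = thp_enabled_per_order[order];

	if (e == THP_GLOBAL)
		e = thp_enabled_global;

	switch (e) {
	case THP_ALWAYS:
		return true;
	case THP_MADVISE:
		return !!(vma->vm_flags & VM_HUGEPAGE);
	default:
		return false;
	}
}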

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-10-31 11:55       ` Ryan Roberts
@ 2023-10-31 12:03         ` David Hildenbrand
  -1 siblings, 0 replies; 140+ messages in thread
From: David Hildenbrand @ 2023-10-31 12:03 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: linux-mm, linux-kernel, linux-arm-kernel

On 31.10.23 12:55, Ryan Roberts wrote:
> On 31/10/2023 11:50, Ryan Roberts wrote:
>> On 06/10/2023 21:06, David Hildenbrand wrote:
>> [...]
>>>
>>> Change 2: sysfs interface.
>>>
>>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
>>> agree.
>>>
>>> What we expose there and how, is TBD. Again, not a friend of "orders" and
>>> bitmaps at all. We can do better if we want to go down that path.
>>>
>>> Maybe we should take a look at hugetlb, and how they added support for multiple
>>> sizes. What *might* make sense could be (depending on which values we actually
>>> support!)
>>>
>>>
>>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>>>
>>> Each one would contain an "enabled" and "defrag" file. We want something minimal
>>> first? Start with the "enabled" option.
>>>
>>>
>>> enabled: always [global] madvise never
>>>
>>> Initially, we would set it for PMD-sized THP to "global" and for everything else
>>> to "never".
>>
>> Hi David,
>>
>> I've just started coding this, and it occurs to me that I might need a small
>> clarification here; the existing global "enabled" control is used to drive
>> decisions for both anonymous memory and (non-shmem) file-backed memory. But the
>> proposed new per-size "enabled" is implicitly only controlling anon memory (for
>> now).
>>
>> 1) Is this potentially confusing for the user? Should we rename the per-size
>> controls to "anon_enabled"? Or is it preferable to just keep it vague for now so
>> we can reuse the same control for file-backed memory in future?
>>
>> 2) The global control will continue to drive the file-backed memory decision
>> (for now), even when hugepages-2048kB/enabled != "global"; agreed?
>>
>> Thanks,
>> Ryan
>>
> 
> Also, an implementation question:
> 
> hugepage_vma_check() doesn't currently care whether enabled="never" for DAX VMAs
> (although it does honour MADV_NOHUGEPAGE and the prctl); it will return true
> regardless. Is that by design? I couldn't fathom any reasoning from the commit log:

The whole DAX "hugepage" and THP mixup is just plain confusing. We're 
simply using PUD/PMD mappings of DAX memory, and PMD/PTE-remap when 
required (VMA split I assume, COW).

It doesn't result in any memory waste, so who really cares how it's 
mapped? Apparently we want individual processes to just disable PMD/PUD 
mappings of DAX using the prctl and madvise. Maybe there are good reasons.

Looks like a design decision, probably some legacy leftovers.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-10-31 11:58       ` David Hildenbrand
@ 2023-10-31 13:12         ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-10-31 13:12 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: linux-mm, linux-kernel, linux-arm-kernel

On 31/10/2023 11:58, David Hildenbrand wrote:
> On 31.10.23 12:50, Ryan Roberts wrote:
>> On 06/10/2023 21:06, David Hildenbrand wrote:
>> [...]
>>>
>>> Change 2: sysfs interface.
>>>
>>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
>>> agree.
>>>
>>> What we expose there and how, is TBD. Again, not a friend of "orders" and
>>> bitmaps at all. We can do better if we want to go down that path.
>>>
>>> Maybe we should take a look at hugetlb, and how they added support for multiple
>>> sizes. What *might* make sense could be (depending on which values we actually
>>> support!)
>>>
>>>
>>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
>>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>>>
>>> Each one would contain an "enabled" and "defrag" file. We want something minimal
>>> first? Start with the "enabled" option.
>>>
>>>
>>> enabled: always [global] madvise never
>>>
>>> Initially, we would set it for PMD-sized THP to "global" and for everything else
>>> to "never".
>>
>> Hi David,
> 
> Hi!
> 
>>
>> I've just started coding this, and it occurs to me that I might need a small
>> clarification here; the existing global "enabled" control is used to drive
>> decisions for both anonymous memory and (non-shmem) file-backed memory. But the
>> proposed new per-size "enabled" is implicitly only controlling anon memory (for
>> now).
> 
> Anon was (way) first, and pagecache later decided to reuse that one as an
> indication whether larger folios are desired.
> 
> For the pagecache, it's just a way to enable/disable it globally. As there is no
> memory waste, nobody currently really cares about the exact sizes the pagecache
> is allocating (maybe that will change at some point, maybe not, who knows).

Yup. It's not _just_ about allocation though; it's also about collapse
(MADV_COLLAPSE, khugepaged) which is supported for pagecache pages. I can
imagine value in collapsing to various sizes that are beneficial for HW...
anyway that's for another day.
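
For reference, this is the sort of userspace request I mean - collapse of
part of a read-only file mapping. Purely an illustrative snippet (and for
regular files it depends on CONFIG_READ_ONLY_THP_FOR_FS); today it can
only target PMD size:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* uapi value; present since Linux 6.1 */
#endif

int main(int argc, char **argv)
{
	size_t len = 2UL << 20;		/* one PMD-sized (2M) range */
	char *p;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file of at least 2M>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Ask the kernel to (synchronously) collapse this range. */
	if (madvise(p, len, MADV_COLLAPSE))
		perror("madvise(MADV_COLLAPSE)");

	return 0;
}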

> 
>>
>> 1) Is this potentially confusing for the user? Should we rename the per-size
>> controls to "anon_enabled"? Or is it preferable to just keep it vague for now so
>> we can reuse the same control for file-backed memory in future?
> 
> The latter would be my take. Just like we did with the global toggle.

ACK

> 
>>
>> 2) The global control will continue to drive the file-backed memory decision
>> (for now), even when hugepages-2048kB/enabled != "global"; agreed?
> 
> That would be my take; it will allocate other sizes already, so just glue it to
> the global toggle and document for the other toggles that they only control
> anonymous THP for now.

ACK

> 
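
FWIW, on the sysfs wiring itself, I'm picturing the per-size directories
being created along the lines of the sketch below; this is purely to check
we mean the same thing (helper names invented, show/store bodies stubbed):

#include <linux/kobject.h>
#include <linux/sysfs.h>
#include <linux/sizes.h>
#include <linux/mm.h>

static ssize_t enabled_show(struct kobject *kobj,
			    struct kobj_attribute *attr, char *buf)
{
	/* Real code would look up this size's state; stubbed in the sketch. */
	return sysfs_emit(buf, "always [global] madvise never\n");
}

static ssize_t enabled_store(struct kobject *kobj,
			     struct kobj_attribute *attr,
			     const char *buf, size_t count)
{
	/* Parse "always"/"global"/"madvise"/"never" and record it here. */
	return count;
}

static struct kobj_attribute enabled_attr =
	__ATTR(enabled, 0644, enabled_show, enabled_store);

static struct attribute *thpsize_attrs[] = {
	&enabled_attr.attr,
	NULL,
};

static const struct attribute_group thpsize_attr_group = {
	.attrs = thpsize_attrs,
};

/* Create /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/ */
static struct kobject *thpsize_create(int order, struct kobject *parent)
{
	struct kobject *kobj;
	char name[32];

	snprintf(name, sizeof(name), "hugepages-%lukB",
		 (PAGE_SIZE << order) / SZ_1K);

	kobj = kobject_create_and_add(name, parent);
	if (!kobj)
		return NULL;

	if (sysfs_create_group(kobj, &thpsize_attr_group)) {
		kobject_put(kobj);
		return NULL;
	}

	return kobj;
}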


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-10-31 12:03         ` David Hildenbrand
@ 2023-10-31 13:13           ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-10-31 13:13 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, John Hubbard, David Rientjes,
	Vlastimil Babka, Hugh Dickins
  Cc: linux-mm, linux-kernel, linux-arm-kernel

On 31/10/2023 12:03, David Hildenbrand wrote:
> On 31.10.23 12:55, Ryan Roberts wrote:
>> On 31/10/2023 11:50, Ryan Roberts wrote:
>>> On 06/10/2023 21:06, David Hildenbrand wrote:
>>> [...]
>>>>
>>>> Change 2: sysfs interface.
>>>>
>>>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
>>>> agree.
>>>>
>>>> What we expose there and how, is TBD. Again, not a friend of "orders" and
>>>> bitmaps at all. We can do better if we want to go down that path.
>>>>
>>>> Maybe we should take a look at hugetlb, and how they added support for multiple
>>>> sizes. What *might* make sense could be (depending on which values we actually
>>>> support!)
>>>>
>>>>
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>>>>
>>>> Each one would contain an "enabled" and "defrag" file. We want something
>>>> minimal
>>>> first? Start with the "enabled" option.
>>>>
>>>>
>>>> enabled: always [global] madvise never
>>>>
>>>> Initially, we would set it for PMD-sized THP to "global" and for everything
>>>> else
>>>> to "never".
>>>
>>> Hi David,
>>>
>>> I've just started coding this, and it occurs to me that I might need a small
>>> clarification here; the existing global "enabled" control is used to drive
>>> decisions for both anonymous memory and (non-shmem) file-backed memory. But the
>>> proposed new per-size "enabled" is implicitly only controlling anon memory (for
>>> now).
>>>
>>> 1) Is this potentially confusing for the user? Should we rename the per-size
>>> controls to "anon_enabled"? Or is it preferable to just keep it vague for now so
>>> we can reuse the same control for file-backed memory in future?
>>>
>>> 2) The global control will continue to drive the file-backed memory decision
>>> (for now), even when hugepages-2048kB/enabled != "global"; agreed?
>>>
>>> Thanks,
>>> Ryan
>>>
>>
>> Also, an implementation question:
>>
>> hugepage_vma_check() doesn't currently care whether enabled="never" for DAX VMAs
>> (although it does honour MADV_NOHUGEPAGE and the prctl); it will return true
>> regardless. Is that by design? I couldn't fathom any reasoning from the
>> commit log:
> 
> The whole DAX "hugepage" and THP mixup is just plain confusing. We're simply
> using PUD/PMD mappings of DAX memory, and PMD/PTE-remap when required (VMA
> split I assume, COW).
> 
> It doesn't result in any memory waste, so who really cares how it's mapped?
> Apparently we want individual processes to just disable PMD/PUD mappings of DAX
> using the prctl and madvise. Maybe there are good reasons.
> 
> Looks like a design decision, probably some legacy leftovers.

OK, I'll ensure I keep this behaviour.

Thanks!

> 


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-10-31 11:55       ` Ryan Roberts
@ 2023-10-31 18:29         ` Yang Shi
  -1 siblings, 0 replies; 140+ messages in thread
From: Yang Shi @ 2023-10-31 18:29 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: David Hildenbrand, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Huang, Ying, Zi Yan,
	Luis Chamberlain, Itaru Kitayama, Kirill A. Shutemov,
	John Hubbard, David Rientjes, Vlastimil Babka, Hugh Dickins,
	linux-mm, linux-kernel, linux-arm-kernel

On Tue, Oct 31, 2023 at 4:55 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 31/10/2023 11:50, Ryan Roberts wrote:
> > On 06/10/2023 21:06, David Hildenbrand wrote:
> > [...]
> >>
> >> Change 2: sysfs interface.
> >>
> >> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
> >> agree.
> >>
> >> What we expose there and how, is TBD. Again, not a friend of "orders" and
> >> bitmaps at all. We can do better if we want to go down that path.
> >>
> >> Maybe we should take a look at hugetlb, and how they added support for multiple
> >> sizes. What *might* make sense could be (depending on which values we actually
> >> support!)
> >>
> >>
> >> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
> >> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
> >> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
> >> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
> >> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
> >> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
> >>
> >> Each one would contain an "enabled" and "defrag" file. We want something minimal
> >> first? Start with the "enabled" option.
> >>
> >>
> >> enabled: always [global] madvise never
> >>
> >> Initially, we would set it for PMD-sized THP to "global" and for everything else
> >> to "never".
> >
> > Hi David,
> >
> > I've just started coding this, and it occurs to me that I might need a small
> > clarification here; the existing global "enabled" control is used to drive
> > decisions for both anonymous memory and (non-shmem) file-backed memory. But the
> > proposed new per-size "enabled" is implicitly only controlling anon memory (for
> > now).
> >
> > 1) Is this potentially confusing for the user? Should we rename the per-size
> > controls to "anon_enabled"? Or is it preferable to just keep it vague for now so
> > we can reuse the same control for file-backed memory in future?
> >
> > 2) The global control will continue to drive the file-backed memory decision
> > (for now), even when hugepages-2048kB/enabled != "global"; agreed?
> >
> > Thanks,
> > Ryan
> >
>
> Also, an implementation question:
>
> hugepage_vma_check() doesn't currently care whether enabled="never" for DAX VMAs
> (although it does honour MADV_NOHUGEPAGE and the prctl); it will return true
> regardless. Is that by design? I couldn't fathom any reasoning from the commit log:

The enabled="never" is for anonymous VMAs, DAX VMAs are typically file VMAs.

>
> bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
>                         bool smaps, bool in_pf, bool enforce_sysfs)
> {
>         if (!vma->vm_mm)                /* vdso */
>                 return false;
>
>         /*
>          * Explicitly disabled through madvise or prctl, or some
>          * architectures may disable THP for some mappings, for
>          * example, s390 kvm.
>          * */
>         if ((vm_flags & VM_NOHUGEPAGE) ||
>             test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>                 return false;
>         /*
>          * If the hardware/firmware marked hugepage support disabled.
>          */
>         if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
>                 return false;
>
>         /* khugepaged doesn't collapse DAX vma, but page fault is fine. */
>         if (vma_is_dax(vma))
>                 return in_pf;  <<<<<<<<
>
>         ...
> }
>
>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 5/9] mm: thp: Extend THP to allocate anonymous large folios
  2023-10-30 23:25         ` John Hubbard
@ 2023-11-01 13:56           ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-11-01 13:56 UTC (permalink / raw)
  To: John Hubbard, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Anshuman Khandual,
	Yang Shi, Huang, Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, David Rientjes, Vlastimil Babka,
	Hugh Dickins
  Cc: linux-mm, linux-kernel, linux-arm-kernel

On 30/10/2023 23:25, John Hubbard wrote:
> On 10/30/23 04:43, Ryan Roberts wrote:
>> On 28/10/2023 00:04, John Hubbard wrote:
>>> On 9/29/23 04:44, Ryan Roberts wrote:
> ...
>>>>    +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
>>>> +{
>>>> +    int i;
>>>> +
>>>> +    if (nr_pages == 1)
>>>> +        return vmf_pte_changed(vmf);
>>>> +
>>>> +    for (i = 0; i < nr_pages; i++) {
>>>> +        if (!pte_none(ptep_get_lockless(vmf->pte + i)))
>>>> +            return true;
>>>
>>> This seems like something different than the function name implies.
>>> It's really confusing: for a single page case, return true if the
>>> pte in the page tables has changed, yes that is very clear.
>>>
>>> But then for multiple page cases, which is really the main
>>> focus here--for that, claim that the range has changed if any
>>> pte is present (!pte_none). Can you please help me understand
>>> what this means?
>>
>> Yes I understand your confusion. Although I'm confident that the code is
>> correct, it's a bad name - I'll make the excuse that this has evolved through
>> rebasing to cope with additions to UFFD. Perhaps something like
>> vmf_is_large_folio_suitable() is a better name.
>>
>> It used to be that we would only take the do_anonymous_page() path if the pte
>> was none; i.e. this is the first time we are faulting on an address covered by
>> an anon VMA and we need to allocate some memory. But more recently we also end
>> up here if the pte is a uffd_wp marker. So for a single pte, instead of checking
>> none, we can check if the pte has changed from our original check (where we
>> determined it was a uffd_wp marker or none). But for multiple ptes, we don't
>> have storage to store all the original ptes from the first check.
>>
>> Fortunately, if uffd is in use for a vma, then we don't want to use a large
>> folio anyway (this would break uffd semantics because we would no longer get a
>> fault for every page). So we only care about the "same but not none" case for
>> nr_pages=1.
>>
>> Would changing the name to vmf_is_large_folio_suitable() help here?
> 
> Yes it would! And adding in a sentence or two from above about the uffd, as
> a function-level comment might be just the right amount of demystification for
> the code.

Actually I don't think the name I proposed is quite right either - this gets
called for small folios too.

I think it's cleaner to change the name to vmf_pte_range_none() and strip out the
nr_pages==1 case. The checking-for-none part is required by alloc_anon_folio()
and needs to be safe without holding the PTL. vmf_pte_changed() is not safe
without the lock. So I've just hoisted the nr_pages==1 case directly into
do_anonymous_page(). Shout if you think we can do better:


diff --git a/mm/memory.c b/mm/memory.c
index 569c828b1cdc..b48e4de1bf20 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4117,19 +4117,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
        return ret;
 }

-static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
+static bool pte_range_none(pte_t *pte, int nr_pages)
 {
        int i;

-       if (nr_pages == 1)
-               return vmf_pte_changed(vmf);
-
        for (i = 0; i < nr_pages; i++) {
-               if (!pte_none(ptep_get_lockless(vmf->pte + i)))
-                       return true;
+               if (!pte_none(ptep_get_lockless(pte + i)))
+                       return false;
        }

-       return false;
+       return true;
 }

 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -4170,7 +4167,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
        while (orders) {
                addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
                vmf->pte = pte + pte_index(addr);
-               if (!vmf_pte_range_changed(vmf, 1 << order))
+               if (pte_range_none(vmf->pte, 1 << order))
                        break;
                order = next_order(&orders, order);
        }
@@ -4280,7 +4277,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
        if (!vmf->pte)
                goto release;
-       if (vmf_pte_range_changed(vmf, nr_pages)) {
+       if ((nr_pages == 1 && vmf_pte_changed(vmf)) ||
+           (nr_pages  > 1 && !pte_range_none(vmf->pte, nr_pages))) {
                for (i = 0; i < nr_pages; i++)
                        update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
                goto release;
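
And for the function-level comment that John suggested, I'm thinking of
something along these lines (wording is just a first stab):

/*
 * Returns true if all ptes in the range are none (i.e. the range is
 * still unpopulated). Must be usable without holding the PTL, hence
 * ptep_get_lockless(). It deliberately does not handle the "changed
 * but not none" case (e.g. a uffd_wp marker): if uffd is in use on the
 * VMA we never allocate a large folio anyway, so that case only
 * matters for nr_pages == 1 and is handled directly in
 * do_anonymous_page() via vmf_pte_changed().
 */
static bool pte_range_none(pte_t *pte, int nr_pages)
{
	int i;

	for (i = 0; i < nr_pages; i++) {
		if (!pte_none(ptep_get_lockless(pte + i)))
			return false;
	}

	return true;
}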


^ permalink raw reply related	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-10-31 18:29         ` Yang Shi
@ 2023-11-01 14:02           ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-11-01 14:02 UTC (permalink / raw)
  To: Yang Shi
  Cc: David Hildenbrand, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Huang, Ying, Zi Yan,
	Luis Chamberlain, Itaru Kitayama, Kirill A. Shutemov,
	John Hubbard, David Rientjes, Vlastimil Babka, Hugh Dickins,
	linux-mm, linux-kernel, linux-arm-kernel

On 31/10/2023 18:29, Yang Shi wrote:
> On Tue, Oct 31, 2023 at 4:55 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 31/10/2023 11:50, Ryan Roberts wrote:
>>> On 06/10/2023 21:06, David Hildenbrand wrote:
>>> [...]
>>>>
>>>> Change 2: sysfs interface.
>>>>
>>>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
>>>> agree.
>>>>
>>>> What we expose there and how, is TBD. Again, not a friend of "orders" and
>>>> bitmaps at all. We can do better if we want to go down that path.
>>>>
>>>> Maybe we should take a look at hugetlb, and how they added support for multiple
>>>> sizes. What *might* make sense could be (depending on which values we actually
>>>> support!)
>>>>
>>>>
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
>>>>
>>>> Each one would contain an "enabled" and "defrag" file. We want something minimal
>>>> first? Start with the "enabled" option.
>>>>
>>>>
>>>> enabled: always [global] madvise never
>>>>
>>>> Initially, we would set it for PMD-sized THP to "global" and for everything else
>>>> to "never".
>>>
>>> Hi David,
>>>
>>> I've just started coding this, and it occurs to me that I might need a small
>>> clarification here; the existing global "enabled" control is used to drive
>>> decisions for both anonymous memory and (non-shmem) file-backed memory. But the
>>> proposed new per-size "enabled" is implicitly only controlling anon memory (for
>>> now).
>>>
>>> 1) Is this potentially confusing for the user? Should we rename the per-size
>>> controls to "anon_enabled"? Or is it preferable to just keep it vague for now so
>>> we can reuse the same control for file-backed memory in future?
>>>
>>> 2) The global control will continue to drive the file-backed memory decision
>>> (for now), even when hugepages-2048kB/enabled != "global"; agreed?
>>>
>>> Thanks,
>>> Ryan
>>>
>>
>> Also, an implementation question:
>>
>> hugepage_vma_check() doesn't currently care whether enabled="never" for DAX VMAs
>> (although it does honour MADV_NOHUGEPAGE and the prctl); it will return true
>> regardless. Is that by design? I couldn't fathom any reasoning from the commit log:
> 
> The enabled="never" is for anonymous VMAs; DAX VMAs are typically file VMAs.

That's not quite true; enabled="never" is honoured for non-DAX/non-shmem file
VMAs (for collapse via CONFIG_READ_ONLY_THP_FOR_FS and more recently for
anything that implements huge_fault() - see
7a81751fcdeb833acc858e59082688e3020bfe12).


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-11-01 14:02           ` Ryan Roberts
@ 2023-11-01 18:11             ` Yang Shi
  -1 siblings, 0 replies; 140+ messages in thread
From: Yang Shi @ 2023-11-01 18:11 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: David Hildenbrand, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Huang, Ying, Zi Yan,
	Luis Chamberlain, Itaru Kitayama, Kirill A. Shutemov,
	John Hubbard, David Rientjes, Vlastimil Babka, Hugh Dickins,
	linux-mm, linux-kernel, linux-arm-kernel

On Wed, Nov 1, 2023 at 7:02 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 31/10/2023 18:29, Yang Shi wrote:
> > On Tue, Oct 31, 2023 at 4:55 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 31/10/2023 11:50, Ryan Roberts wrote:
> >>> On 06/10/2023 21:06, David Hildenbrand wrote:
> >>> [...]
> >>>>
> >>>> Change 2: sysfs interface.
> >>>>
> >>>> If we call it THP, it shall go under "/sys/kernel/mm/transparent_hugepage/", I
> >>>> agree.
> >>>>
> >>>> What we expose there and how, is TBD. Again, not a friend of "orders" and
> >>>> bitmaps at all. We can do better if we want to go down that path.
> >>>>
> >>>> Maybe we should take a look at hugetlb, and how they added support for multiple
> >>>> sizes. What *might* make sense could be (depending on which values we actually
> >>>> support!)
> >>>>
> >>>>
> >>>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/
> >>>> /sys/kernel/mm/transparent_hugepage/hugepages-128kB/
> >>>> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/
> >>>> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/
> >>>> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
> >>>> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/
> >>>>
> >>>> Each one would contain an "enabled" and "defrag" file. We want something minimal
> >>>> first? Start with the "enabled" option.
> >>>>
> >>>>
> >>>> enabled: always [global] madvise never
> >>>>
> >>>> Initially, we would set it for PMD-sized THP to "global" and for everything else
> >>>> to "never".
> >>>
> >>> Hi David,
> >>>
> >>> I've just started coding this, and it occurs to me that I might need a small
> >>> clarification here; the existing global "enabled" control is used to drive
> >>> decisions for both anonymous memory and (non-shmem) file-backed memory. But the
> >>> proposed new per-size "enabled" is implicitly only controlling anon memory (for
> >>> now).
> >>>
> >>> 1) Is this potentially confusing for the user? Should we rename the per-size
> >>> controls to "anon_enabled"? Or is it preferable to just keep it vague for now so
> >>> we can reuse the same control for file-backed memory in future?
> >>>
> >>> 2) The global control will continue to drive the file-backed memory decision
> >>> (for now), even when hugepages-2048kB/enabled != "global"; agreed?
> >>>
> >>> Thanks,
> >>> Ryan
> >>>
> >>
> >> Also, an implementation question:
> >>
> >> hugepage_vma_check() doesn't currently care whether enabled="never" for DAX VMAs
> >> (although it does honour MADV_NOHUGEPAGE and the prctl); it will return true
> >> regardless. Is that by design? I couldn't fathom any reasoning from the commit log:
> >
> > The enabled="never" is for anonymous VMAs; DAX VMAs are typically file VMAs.
>
> That's not quite true; enabled="never" is honoured for non-DAX/non-shmem file
> VMAs (for collapse via CONFIG_READ_ONLY_THP_FOR_FS and more recently for

When READ_ONLY_THP_FOR_FS was implemented, file THP could only be
collapsed by khugepaged, and khugepaged is started iff enabled !=
"never", so READ_ONLY_THP_FOR_FS has to honor it. Unfortunately there
are some confusing exceptions... But anyway, DAX is not in the same class.

> anything that implements huge_fault() - see
> 7a81751fcdeb833acc858e59082688e3020bfe12).

IIUC this commit just gives the VMAs which implement huge_fault() a
chance to handle the fault. Currently only DAX VMAs implement
huge_fault() in the vanilla kernel, AFAICT.

>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-09-29 11:44 ` Ryan Roberts
@ 2023-11-13  3:57   ` John Hubbard
  -1 siblings, 0 replies; 140+ messages in thread
From: John Hubbard @ 2023-11-13  3:57 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Anshuman Khandual,
	Yang Shi, Huang, Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, David Rientjes, Vlastimil Babka,
	Hugh Dickins
  Cc: linux-mm, linux-kernel, linux-arm-kernel

On 9/29/23 4:44 AM, Ryan Roberts wrote:
> Hi All,
> 
> This is v6 of a series to implement variable order, large folios for anonymous
> memory. (previously called "ANON_LARGE_FOLIO", "LARGE_ANON_FOLIO",
> "FLEXIBLE_THP", but now exposed as an extension to THP; "small-order THP"). The
> objective of this is to improve performance by allocating larger chunks of
> memory during anonymous page faults:
...
> 
> The major change in this revision is the addition of sysfs controls to allow
> this "small-order THP" to be enabled/disabled/configured independently of
> PMD-order THP. The approach I've taken differs a bit from previous discussions;
> instead of creating a whole new interface ("large_folio"), I'm extending THP. I
> personally think this makes things clearer and more extensible. See [6] for
> detailed rationale.
> 

Hi Ryan and all,

I've done some initial performance testing of this patchset on an arm64
SBSA server. When these patches are combined with the arm64 arch contpte
patches in Ryan's git tree (he has conveniently combined everything
here: [1]), we are seeing a remarkable, consistent speedup of 10.5x on
some memory-intensive workloads. Many test runs, conducted independently
by different engineers and on different machines, have convinced me and
my colleagues that this is an accurate result.

In order to achieve that result, we used the git tree in [1] with
the following settings:

     echo always >/sys/kernel/mm/transparent_hugepage/enabled
     echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders

This was on an aarch64 machine configured to use a 64KB base page size.
That configuration means that the PMD size is 512MB, which is of course
too large for practical use as a pure PMD-THP. However, with these
small-size (less than PMD-sized) THPs, we get the improvements in TLB
coverage, while still getting pages that are small enough to be
effectively usable.
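
As a rough sanity check of that arithmetic (a sketch only; it assumes
8-byte page table entries, as on arm64):

     # a 64K page used as a PTE table holds 64K/8 = 8192 entries, each
     # mapping 64K, so a single PMD entry spans 8192 * 64K = 512M:
     echo $(( (65536 / 8) * 65536 ))    # 536870912 bytes = 512M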

These results are admittedly limited to aarch64 CPUs so far (because the
contpte TLB coalescing behavior plays a big role), but it's nice to see
real performance numbers from real computers.

Up until now, there has been some healthy discussion and debate about
various aspects of this patchset. This data point shows that at least
for some types of memory-intensive workloads (and I apologize for being
vague, at this point, about exactly *which* workloads), the performance
gains are really worth it: ~10x !

[1] https://gitlab.arm.com/linux-arm/linux-rr.git
         (branch: features/granule_perf/anonfolio-v6-contpte-v2)

thanks,

-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-11-13  3:57   ` John Hubbard
@ 2023-11-13  5:18     ` Matthew Wilcox
  -1 siblings, 0 replies; 140+ messages in thread
From: Matthew Wilcox @ 2023-11-13  5:18 UTC (permalink / raw)
  To: John Hubbard
  Cc: Ryan Roberts, Andrew Morton, Yin Fengwei, David Hildenbrand,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, David Rientjes, Vlastimil Babka,
	Hugh Dickins, linux-mm, linux-kernel, linux-arm-kernel

On Sun, Nov 12, 2023 at 10:57:47PM -0500, John Hubbard wrote:
> I've done some initial performance testing of this patchset on an arm64
> SBSA server. When these patches are combined with the arm64 arch contpte
> patches in Ryan's git tree (he has conveniently combined everything
> here: [1]), we are seeing a remarkable, consistent speedup of 10.5x on
> some memory-intensive workloads. Many test runs, conducted independently
> by different engineers and on different machines, have convinced me and
> my colleagues that this is an accurate result.
> 
> In order to achieve that result, we used the git tree in [1] with
> following settings:
> 
>     echo always >/sys/kernel/mm/transparent_hugepage/enabled
>     echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders
> 
> This was on a aarch64 machine configure to use a 64KB base page size.
> That configuration means that the PMD size is 512MB, which is of course
> too large for practical use as a pure PMD-THP. However, with with these
> small-size (less than PMD-sized) THPs, we get the improvements in TLB
> coverage, while still getting pages that are small enough to be
> effectively usable.

That is quite remarkable!

My hope is to abolish the 64kB page size configuration; ie, that instead
of the mixture of page sizes you currently use -- 64k and 1M (right?
order-0 and order-4) -- 4k, 64k and 2MB (order-0, order-4 and order-9)
will provide better performance.

Have you run any experiments with a 4kB page size?

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-11-13  5:18     ` Matthew Wilcox
@ 2023-11-13 10:19       ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-11-13 10:19 UTC (permalink / raw)
  To: Matthew Wilcox, John Hubbard
  Cc: Andrew Morton, Yin Fengwei, David Hildenbrand, Yu Zhao,
	Catalin Marinas, Anshuman Khandual, Yang Shi, Huang, Ying,
	Zi Yan, Luis Chamberlain, Itaru Kitayama, Kirill A. Shutemov,
	David Rientjes, Vlastimil Babka, Hugh Dickins, linux-mm,
	linux-kernel, linux-arm-kernel

On 13/11/2023 05:18, Matthew Wilcox wrote:
> On Sun, Nov 12, 2023 at 10:57:47PM -0500, John Hubbard wrote:
>> I've done some initial performance testing of this patchset on an arm64
>> SBSA server. When these patches are combined with the arm64 arch contpte
>> patches in Ryan's git tree (he has conveniently combined everything
>> here: [1]), we are seeing a remarkable, consistent speedup of 10.5x on
>> some memory-intensive workloads. Many test runs, conducted independently
>> by different engineers and on different machines, have convinced me and
>> my colleagues that this is an accurate result.
>>
>> In order to achieve that result, we used the git tree in [1] with
>> following settings:
>>
>>     echo always >/sys/kernel/mm/transparent_hugepage/enabled
>>     echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders
>>
>> This was on a aarch64 machine configure to use a 64KB base page size.
>> That configuration means that the PMD size is 512MB, which is of course
>> too large for practical use as a pure PMD-THP. However, with with these
>> small-size (less than PMD-sized) THPs, we get the improvements in TLB
>> coverage, while still getting pages that are small enough to be
>> effectively usable.
> 
> That is quite remarkable!

Yes, agreed - thanks for sharing these results! A very nice Monday morning boost!

> 
> My hope is to abolish the 64kB page size configuration.  ie instead of
> using the mixture of page sizes that you currently are -- 64k and
> 1M (right?  Order-0, and order-4)

Not quite; the contpte-size for a 64K page size is 2M/order-5. (and yes, it is
64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that
intuitively you would expect the order to remain constant, but it doesn't).

The "recommend" setting above will actually enable order-3 as well even though
there is no HW benefit to this. So the full set of available memory sizes here is:

64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13
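
As a quick sketch of how those orders fall out (order is just
log2(folio_size / base_page_size); all with the 64K base page used here):

     echo $(( 524288 / 65536 ))      # 512K ->    8 pages -> order-3
     echo $(( 2097152 / 65536 ))     # 2M   ->   32 pages -> order-5
     echo $(( 536870912 / 65536 ))   # 512M -> 8192 pages -> order-13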

> , that 4k, 64k and 2MB (order-0,
> order-4 and order-9) will provide better performance.
> 
> Have you run any experiements with a 4kB page size?

Agree that would be interesting with 64K small-sized THP enabled. And I'd love
to get to a world where we universally deal in variable-sized chunks of memory,
aligned on 4K boundaries.

In my experience though, there are still some performance benefits to 64K base
page vs 4K+contpte; the page tables are more cache efficient for the former case
- 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the
latter. In practice the HW will still only read 8 bytes in the latter but that's
taking up a full cache line vs the former where a single cache line stores 8x
64K entries.
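
To put rough numbers on that (a sketch, assuming 8-byte PTEs and 64-byte
cache lines):

     # page-table bytes needed to map 64K of memory:
     echo $(( (65536 / 65536) * 8 ))   # 64K base pages:   8 bytes (1 PTE)
     echo $(( (65536 / 4096) * 8 ))    # 4K base pages:  128 bytes (16 PTEs)
     # memory spanned by one 64-byte cache line full of PTEs:
     echo $(( (64 / 8) * 65536 ))      # 512K with a 64K base page
     echo $(( (64 / 8) * 4096 ))       # 32K with a 4K base page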

Thanks,
Ryan


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-11-13 10:19       ` Ryan Roberts
@ 2023-11-13 11:52         ` Kefeng Wang
  -1 siblings, 0 replies; 140+ messages in thread
From: Kefeng Wang @ 2023-11-13 11:52 UTC (permalink / raw)
  To: Ryan Roberts, Matthew Wilcox, John Hubbard
  Cc: Andrew Morton, Yin Fengwei, David Hildenbrand, Yu Zhao,
	Catalin Marinas, Anshuman Khandual, Yang Shi, Huang, Ying,
	Zi Yan, Luis Chamberlain, Itaru Kitayama, Kirill A. Shutemov,
	David Rientjes, Vlastimil Babka, Hugh Dickins, linux-mm,
	linux-kernel, linux-arm-kernel



On 2023/11/13 18:19, Ryan Roberts wrote:
> On 13/11/2023 05:18, Matthew Wilcox wrote:
>> On Sun, Nov 12, 2023 at 10:57:47PM -0500, John Hubbard wrote:
>>> I've done some initial performance testing of this patchset on an arm64
>>> SBSA server. When these patches are combined with the arm64 arch contpte
>>> patches in Ryan's git tree (he has conveniently combined everything
>>> here: [1]), we are seeing a remarkable, consistent speedup of 10.5x on
>>> some memory-intensive workloads. Many test runs, conducted independently
>>> by different engineers and on different machines, have convinced me and
>>> my colleagues that this is an accurate result.
>>>
>>> In order to achieve that result, we used the git tree in [1] with
>>> following settings:
>>>
>>>      echo always >/sys/kernel/mm/transparent_hugepage/enabled
>>>      echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders
>>>
>>> This was on a aarch64 machine configure to use a 64KB base page size.
>>> That configuration means that the PMD size is 512MB, which is of course
>>> too large for practical use as a pure PMD-THP. However, with with these
>>> small-size (less than PMD-sized) THPs, we get the improvements in TLB
>>> coverage, while still getting pages that are small enough to be
>>> effectively usable.
>>
>> That is quite remarkable!
> 
> Yes, agreed - thanks for sharing these results! A very nice Monday morning boost!
> 
>>
>> My hope is to abolish the 64kB page size configuration.  ie instead of
>> using the mixture of page sizes that you currently are -- 64k and
>> 1M (right?  Order-0, and order-4)
> 
> Not quite; the contpte-size for a 64K page size is 2M/order-5. (and yes, it is
> 64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that
> intuitively you would expect the order to remain constant, but it doesn't).
> 
> The "recommend" setting above will actually enable order-3 as well even though
> there is no HW benefit to this. So the full set of available memory sizes here is:
> 
> 64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13
> 
>> , that 4k, 64k and 2MB (order-0,
>> order-4 and order-9) will provide better performance.
>>
>> Have you run any experiements with a 4kB page size?
> 
> Agree that would be interesting with 64K small-sized THP enabled. And I'd love
> to get to a world were we universally deal in variable sized chunks of memory,
> aligned on 4K boundaries.
> 
> In my experience though, there are still some performance benefits to 64K base
> page vs 4K+contpte; the page tables are more cache efficient for the former case
> - 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the
> latter. In practice the HW will still only read 8 bytes in the latter but that's
> taking up a full cache line vs the former where a single cache line stores 8x
> 64K entries.

We tested some benchmarks, eg unixbench, lmbench, sysbench, with v5 on an
arm64 board (for better evaluation of anon large folios we used ext4,
which doesn't support large folios for now); we will test again and send
the results once v7 is out.

1) base page 4k  + without anon large folio
2) base page 64k + without anon large folio
3) base page 4k  + with anon large folio + cont-pte(order = 4,0)

Most of the test results from v5 show that 3) has a good improvement
vs 1), but is still lower than 2); also, for some latency-sensitive
benchmarks, 2) and 3) may have poorer performance than 1).

Note: pcp_allowed_order requires order <= PAGE_ALLOC_COSTLY_ORDER=3; for
3), we may enlarge it for better scalability of page allocation on arm64.
This was not tested on v5; we will try enlarging it for v7.
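
For reference only: a hypothetical mapping of configuration 3) onto the
v6 runtime interface discussed above (the v5 series actually tested used
its compile-time switch instead) would be something like:

     echo 0x10 >/sys/kernel/mm/transparent_hugepage/anon_orders   # bit 4 -> order-4 only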

> 
> Thanks,
> Ryan
> 
> 
> 

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-11-13 11:52         ` Kefeng Wang
@ 2023-11-13 12:12           ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-11-13 12:12 UTC (permalink / raw)
  To: Kefeng Wang, Matthew Wilcox, John Hubbard
  Cc: Andrew Morton, Yin Fengwei, David Hildenbrand, Yu Zhao,
	Catalin Marinas, Anshuman Khandual, Yang Shi, Huang, Ying,
	Zi Yan, Luis Chamberlain, Itaru Kitayama, Kirill A. Shutemov,
	David Rientjes, Vlastimil Babka, Hugh Dickins, linux-mm,
	linux-kernel, linux-arm-kernel

On 13/11/2023 11:52, Kefeng Wang wrote:
> 
> 
> On 2023/11/13 18:19, Ryan Roberts wrote:
>> On 13/11/2023 05:18, Matthew Wilcox wrote:
>>> On Sun, Nov 12, 2023 at 10:57:47PM -0500, John Hubbard wrote:
>>>> I've done some initial performance testing of this patchset on an arm64
>>>> SBSA server. When these patches are combined with the arm64 arch contpte
>>>> patches in Ryan's git tree (he has conveniently combined everything
>>>> here: [1]), we are seeing a remarkable, consistent speedup of 10.5x on
>>>> some memory-intensive workloads. Many test runs, conducted independently
>>>> by different engineers and on different machines, have convinced me and
>>>> my colleagues that this is an accurate result.
>>>>
>>>> In order to achieve that result, we used the git tree in [1] with
>>>> following settings:
>>>>
>>>>      echo always >/sys/kernel/mm/transparent_hugepage/enabled
>>>>      echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders
>>>>
>>>> This was on a aarch64 machine configure to use a 64KB base page size.
>>>> That configuration means that the PMD size is 512MB, which is of course
>>>> too large for practical use as a pure PMD-THP. However, with with these
>>>> small-size (less than PMD-sized) THPs, we get the improvements in TLB
>>>> coverage, while still getting pages that are small enough to be
>>>> effectively usable.
>>>
>>> That is quite remarkable!
>>
>> Yes, agreed - thanks for sharing these results! A very nice Monday morning boost!
>>
>>>
>>> My hope is to abolish the 64kB page size configuration.  ie instead of
>>> using the mixture of page sizes that you currently are -- 64k and
>>> 1M (right?  Order-0, and order-4)
>>
>> Not quite; the contpte-size for a 64K page size is 2M/order-5. (and yes, it is
>> 64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that
>> intuitively you would expect the order to remain constant, but it doesn't).
>>
>> The "recommend" setting above will actually enable order-3 as well even though
>> there is no HW benefit to this. So the full set of available memory sizes here
>> is:
>>
>> 64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13
>>
>>> , that 4k, 64k and 2MB (order-0,
>>> order-4 and order-9) will provide better performance.
>>>
>>> Have you run any experiements with a 4kB page size?
>>
>> Agree that would be interesting with 64K small-sized THP enabled. And I'd love
>> to get to a world were we universally deal in variable sized chunks of memory,
>> aligned on 4K boundaries.
>>
>> In my experience though, there are still some performance benefits to 64K base
>> page vs 4K+contpte; the page tables are more cache efficient for the former case
>> - 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the
>> latter. In practice the HW will still only read 8 bytes in the latter but that's
>> taking up a full cache line vs the former where a single cache line stores 8x
>> 64K entries.
> 
> We test some benchmark, eg, unixbench, lmbench, sysbench, with v5 on
> arm64 board(for better evaluation of anon large folio, using ext4,
> which don't support large folio for now), will test again and send
> the results once v7 out.

Thanks for the testing and for posting the insights!

> 
> 1) base page 4k  + without anon large folio
> 2) base page 64k + without anon large folio
> 3) base page 4k  + with anon large folio + cont-pte(order = 4,0)
> 
> Most of the test results from v5 show the 3) have a good improvement
> vs 1), but still low than 2) 

Do you have any understanding of what the shortfall is for these particular
workloads? Certainly the cache spatial locality benefit of the 64K page tables
could be a factor. But for the workloads I've been looking at, a
bigger factor is often the fact that executable file-backed memory (elf
segments) is not in 64K folios and therefore not contpte-mapped. If the iTLB is
under pressure this can help a lot. I have a change (hack) to force all
executable mappings to be read-ahead into 64K folios and this gives an
improvement. But obviously that only works when the file system supports large
folios (so not ext4 right now). It would certainly be interesting to see just
how close to native 64K we can get when employing these extra ideas.

>, also for some latency-sensitive
> benchmark, 2) and 3) maybe have poor performance vs 1).
> 
> Note, for pcp_allowed_order, order <= PAGE_ALLOC_COSTLY_ORDER=3, for
> 3), we maybe enlarge it for better scalability when page allocation
> on arm64, not test on v5, will try to enlarge it on v7.

Yes interesting! I'm hoping to post v7 this week - just waiting for mm-unstable
to be rebased on v6.7-rc1. I'd be interested to see your results.

> 
>>
>> Thanks,
>> Ryan
>>
>>
>>


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-11-13 10:19       ` Ryan Roberts
@ 2023-11-13 14:52         ` John Hubbard
  -1 siblings, 0 replies; 140+ messages in thread
From: John Hubbard @ 2023-11-13 14:52 UTC (permalink / raw)
  To: Ryan Roberts, Matthew Wilcox
  Cc: Andrew Morton, Yin Fengwei, David Hildenbrand, Yu Zhao,
	Catalin Marinas, Anshuman Khandual, Yang Shi, Huang, Ying,
	Zi Yan, Luis Chamberlain, Itaru Kitayama, Kirill A. Shutemov,
	David Rientjes, Vlastimil Babka, Hugh Dickins, linux-mm,
	linux-kernel, linux-arm-kernel

On 11/13/23 2:19 AM, Ryan Roberts wrote:
> On 13/11/2023 05:18, Matthew Wilcox wrote:
>> On Sun, Nov 12, 2023 at 10:57:47PM -0500, John Hubbard wrote:
>>> I've done some initial performance testing of this patchset on an arm64
>>> SBSA server. When these patches are combined with the arm64 arch contpte
>>> patches in Ryan's git tree (he has conveniently combined everything
>>> here: [1]), we are seeing a remarkable, consistent speedup of 10.5x on
>>> some memory-intensive workloads. Many test runs, conducted independently
>>> by different engineers and on different machines, have convinced me and
>>> my colleagues that this is an accurate result.
>>>
>>> In order to achieve that result, we used the git tree in [1] with
>>> following settings:
>>>
>>>      echo always >/sys/kernel/mm/transparent_hugepage/enabled
>>>      echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders
>>>
>>> This was on a aarch64 machine configure to use a 64KB base page size.
>>> That configuration means that the PMD size is 512MB, which is of course
>>> too large for practical use as a pure PMD-THP. However, with with these
>>> small-size (less than PMD-sized) THPs, we get the improvements in TLB
>>> coverage, while still getting pages that are small enough to be
>>> effectively usable.
>>
>> That is quite remarkable!
> 
> Yes, agreed - thanks for sharing these results! A very nice Monday morning boost!
> 
>>
>> My hope is to abolish the 64kB page size configuration.  ie instead of

We've found that a 64KB base page size provides better performance for
HPC and AI workloads than a 4KB base size, at least for these kinds of
servers. In fact, the 4KB config is considered odd and I'd have to
look around to get one. It's mostly a TLB coverage issue because,
again, the problem typically has a very large memory footprint.

So even though it would be nice from a software point of view, there's
a real need for this.

>> using the mixture of page sizes that you currently are -- 64k and
>> 1M (right?  Order-0, and order-4)
> 
> Not quite; the contpte-size for a 64K page size is 2M/order-5. (and yes, it is
> 64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that
> intuitively you would expect the order to remain constant, but it doesn't).
> 
> The "recommend" setting above will actually enable order-3 as well even though
> there is no HW benefit to this. So the full set of available memory sizes here is:
> 
> 64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13

Yes, and to provide some further details about the test runs, I went
so far as to test individual anon_orders (for example, 
anon_orders=0x20), in order to isolate behavior and see what's really
going on.

On this hardware, anything with 2MB page sizes (which corresponds to
anon_orders=0x20, as I recall) or larger gets the 10x boost. It's
an interesting on/off behavior. This particular server design and
workload combination really prefers 2MB pages, even if they are
held together with contpte instead of a real PMD entry.
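
Roughly, the single-order runs were driven through the same sysfs file
as the settings above (a sketch; 0x20 sets only bit 5, i.e. order-5,
which is 2M with a 64K base page):

     echo 0x20 >/sys/kernel/mm/transparent_hugepage/anon_orders
     cat /sys/kernel/mm/transparent_hugepage/anon_orders    # confirm the bitmap in effect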

> 
>> , that 4k, 64k and 2MB (order-0,
>> order-4 and order-9) will provide better performance.
>>
>> Have you run any experiements with a 4kB page size?
> 
> Agree that would be interesting with 64K small-sized THP enabled. And I'd love
> to get to a world were we universally deal in variable sized chunks of memory,
> aligned on 4K boundaries.
> 
> In my experience though, there are still some performance benefits to 64K base
> page vs 4K+contpte; the page tables are more cache efficient for the former case
> - 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the
> latter. In practice the HW will still only read 8 bytes in the latter but that's
> taking up a full cache line vs the former where a single cache line stores 8x
> 64K entries.
>
> Thanks,
> Ryan
> 

thanks,

-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-11-13 12:12           ` Ryan Roberts
@ 2023-11-13 14:52             ` Kefeng Wang
  -1 siblings, 0 replies; 140+ messages in thread
From: Kefeng Wang @ 2023-11-13 14:52 UTC (permalink / raw)
  To: Ryan Roberts, Matthew Wilcox, John Hubbard
  Cc: Andrew Morton, Yin Fengwei, David Hildenbrand, Yu Zhao,
	Catalin Marinas, Anshuman Khandual, Yang Shi, Huang, Ying,
	Zi Yan, Luis Chamberlain, Itaru Kitayama, Kirill A. Shutemov,
	David Rientjes, Vlastimil Babka, Hugh Dickins, linux-mm,
	linux-kernel, linux-arm-kernel



On 2023/11/13 20:12, Ryan Roberts wrote:
> On 13/11/2023 11:52, Kefeng Wang wrote:
>>
>>
>> On 2023/11/13 18:19, Ryan Roberts wrote:
>>> On 13/11/2023 05:18, Matthew Wilcox wrote:
>>>> On Sun, Nov 12, 2023 at 10:57:47PM -0500, John Hubbard wrote:
>>>>> I've done some initial performance testing of this patchset on an arm64
>>>>> SBSA server. When these patches are combined with the arm64 arch contpte
>>>>> patches in Ryan's git tree (he has conveniently combined everything
>>>>> here: [1]), we are seeing a remarkable, consistent speedup of 10.5x on
>>>>> some memory-intensive workloads. Many test runs, conducted independently
>>>>> by different engineers and on different machines, have convinced me and
>>>>> my colleagues that this is an accurate result.
>>>>>
>>>>> In order to achieve that result, we used the git tree in [1] with
>>>>> following settings:
>>>>>
>>>>>       echo always >/sys/kernel/mm/transparent_hugepage/enabled
>>>>>       echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders
>>>>>
>>>>> This was on a aarch64 machine configure to use a 64KB base page size.
>>>>> That configuration means that the PMD size is 512MB, which is of course
>>>>> too large for practical use as a pure PMD-THP. However, with with these
>>>>> small-size (less than PMD-sized) THPs, we get the improvements in TLB
>>>>> coverage, while still getting pages that are small enough to be
>>>>> effectively usable.
>>>>
>>>> That is quite remarkable!
>>>
>>> Yes, agreed - thanks for sharing these results! A very nice Monday morning boost!
>>>
>>>>
>>>> My hope is to abolish the 64kB page size configuration.  ie instead of
>>>> using the mixture of page sizes that you currently are -- 64k and
>>>> 1M (right?  Order-0, and order-4)
>>>
>>> Not quite; the contpte-size for a 64K page size is 2M/order-5. (and yes, it is
>>> 64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that
>>> intuitively you would expect the order to remain constant, but it doesn't).
>>>
>>> The "recommend" setting above will actually enable order-3 as well even though
>>> there is no HW benefit to this. So the full set of available memory sizes here
>>> is:
>>>
>>> 64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13
>>>
>>>> , that 4k, 64k and 2MB (order-0,
>>>> order-4 and order-9) will provide better performance.
>>>>
>>>> Have you run any experiements with a 4kB page size?
>>>
>>> Agree that would be interesting with 64K small-sized THP enabled. And I'd love
>>> to get to a world were we universally deal in variable sized chunks of memory,
>>> aligned on 4K boundaries.
>>>
>>> In my experience though, there are still some performance benefits to 64K base
>>> page vs 4K+contpte; the page tables are more cache efficient for the former case
>>> - 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the
>>> latter. In practice the HW will still only read 8 bytes in the latter but that's
>>> taking up a full cache line vs the former where a single cache line stores 8x
>>> 64K entries.
>>
>> We test some benchmark, eg, unixbench, lmbench, sysbench, with v5 on
>> arm64 board(for better evaluation of anon large folio, using ext4,
>> which don't support large folio for now), will test again and send
>> the results once v7 out.
> 
> Thanks for the testing and for posting the insights!
> 
>>
>> 1) base page 4k  + without anon large folio
>> 2) base page 64k + without anon large folio
>> 3) base page 4k  + with anon large folio + cont-pte(order = 4,0)
>>
>> Most of the test results from v5 show the 3) have a good improvement
>> vs 1), but still low than 2)
> 
> Do you have any understanding what the shortfall is for these particular
> workloads? Certainly the cache spatial locality benefit of the 64K page tables
> could be a factor. But certainly for the workloads I've been looking at, a
> bigger factor is often the fact that executable file-backed memory (elf
> segments) are not in 64K folios and therefore not contpte-mapped. If the iTLB is
> under pressure this can help a lot. I have a change (hack) to force all
> executable mappings to be read-ahead into 64K folios and this gives an
> improvement. But obviously that only works when the file system supports large
> folios (so not ext4 right now). It would certainly be interesting to see just
> how close to native 64K we can get when employing these extra ideas.

No detailed analysis, but with a 64k base page:
  fewer page faults
  fewer TLB operations
  less zone-lock contention (pcp)
  less buddy split/merge
  no reclaim/compaction when allocating a 64k page, and no fallback logic
  exec memory already in 64k folios
  faster page table operations?
  ...

> 
>> , also for some latency-sensitive
>> benchmark, 2) and 3) maybe have poor performance vs 1).
>>
>> Note: for pcp_allowed_order(), order <= PAGE_ALLOC_COSTLY_ORDER=3. For
>> 3), we may enlarge it for better page-allocation scalability on arm64;
>> this was not tested on v5, we will try enlarging it in v7.
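
(For context, a rough paraphrase of the check being referred to; the real
pcp_allowed_order() lives in mm/page_alloc.c and its exact shape may differ.
The point is that an order-4 (64K-on-4K) folio is above
PAGE_ALLOC_COSTLY_ORDER and so bypasses the per-cpu page lists unless the
limit is raised.)

    #include <stdbool.h>
    #include <stdio.h>

    #define PAGE_ALLOC_COSTLY_ORDER 3
    #define PAGEBLOCK_ORDER         9   /* assumed PMD order for 4K pages */

    static bool pcp_allowed_order(unsigned int order)
    {
        if (order <= PAGE_ALLOC_COSTLY_ORDER)
            return true;
        if (order == PAGEBLOCK_ORDER)   /* THP/pageblock order */
            return true;
        return false;
    }

    int main(void)
    {
        for (unsigned int order = 0; order <= 9; order++)
            printf("order-%u: pcp fast path %s\n", order,
                   pcp_allowed_order(order) ? "yes" : "no");
        return 0;
    }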
> 
> Yes interesting! I'm hoping to post v7 this week - just waiting for mm-unstable
> to be rebased on v6.7-rc1. I'd be interested to see your results.
> 
Glad to see it.
>>
>>>
>>> Thanks,
>>> Ryan
>>>
>>>
>>>
> 
> 

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-11-13 10:19       ` Ryan Roberts
@ 2023-11-13 15:04         ` Matthew Wilcox
  -1 siblings, 0 replies; 140+ messages in thread
From: Matthew Wilcox @ 2023-11-13 15:04 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: John Hubbard, Andrew Morton, Yin Fengwei, David Hildenbrand,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, David Rientjes, Vlastimil Babka,
	Hugh Dickins, linux-mm, linux-kernel, linux-arm-kernel

On Mon, Nov 13, 2023 at 10:19:48AM +0000, Ryan Roberts wrote:
> On 13/11/2023 05:18, Matthew Wilcox wrote:
> > My hope is to abolish the 64kB page size configuration.  ie instead of
> > using the mixture of page sizes that you currently are -- 64k and
> > 1M (right?  Order-0, and order-4)
> 
> Not quite; the contpte-size for a 64K page size is 2M/order-5. (and yes, it is
> 64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that
> intuitively you would expect the order to remain constant, but it doesn't).
> 
> The "recommend" setting above will actually enable order-3 as well even though
> there is no HW benefit to this. So the full set of available memory sizes here is:
> 
> 64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13
> 
> > , that 4k, 64k and 2MB (order-0,
> > order-4 and order-9) will provide better performance.
> > 
> > Have you run any experiments with a 4kB page size?
> 
> Agree that would be interesting with 64K small-sized THP enabled. And I'd love
> to get to a world where we universally deal in variable-sized chunks of memory,
> aligned on 4K boundaries.
> 
> In my experience though, there are still some performance benefits to 64K base
> page vs 4K+contpte; the page tables are more cache efficient for the former case
> - 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the
> latter. In practice the HW will still only read 8 bytes in the latter but that's
> taking up a full cache line vs the former where a single cache line stores 8x
> 64K entries.

This is going to depend on your workload though -- if you're using more
2MB than 64kB, you get to elide a layer of page table with 4k base,
rather than taking up 4 cache lines with a 64k base.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-11-13 15:04         ` Matthew Wilcox
@ 2023-11-14 10:57           ` Ryan Roberts
  -1 siblings, 0 replies; 140+ messages in thread
From: Ryan Roberts @ 2023-11-14 10:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: John Hubbard, Andrew Morton, Yin Fengwei, David Hildenbrand,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, David Rientjes, Vlastimil Babka,
	Hugh Dickins, linux-mm, linux-kernel, linux-arm-kernel

On 13/11/2023 15:04, Matthew Wilcox wrote:
> On Mon, Nov 13, 2023 at 10:19:48AM +0000, Ryan Roberts wrote:
>> On 13/11/2023 05:18, Matthew Wilcox wrote:
>>> My hope is to abolish the 64kB page size configuration.  ie instead of
>>> using the mixture of page sizes that you currently are -- 64k and
>>> 1M (right?  Order-0, and order-4)
>>
>> Not quite; the contpte-size for a 64K page size is 2M/order-5. (and yes, it is
>> 64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that
>> intuitively you would expect the order to remain constant, but it doesn't).
>>
>> The "recommend" setting above will actually enable order-3 as well even though
>> there is no HW benefit to this. So the full set of available memory sizes here is:
>>
>> 64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13
>>
>>> , that 4k, 64k and 2MB (order-0,
>>> order-4 and order-9) will provide better performance.
>>>
>>> Have you run any experiments with a 4kB page size?
>>
>> Agree that would be interesting with 64K small-sized THP enabled. And I'd love
>> to get to a world where we universally deal in variable-sized chunks of memory,
>> aligned on 4K boundaries.
>>
>> In my experience though, there are still some performance benefits to 64K base
>> page vs 4K+contpte; the page tables are more cache efficient for the former case
>> - 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the
>> latter. In practice the HW will still only read 8 bytes in the latter but that's
>> taking up a full cache line vs the former where a single cache line stores 8x
>> 64K entries.
> 
> This is going to depend on your workload though -- if you're using more
> 2MB than 64kB, you get to elide a layer of page table with 4k base,
> rather than taking up 4 cache lines with a 64k base.

True, but again depending on workload/config, you may have fewer levels of lookup
for the 64K native case in the first place because you consume more VA bits at
each level.


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory
  2023-11-14 10:57           ` Ryan Roberts
@ 2023-12-05 16:05             ` Matthew Wilcox
  -1 siblings, 0 replies; 140+ messages in thread
From: Matthew Wilcox @ 2023-12-05 16:05 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: John Hubbard, Andrew Morton, Yin Fengwei, David Hildenbrand,
	Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Huang,
	Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
	Kirill A. Shutemov, David Rientjes, Vlastimil Babka,
	Hugh Dickins, linux-mm, linux-kernel, linux-arm-kernel

On Tue, Nov 14, 2023 at 10:57:07AM +0000, Ryan Roberts wrote:
> On 13/11/2023 15:04, Matthew Wilcox wrote:
> > On Mon, Nov 13, 2023 at 10:19:48AM +0000, Ryan Roberts wrote:
> >> On 13/11/2023 05:18, Matthew Wilcox wrote:
> >>> My hope is to abolish the 64kB page size configuration.  ie instead of
> >>> using the mixture of page sizes that you currently are -- 64k and
> >>> 1M (right?  Order-0, and order-4)
> >>
> >> Not quite; the contpte-size for a 64K page size is 2M/order-5. (and yes, it is
> >> 64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that
> >> intuitively you would expect the order to remain constant, but it doesn't).
> >>
> >> The "recommend" setting above will actually enable order-3 as well even though
> >> there is no HW benefit to this. So the full set of available memory sizes here is:
> >>
> >> 64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13
> >>
> >>> , that 4k, 64k and 2MB (order-0,
> >>> order-4 and order-9) will provide better performance.
> >>>
> >>> Have you run any experiments with a 4kB page size?
> >>
> >> Agree that would be interesting with 64K small-sized THP enabled. And I'd love
> >> to get to a world where we universally deal in variable-sized chunks of memory,
> >> aligned on 4K boundaries.
> >>
> >> In my experience though, there are still some performance benefits to 64K base
> >> page vs 4K+contpte; the page tables are more cache efficient for the former case
> >> - 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the
> >> latter. In practice the HW will still only read 8 bytes in the latter but that's
> >> taking up a full cache line vs the former where a single cache line stores 8x
> >> 64K entries.
> > 
> > This is going to depend on your workload though -- if you're using more
> > 2MB than 64kB, you get to elide a layer of page table with 4k base,
> > rather than taking up 4 cache lines with a 64k base.
> 
> True, but again depending on workload/config, you may have fewer levels of lookup
> for the 64K native case in the first place because you consume more VA bits at
> each level.

Sorry, missed this email ... let's work it through.

With 4k, and a 48-bit VA space, we get 12 bits at the lowest level, then
9 bits each layer, so 4 * 9 + 12 = 48.  With a 2MB allocation, we
eliminate the bottom layer and examine three cachelines to find it (PGD
entry, PUD entry, PMD entry)

With 64k, we get 16 bits at the lowest level, then 13 bits each layer,
so 3 * 13 + 16 = 55.  With a 2MB allocation, we can't eliminate the
bottom layer, so we still have to examine three cachelines to find it
(PGD, PMD, PTE).  If you can fit into a 42-bit address space, you can
reduce it by one cache miss, but my impression is that applications
which use 21 bits of address space for a single allocation want more
address space than your average application.
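
(A standalone sketch of the same arithmetic, assuming 8-byte table entries
throughout; not kernel code.)

    #include <stdio.h>

    static void walk(unsigned long page_size, int levels)
    {
        int offset_bits = __builtin_ctzl(page_size);
        int bits_per_level = offset_bits - 3;        /* 8-byte table entries */

        printf("%luK pages, %d levels: %d + %d*%d = %d VA bits\n",
               page_size / 1024, levels, offset_bits,
               levels, bits_per_level, offset_bits + levels * bits_per_level);
    }

    int main(void)
    {
        walk(4 * 1024, 4);    /* 12 + 4*9  = 48 */
        walk(64 * 1024, 3);   /* 16 + 3*13 = 55 */
        walk(64 * 1024, 2);   /* 16 + 2*13 = 42: the reduced-VA case above */
        return 0;
    }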

^ permalink raw reply	[flat|nested] 140+ messages in thread

end of thread, other threads:[~2023-12-05 16:06 UTC | newest]

Thread overview: 140+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-29 11:44 [PATCH v6 0/9] variable-order, large folios for anonymous memory Ryan Roberts
2023-09-29 11:44 ` Ryan Roberts
2023-09-29 11:44 ` [PATCH v6 1/9] mm: Allow deferred splitting of arbitrary anon large folios Ryan Roberts
2023-09-29 11:44   ` Ryan Roberts
2023-10-05  8:19   ` David Hildenbrand
2023-10-05  8:19     ` David Hildenbrand
2023-09-29 11:44 ` [PATCH v6 2/9] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap() Ryan Roberts
2023-09-29 11:44   ` Ryan Roberts
2023-09-29 13:45   ` Kirill A. Shutemov
2023-09-29 13:45     ` Kirill A. Shutemov
2023-09-29 14:39     ` Ryan Roberts
2023-09-29 14:39       ` Ryan Roberts
2023-09-29 11:44 ` [PATCH v6 3/9] mm: thp: Account pte-mapped anonymous THP usage Ryan Roberts
2023-09-29 11:44   ` Ryan Roberts
2023-09-29 11:44 ` [PATCH v6 4/9] mm: thp: Introduce anon_orders and anon_always_mask sysfs files Ryan Roberts
2023-09-29 11:44   ` Ryan Roberts
2023-09-29 22:55   ` Andrew Morton
2023-09-29 22:55     ` Andrew Morton
2023-09-29 22:55     ` Andrew Morton
2023-10-02 10:15     ` Ryan Roberts
2023-10-02 10:15       ` Ryan Roberts
2023-10-02 10:15       ` Ryan Roberts
2023-10-07 22:54     ` Michael Ellerman
2023-10-07 22:54       ` Michael Ellerman
2023-10-07 22:54       ` Michael Ellerman
2023-10-10  0:20       ` Andrew Morton
2023-10-10  0:20         ` Andrew Morton
2023-10-10  0:20         ` Andrew Morton
2023-10-12  9:31         ` David Hildenbrand
2023-10-12  9:31           ` David Hildenbrand
2023-10-12  9:31           ` David Hildenbrand
2023-10-12 11:07         ` Michael Ellerman
2023-10-12 11:07           ` Michael Ellerman
2023-10-12 11:07           ` Michael Ellerman
2023-10-11  6:02   ` kernel test robot
2023-10-11  6:02     ` kernel test robot
2023-09-29 11:44 ` [PATCH v6 5/9] mm: thp: Extend THP to allocate anonymous large folios Ryan Roberts
2023-09-29 11:44   ` Ryan Roberts
     [not found]   ` <CGME20231005120507eucas1p13f50fa99f52808818840ee7db194e12e@eucas1p1.samsung.com>
2023-10-05 12:05     ` Daniel Gomez
2023-10-05 12:05       ` Daniel Gomez
2023-10-05 12:49       ` Ryan Roberts
2023-10-05 12:49         ` Ryan Roberts
2023-10-05 14:59         ` Daniel Gomez
2023-10-05 14:59           ` Daniel Gomez
2023-10-27 23:04   ` John Hubbard
2023-10-27 23:04     ` John Hubbard
2023-10-30 11:43     ` Ryan Roberts
2023-10-30 11:43       ` Ryan Roberts
2023-10-30 23:25       ` John Hubbard
2023-10-30 23:25         ` John Hubbard
2023-11-01 13:56         ` Ryan Roberts
2023-11-01 13:56           ` Ryan Roberts
2023-09-29 11:44 ` [PATCH v6 6/9] mm: thp: Add "recommend" option for anon_orders Ryan Roberts
2023-09-29 11:44   ` Ryan Roberts
2023-10-06 20:08   ` David Hildenbrand
2023-10-06 20:08     ` David Hildenbrand
2023-10-06 22:28     ` Yu Zhao
2023-10-06 22:28       ` Yu Zhao
2023-10-09 11:45       ` Ryan Roberts
2023-10-09 11:45         ` Ryan Roberts
2023-10-09 14:43         ` David Hildenbrand
2023-10-09 14:43           ` David Hildenbrand
2023-10-09 20:04         ` Yu Zhao
2023-10-09 20:04           ` Yu Zhao
2023-10-10 10:16           ` Ryan Roberts
2023-10-10 10:16             ` Ryan Roberts
2023-09-29 11:44 ` [PATCH v6 7/9] arm64/mm: Override arch_wants_pte_order() Ryan Roberts
2023-09-29 11:44   ` Ryan Roberts
2023-10-02 15:21   ` Catalin Marinas
2023-10-02 15:21     ` Catalin Marinas
2023-10-03  7:32     ` Ryan Roberts
2023-10-03  7:32       ` Ryan Roberts
2023-10-03 12:05       ` Catalin Marinas
2023-10-03 12:05         ` Catalin Marinas
2023-09-29 11:44 ` [PATCH v6 8/9] selftests/mm/cow: Generalize do_run_with_thp() helper Ryan Roberts
2023-09-29 11:44   ` Ryan Roberts
2023-09-29 11:44 ` [PATCH v6 9/9] selftests/mm/cow: Add tests for small-order anon THP Ryan Roberts
2023-09-29 11:44   ` Ryan Roberts
2023-10-06 20:06 ` [PATCH v6 0/9] variable-order, large folios for anonymous memory David Hildenbrand
2023-10-06 20:06   ` David Hildenbrand
2023-10-09 11:28   ` Ryan Roberts
2023-10-09 11:28     ` Ryan Roberts
2023-10-09 16:22     ` David Hildenbrand
2023-10-09 16:22       ` David Hildenbrand
2023-10-10 10:47       ` Ryan Roberts
2023-10-10 10:47         ` Ryan Roberts
2023-10-13 20:14         ` David Hildenbrand
2023-10-13 20:14           ` David Hildenbrand
2023-10-20 12:33   ` Ryan Roberts
2023-10-20 12:33     ` Ryan Roberts
2023-10-25 16:24     ` Ryan Roberts
2023-10-25 16:24       ` Ryan Roberts
2023-10-25 18:47       ` David Hildenbrand
2023-10-25 18:47         ` David Hildenbrand
2023-10-25 19:11         ` Yu Zhao
2023-10-25 19:11           ` Yu Zhao
2023-10-26  9:53           ` Ryan Roberts
2023-10-26  9:53             ` Ryan Roberts
2023-10-26 15:19             ` David Hildenbrand
2023-10-26 15:19               ` David Hildenbrand
2023-10-25 19:10       ` John Hubbard
2023-10-25 19:10         ` John Hubbard
2023-10-31 11:50   ` Ryan Roberts
2023-10-31 11:50     ` Ryan Roberts
2023-10-31 11:55     ` Ryan Roberts
2023-10-31 11:55       ` Ryan Roberts
2023-10-31 12:03       ` David Hildenbrand
2023-10-31 12:03         ` David Hildenbrand
2023-10-31 13:13         ` Ryan Roberts
2023-10-31 13:13           ` Ryan Roberts
2023-10-31 18:29       ` Yang Shi
2023-10-31 18:29         ` Yang Shi
2023-11-01 14:02         ` Ryan Roberts
2023-11-01 14:02           ` Ryan Roberts
2023-11-01 18:11           ` Yang Shi
2023-11-01 18:11             ` Yang Shi
2023-10-31 11:58     ` David Hildenbrand
2023-10-31 11:58       ` David Hildenbrand
2023-10-31 13:12       ` Ryan Roberts
2023-10-31 13:12         ` Ryan Roberts
2023-11-13  3:57 ` John Hubbard
2023-11-13  3:57   ` John Hubbard
2023-11-13  5:18   ` Matthew Wilcox
2023-11-13  5:18     ` Matthew Wilcox
2023-11-13 10:19     ` Ryan Roberts
2023-11-13 10:19       ` Ryan Roberts
2023-11-13 11:52       ` Kefeng Wang
2023-11-13 11:52         ` Kefeng Wang
2023-11-13 12:12         ` Ryan Roberts
2023-11-13 12:12           ` Ryan Roberts
2023-11-13 14:52           ` Kefeng Wang
2023-11-13 14:52             ` Kefeng Wang
2023-11-13 14:52       ` John Hubbard
2023-11-13 14:52         ` John Hubbard
2023-11-13 15:04       ` Matthew Wilcox
2023-11-13 15:04         ` Matthew Wilcox
2023-11-14 10:57         ` Ryan Roberts
2023-11-14 10:57           ` Ryan Roberts
2023-12-05 16:05           ` Matthew Wilcox
2023-12-05 16:05             ` Matthew Wilcox

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.