* [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
@ 2020-09-28 17:53 Zi Yan
From: Zi Yan @ 2020-09-28 17:53 UTC
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

Hi all,

This patchset adds support for 1GB PUD THP on x86_64. It is on top of
v5.9-rc5-mmots-2020-09-18-21-23. It is also available at:
https://github.com/x-y-z/linux-1gb-thp/tree/1gb_thp_v5.9-rc5-mmots-2020-09-18-21-23

Other than PUD THP, we had some discussion on generating THPs and contiguous
physical memory via a synchronous system call [0]. I am planning to send out a
separate patchset on it later, since I feel that it can be done independently of
PUD THP support.

Any comment or suggestion is welcome. Thanks.

Motivation
====
This patchset aims to boost virtual memory performance with gigantic TLB
entries in a way that is more transparent than hugetlbfs pages [1,2]. Roman
also said he would provide performance numbers for 1GB PUD THP once the
patchset is in relatively good shape [1].


Patchset organization:
====

1. Patch 1 and 2: Jason's PUD entry READ_ONCE patch to walk_page_range to give
   a consistent read of PUD entries during lockless page table walks.
   I also add a PMD entry READ_ONCE patch, since the PMD level of
   walk_page_range has the same lockless behavior as the PUD level.
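
   To illustrate why a single snapshot matters, here is a minimal userspace
   sketch, not the patch itself. The READ_ONCE() macro below is a simplified
   stand-in for the kernel one, and pud_t, the bit values, and classify_pud()
   are hypothetical names made up for the example. The point is that every
   check is made against one local copy, so a concurrent update cannot make
   the two tests disagree with each other.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified userspace stand-in for the kernel's READ_ONCE(): one load only. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical, simplified entry type and bits; not the real x86 layout. */
typedef struct { uint64_t val; } pud_t;
#define PUD_PRESENT 0x1ULL
#define PUD_HUGE    0x80ULL

enum pud_kind { PUD_KIND_NONE, PUD_KIND_LEAF, PUD_KIND_TABLE };

/* Classify an entry that another thread may be changing concurrently. */
static enum pud_kind classify_pud(pud_t *pudp)
{
	/* Take one snapshot; all checks below use the same value. */
	uint64_t pud = READ_ONCE(pudp->val);

	assert(pudp != 0);
	if (!(pud & PUD_PRESENT))
		return PUD_KIND_NONE;
	if (pud & PUD_HUGE)
		return PUD_KIND_LEAF;   /* maps a huge page directly */
	return PUD_KIND_TABLE;          /* points to a lower-level table */
}
```

   Reading pudp->val twice instead (once per test) would allow the entry to
   change between the tests during a lockless walk.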

2. Patch 3: THP page table deposit now uses a singly linked list to enable
   hierarchical page table deposit, i.e., depositing a PMD page that itself has
   512 PTE pages deposited to it. Every page table page has a deposit_head and
   a deposit_node.
   For example, when storing 512 PTE pages to a PMD page, PMD page's deposit_head
   links to a PTE page's deposit_node, which links to another PTE page's
   deposit_node.
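
   The linkage described above can be sketched in plain C as follows. This is
   only an illustration under assumptions: struct pt_page and the pt_deposit(),
   pt_withdraw() and pt_deposit_count() helpers are hypothetical names, and
   plain pointers are used here in place of whatever list type the patch
   actually uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a page table page. */
struct pt_page {
	struct pt_page *deposit_head; /* first page deposited into this page */
	struct pt_page *deposit_node; /* next page in the parent's deposit list */
};

/* Push child onto the front of parent's deposit list. */
static void pt_deposit(struct pt_page *parent, struct pt_page *child)
{
	assert(parent != NULL && child != NULL);
	child->deposit_node = parent->deposit_head;
	parent->deposit_head = child;
}

/* Pop the most recently deposited page, or return NULL if the list is empty. */
static struct pt_page *pt_withdraw(struct pt_page *parent)
{
	struct pt_page *child = parent->deposit_head;

	if (child) {
		parent->deposit_head = child->deposit_node;
		child->deposit_node = NULL;
	}
	return child;
}

/* Count the pages currently deposited into parent. */
static size_t pt_deposit_count(const struct pt_page *parent)
{
	size_t n = 0;
	const struct pt_page *p;

	for (p = parent->deposit_head; p; p = p->deposit_node)
		n++;
	return n;
}
```

   Because a deposited page keeps its own deposit_head, a PMD page that already
   holds 512 PTE pages can itself be deposited elsewhere, which gives the
   hierarchy.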

3. Patch 4, 5, 6: helper functions for allocating page table pages for PUD
   THPs, and changes to thp_order and thp_nr.

4. Patch 7 to 23: PUD THP implementation. It is broken into small patches for
   easy review.

5. Patch 24, 25: new page size encoding for MADV_HUGEPAGE and MADV_NOHUGEPAGE in
   madvise. Users can specify the THP size. Only MADV_HUGEPAGE_1GB is accepted.
   VM_HUGEPAGE_PUD is added to vm_flags to store the information at bit 37.
   You are welcome to suggest any other approach.

6. Patch 26, 27: enable_pud_thp and hpage_pud_size are added to
   /sys/kernel/mm/transparent_hugepage/. enable_pud_thp is set to never by
   default.
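
   As a rough illustration, userspace could inspect the two new files like
   this. The helper name read_thp_knob() is made up for the example, and the
   content format of the files is not described in this cover letter, so the
   sketch only reads the first line verbatim. On a kernel without this
   patchset the files do not exist and the helper returns -1.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define THP_SYSFS_DIR "/sys/kernel/mm/transparent_hugepage/"

/*
 * Read the first line of a THP sysfs file into buf (newline stripped).
 * Returns 0 on success, -1 if the file is missing or cannot be read.
 */
static int read_thp_knob(const char *name, char *buf, size_t len)
{
	char path[256];
	FILE *f;

	assert(name != NULL && buf != NULL && len > 0);
	snprintf(path, sizeof(path), "%s%s", THP_SYSFS_DIR, name);

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}
```

   For example, read_thp_knob("enable_pud_thp", buf, sizeof(buf)) and
   read_thp_knob("hpage_pud_size", buf, sizeof(buf)) would fetch the two
   values on a patched kernel.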

7. Patch 28, 29: PUD THPs are allocated only from boot-time reserved CMA regions.
   The CMA regions can be used for other movable page allocations.


Design for PUD-, PMD-, and PTE-mapped PUD THP
====

The one design addition compared to PMD THP is support for PMD-mapped PUD THP;
the original THP design already supports PUD-mapped and PTE-mapped PUD THP
automatically.

PMD mapcounts are stored at subpages 512*N + 3 (N = 0 to 511), and subpages
512*N are called PMDPageInPUD pages. A PUDDoubleMap bit is stored at the third
subpage of a PUD THP, using the same page flag position as DoubleMap (stored
at the second subpage of a PMD THP), to indicate a PUD THP with both PUD and
PMD mappings.


A PUD THP looks like:

┌───┬───┬───┬───┬─────┬───┬───┬───┬───┬────────┬──────┐
│ H │ T │ T │ T │ ... │ T │ T │ T │ T │  ...   │  T   │
│ 0 │ 1 │ 2 │ 3 │     │512│513│514│515│        │262143│
└───┴───┴───┴───┴─────┴───┴───┴───┴───┴────────┴──────┘

The PMDPageInPUD pages in a PUD THP (only the first two are shown below).
Note that PMDPageInPUD pages are identified by their position relative to the
head page of the PUD THP and, except for the first one, are still tail pages,
so H_0, T_512, T_1024, ..., T_512x511 are all PMDPageInPUD pages:

 ┌────────────┬──────────────┬────────────┬──────────────┬───────────────────┐
 │PMDPageInPUD│     ...      │PMDPageInPUD│     ...      │  the remaining    │
 │    page    │ 511 subpages │    page    │ 511 subpages │ 510x512 subpages  │
 └────────────┴──────────────┴────────────┴──────────────┴───────────────────┘
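
The index arithmetic behind the two diagrams can be written down as a small
sketch, assuming 4KB base pages and 2MB PMD pages as on x86_64. The macro and
function names are made up for illustration and are not the ones used in the
patches.

```c
#include <assert.h>
#include <stdbool.h>

#define SUBPAGES_PER_PMD 512UL  /* 2MB / 4KB */
#define PMDS_PER_PUD     512UL  /* 1GB / 2MB */
#define SUBPAGES_PER_PUD (SUBPAGES_PER_PMD * PMDS_PER_PUD)  /* 262144 */

/* True if subpage idx (relative to the head page) is a PMDPageInPUD page. */
static bool is_pmd_page_in_pud(unsigned long idx)
{
	assert(idx < SUBPAGES_PER_PUD);
	return idx % SUBPAGES_PER_PMD == 0;
}

/* Index of the PMDPageInPUD page that covers subpage idx. */
static unsigned long pmd_page_in_pud_of(unsigned long idx)
{
	assert(idx < SUBPAGES_PER_PUD);
	return idx - idx % SUBPAGES_PER_PMD;
}
```

The last subpage, T_262143, falls under the PMDPageInPUD page T_512x511, i.e.
T_261632.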


Mapcount positions:

* For each subpage, its PTE mapcount is _mapcount, the same as PMD THP.
* For PUD THP, its PUD-mapping uses compound_mapcount at T_1, the same as PMD THP.
* For PMD-mapped PUD THP, its PMD-mapping uses compound_mapcount at T_3, T_515,
  ..., T_512x511+3. It is called sub_compound_mapcount.

PUDDoubleMap and DoubleMap in PUD THP:

* PUDDoubleMap is stored at the page flag of T_2 (third subpage), reusing the
  DoubleMap's position.
* DoubleMap is stored at the page flags of T_1 (second subpage), T_513, ...,
  T_512x511+1.
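
The positions listed above can be summarized as a lookup from a subpage index
and a mapping level to the subpage that holds the relevant counter or flag.
This is only a sketch of the layout as described in this letter; the enum and
function names are hypothetical.

```c
#include <assert.h>

#define SUBPAGES_PER_PMD 512UL
#define SUBPAGES_PER_PUD (512UL * 512UL)

enum map_level { MAP_PTE, MAP_PMD, MAP_PUD };

/* Subpage index whose mapcount field tracks a mapping of idx at this level. */
static unsigned long mapcount_subpage(unsigned long idx, enum map_level level)
{
	assert(idx < SUBPAGES_PER_PUD);
	switch (level) {
	case MAP_PUD:
		return 1;  /* compound_mapcount at T_1 */
	case MAP_PMD:
		/* sub_compound_mapcount at T_(512*N + 3) */
		return idx / SUBPAGES_PER_PMD * SUBPAGES_PER_PMD + 3;
	default:
		return idx;  /* per-subpage _mapcount */
	}
}

/* Subpage holding the PUDDoubleMap flag: always T_2. */
static unsigned long pud_doublemap_subpage(void)
{
	return 2;
}

/* Subpage holding the DoubleMap flag for the PMD range covering idx. */
static unsigned long doublemap_subpage(unsigned long idx)
{
	assert(idx < SUBPAGES_PER_PUD);
	return idx / SUBPAGES_PER_PMD * SUBPAGES_PER_PMD + 1;
}
```

For instance, subpage 700 lies in the second PMD range, so its PMD mapcount is
at T_515 and its DoubleMap flag is at T_513.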

[0] https://lore.kernel.org/linux-mm/20200907072014.GD30144@dhcp22.suse.cz/
[1] https://lore.kernel.org/linux-mm/20200903162527.GF60440@carbon.dhcp.thefacebook.com/
[2] https://lore.kernel.org/linux-mm/20200903165051.GN24045@ziepe.ca/


Changelog from RFC v1
====
1. Add Jason's PUD entry READ_ONCE patch and my PMD entry READ_ONCE patch to
   get consistent page table entry reading in lockless page table walks.
2. Use a singly linked list for page table page deposit instead of the
   pagechain data structure from RFC v1.
3. Address Kirill's comments.
4. Remove PUD page allocation via alloc_contig_pages(), using cma_alloc only.
5. Add madvise flag MADV_HUGEPAGE_1GB to explicitly enable PUD THP on specific
   VMAs instead of reusing MADV_HUGEPAGE. A new vm_flags VM_HUGEPAGE_PUD is
   added to achieve this.
6. Break large patches in v1 into small ones for easy review.

Jason Gunthorpe (1):
  mm/pagewalk: use READ_ONCE when reading the PUD entry unlocked

Zi Yan (29):
  mm: pagewalk: use READ_ONCE when reading the PMD entry unlocked
  mm: thp: use single linked list for THP page table page deposit.
  mm: add new helper functions to allocate one PMD page with 512 PTE
    pages.
  mm: thp: add page table deposit/withdraw functions for PUD THP.
  mm: change thp_order and thp_nr as we will have not just PMD THPs.
  mm: thp: add anonymous PUD THP page fault support without enabling it.
  mm: thp: add PUD THP support for copy_huge_pud.
  mm: thp: add PUD THP support to zap_huge_pud.
  fs: proc: add PUD THP kpageflag.
  mm: thp: handling PUD THP reference bit.
  mm: rmap: add mappped/unmapped page order to anonymous page rmap
    functions.
  mm: rmap: add map_order to page_remove_anon_compound_rmap.
  mm: thp: add PUD THP split_huge_pud_page() function.
  mm: thp: add PUD THP to deferred split list when PUD mapping is gone.
  mm: debug: adapt dump_page to PUD THP.
  mm: thp: PUD THP COW splits PUD page and falls back to PMD page.
  mm: thp: PUD THP follow_p*d_page() support.
  mm: stats: make smap stats understand PUD THPs.
  mm: page_vma_walk: teach it about PMD-mapped PUD THP.
  mm: thp: PUD THP support in try_to_unmap().
  mm: thp: split PUD THPs at page reclaim.
  mm: support PUD THP pagemap support.
  mm: madvise: add page size options to MADV_HUGEPAGE and
    MADV_NOHUGEPAGE.
  mm: vma: add VM_HUGEPAGE_PUD to vm_flags at bit 37.
  mm: thp: add a global knob to enable/disable PUD THPs.
  mm: thp: make PUD THP size public.
  hugetlb: cma: move cma reserve function to cma.c.
  mm: thp: use cma reservation for pud thp allocation.
  mm: thp: enable anonymous PUD THP at page fault path.

 .../admin-guide/kernel-parameters.txt         |   2 +-
 Documentation/admin-guide/mm/transhuge.rst    |   1 +
 arch/arm64/mm/hugetlbpage.c                   |   2 +-
 arch/powerpc/mm/hugetlbpage.c                 |   2 +-
 arch/x86/include/asm/pgalloc.h                |  69 ++
 arch/x86/include/asm/pgtable.h                |  26 +
 arch/x86/kernel/setup.c                       |   8 +-
 arch/x86/mm/pgtable.c                         |  38 +
 drivers/base/node.c                           |   3 +
 fs/proc/meminfo.c                             |   2 +
 fs/proc/page.c                                |   2 +
 fs/proc/task_mmu.c                            | 200 +++-
 include/linux/cma.h                           |  18 +
 include/linux/huge_mm.h                       |  84 +-
 include/linux/hugetlb.h                       |  12 -
 include/linux/memcontrol.h                    |   5 +
 include/linux/mm.h                            |  42 +-
 include/linux/mm_types.h                      |  11 +-
 include/linux/mmu_notifier.h                  |  13 +
 include/linux/mmzone.h                        |   1 +
 include/linux/page-flags.h                    |  48 +
 include/linux/pagewalk.h                      |   4 +-
 include/linux/pgtable.h                       |  34 +
 include/linux/rmap.h                          |  10 +-
 include/linux/swap.h                          |   2 +
 include/linux/vm_event_item.h                 |   7 +
 include/uapi/asm-generic/mman-common.h        |  23 +
 include/uapi/linux/kernel-page-flags.h        |   1 +
 kernel/events/uprobes.c                       |   4 +-
 kernel/fork.c                                 |  10 +-
 mm/cma.c                                      | 119 +++
 mm/debug.c                                    |   6 +-
 mm/gup.c                                      |  60 +-
 mm/hmm.c                                      |  16 +-
 mm/huge_memory.c                              | 899 +++++++++++++++++-
 mm/hugetlb.c                                  | 117 +--
 mm/khugepaged.c                               |  16 +-
 mm/ksm.c                                      |   4 +-
 mm/madvise.c                                  |  76 +-
 mm/mapping_dirty_helpers.c                    |   6 +-
 mm/memcontrol.c                               |  43 +-
 mm/memory.c                                   |  28 +-
 mm/mempolicy.c                                |  29 +-
 mm/migrate.c                                  |  12 +-
 mm/mincore.c                                  |  10 +-
 mm/page_alloc.c                               |  53 +-
 mm/page_vma_mapped.c                          | 171 +++-
 mm/pagewalk.c                                 |  47 +-
 mm/pgtable-generic.c                          |  49 +-
 mm/ptdump.c                                   |   3 +-
 mm/rmap.c                                     | 300 ++++--
 mm/swap.c                                     |  30 +
 mm/swap_slots.c                               |   2 +
 mm/swapfile.c                                 |  11 +-
 mm/userfaultfd.c                              |   2 +-
 mm/util.c                                     |  22 +-
 mm/vmscan.c                                   |  33 +-
 mm/vmstat.c                                   |   8 +
 58 files changed, 2396 insertions(+), 460 deletions(-)

--
2.28.0


