* [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
@ 2020-09-28 17:53 Zi Yan
  2020-09-28 17:53 ` [RFC PATCH v2 01/30] mm/pagewalk: use READ_ONCE when reading the PUD entry unlocked Zi Yan
                   ` (30 more replies)
  0 siblings, 31 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

Hi all,

This patchset adds support for 1GB PUD THP on x86_64. It is on top of
v5.9-rc5-mmots-2020-09-18-21-23. It is also available at:
https://github.com/x-y-z/linux-1gb-thp/tree/1gb_thp_v5.9-rc5-mmots-2020-09-18-21-23

Besides PUD THP, we had some discussion on generating THPs and contiguous
physical memory via a synchronous system call [0]. I am planning to send out
a separate patchset on it later, since I feel it can be done independently of
PUD THP support.

Any comment or suggestion is welcome. Thanks.

Motivation
====
The patchset aims to provide a more transparent way of boosting virtual
memory performance by leveraging gigantic TLB entries, compared to hugetlbfs
pages [1,2]. Roman also said he would provide performance numbers for 1GB
PUD THP once the patchset is in relatively good shape [1].


Patchset organization
====

1. Patch 1 and 2: Jason's PUD entry READ_ONCE patch for walk_page_range to give
   a consistent read of PUD entries during lockless page table walks.
   I also add a PMD entry READ_ONCE patch, since the PMD-level walk_page_range
   has the same lockless behavior as the PUD level.

2. Patch 3: THP page table deposit now uses a singly linked list to enable
   hierarchical page table deposit, i.e., depositing a PMD page onto which 512
   PTE pages have themselves been deposited. Every page table page has a
   deposit_head and a deposit_node. For example, when storing 512 PTE pages
   in a PMD page, the PMD page's deposit_head links to one PTE page's
   deposit_node, which links to the next PTE page's deposit_node, and so on.

3. Patch 4,5,6: helper functions for allocating page table pages for PUD THPs,
   plus changes to thp_order and thp_nr.

4. Patch 7 to 23: PUD THP implementation. It is broken into small patches for
   easy review.

5. Patch 24, 25: new page size encoding for MADV_HUGEPAGE and MADV_NOHUGEPAGE in
   madvise, so users can specify the THP size. Only MADV_HUGEPAGE_1GB is
   accepted for now (see the usage sketch after this list). VM_HUGEPAGE_PUD is
   added to vm_flags at bit 37 to store the information. You are welcome to
   suggest any other approach.

6. Patch 26, 27: enable_pud_thp and hpage_pud_size are added to
   /sys/kernel/mm/transparent_hugepage/. enable_pud_thp is set to never by
   default.

7. Patch 28, 29: PUD THPs are allocated only from boot-time reserved CMA regions.
   The CMA regions can still be used for other movable page allocations.
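
For illustration, the intended user-space usage might look like the sketch
below (MADV_HUGEPAGE_1GB is the flag added in patch 24; the 2GB length and
the lack of explicit 1GB alignment handling are simplifications):

  #include <sys/mman.h>

  /* Sketch: map 2GB of anonymous memory and ask for 1GB THPs on it. */
  size_t len = 2UL << 30;
  void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (buf != MAP_FAILED)
          madvise(buf, len, MADV_HUGEPAGE_1GB);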


Design for PUD-, PMD-, and PTE-mapped PUD THP
====

Compared to PMD THP, the one additional design element here is support for
PMD-mapped PUD THP, since the original THP design already handles PUD-mapped
and PTE-mapped PUD THPs automatically.

PMD-level mapcounts are stored at subpages (512*N + 3) (N = 0 to 511), and
the 512*N subpages are called PMDPageInPUD. A PUDDoubleMap bit is stored at
the third subpage of a PUD THP, using the same page flag position as
DoubleMap (stored at the second subpage of a PMD THP), to indicate a PUD THP
with both PUD and PMD mappings.


A PUD THP looks like:

┌───┬───┬───┬───┬─────┬───┬───┬───┬───┬────────┬──────┐
│ H │ T │ T │ T │ ... │ T │ T │ T │ T │  ...   │  T   │
│ 0 │ 1 │ 2 │ 3 │     │512│513│514│515│        │262143│
└───┴───┴───┴───┴─────┴───┴───┴───┴───┴────────┴──────┘

PMDPageInPUD pages in a PUD THP (only the first two PMDPageInPUD pages are
shown below). Note that PMDPageInPUD pages are identified by their position
relative to the head page of the PUD THP and, except for the first one, are
still tail pages, so H_0, T_512, T_1024, ..., T_512x511 are all PMDPageInPUD
pages:

 ┌────────────┬──────────────┬────────────┬──────────────┬───────────────────┐
 │PMDPageInPUD│     ...      │PMDPageInPUD│     ...      │  the remaining    │
 │    page    │ 511 subpages │    page    │ 511 subpages │ 510x512 subpages  │
 └────────────┴──────────────┴────────────┴──────────────┴───────────────────┘


Mapcount positions:

* For each subpage, its PTE mapcount is in _mapcount, the same as for a PMD THP.
* For a PUD THP, the PUD mapping uses the compound_mapcount at T_1, the same
  as a PMD THP.
* For a PMD-mapped PUD THP, each PMD mapping uses the compound_mapcount at
  T_3, T_515, ..., T_512x511+3, called sub_compound_mapcount.

PUDDoubleMap and DoubleMap in a PUD THP:

* PUDDoubleMap is stored in the page flags of T_2 (third subpage), reusing
  DoubleMap's bit position.
* DoubleMap is stored in the page flags of T_1 (second subpage), T_513, ...,
  T_512x511+1.
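
To make the layout concrete, a helper locating a PMD-level mapcount could
look like the sketch below (derived from the description above; the exact
helper in the series may differ):

  /* Sketch: return the sub_compound_mapcount location for the
   * PMD-sized range containing @page in a PUD THP. Per the layout
   * above, it sits at offset 3 within each 512-subpage PMD range,
   * i.e., at T_3, T_515, ..., T_512x511+3. */
  static inline atomic_t *sub_compound_mapcount_ptr(struct page *page)
  {
          struct page *head = compound_head(page);
          unsigned long offset = (page - head) & ~(HPAGE_PMD_NR - 1UL);

          return &head[offset + 3].compound_mapcount;
  }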

[0] https://lore.kernel.org/linux-mm/20200907072014.GD30144@dhcp22.suse.cz/
[1] https://lore.kernel.org/linux-mm/20200903162527.GF60440@carbon.dhcp.thefacebook.com/
[2] https://lore.kernel.org/linux-mm/20200903165051.GN24045@ziepe.ca/


Changelog from RFC v1
====
1. Add Jason's PUD entry READ_ONCE patch and my PMD entry READ_ONCE patch to
   get consistent page table entry reading in lockless page table walks.
2. Use a singly linked list for page table page deposit instead of the
   pagechain data structure from RFC v1.
3. Address Kirill's comments.
4. Remove PUD page allocation via alloc_contig_pages(), using cma_alloc only.
5. Add madvise flag MADV_HUGEPAGE_1GB to explicitly enable PUD THP on specific
   VMAs instead of reusing MADV_HUGEPAGE. A new vm_flags VM_HUGEPAGE_PUD is
   added to achieve this.
6. Break large patches in v1 into small ones for easy review.

Jason Gunthorpe (1):
  mm/pagewalk: use READ_ONCE when reading the PUD entry unlocked

Zi Yan (29):
  mm: pagewalk: use READ_ONCE when reading the PMD entry unlocked
  mm: thp: use single linked list for THP page table page deposit.
  mm: add new helper functions to allocate one PMD page with 512 PTE
    pages.
  mm: thp: add page table deposit/withdraw functions for PUD THP.
  mm: change thp_order and thp_nr as we will have not just PMD THPs.
  mm: thp: add anonymous PUD THP page fault support without enabling it.
  mm: thp: add PUD THP support for copy_huge_pud.
  mm: thp: add PUD THP support to zap_huge_pud.
  fs: proc: add PUD THP kpageflag.
  mm: thp: handling PUD THP reference bit.
  mm: rmap: add mapped/unmapped page order to anonymous page rmap
    functions.
  mm: rmap: add map_order to page_remove_anon_compound_rmap.
  mm: thp: add PUD THP split_huge_pud_page() function.
  mm: thp: add PUD THP to deferred split list when PUD mapping is gone.
  mm: debug: adapt dump_page to PUD THP.
  mm: thp: PUD THP COW splits PUD page and falls back to PMD page.
  mm: thp: PUD THP follow_p*d_page() support.
  mm: stats: make smap stats understand PUD THPs.
  mm: page_vma_walk: teach it about PMD-mapped PUD THP.
  mm: thp: PUD THP support in try_to_unmap().
  mm: thp: split PUD THPs at page reclaim.
  mm: add PUD THP pagemap support.
  mm: madvise: add page size options to MADV_HUGEPAGE and
    MADV_NOHUGEPAGE.
  mm: vma: add VM_HUGEPAGE_PUD to vm_flags at bit 37.
  mm: thp: add a global knob to enable/disable PUD THPs.
  mm: thp: make PUD THP size public.
  hugetlb: cma: move cma reserve function to cma.c.
  mm: thp: use cma reservation for pud thp allocation.
  mm: thp: enable anonymous PUD THP at page fault path.

 .../admin-guide/kernel-parameters.txt         |   2 +-
 Documentation/admin-guide/mm/transhuge.rst    |   1 +
 arch/arm64/mm/hugetlbpage.c                   |   2 +-
 arch/powerpc/mm/hugetlbpage.c                 |   2 +-
 arch/x86/include/asm/pgalloc.h                |  69 ++
 arch/x86/include/asm/pgtable.h                |  26 +
 arch/x86/kernel/setup.c                       |   8 +-
 arch/x86/mm/pgtable.c                         |  38 +
 drivers/base/node.c                           |   3 +
 fs/proc/meminfo.c                             |   2 +
 fs/proc/page.c                                |   2 +
 fs/proc/task_mmu.c                            | 200 +++-
 include/linux/cma.h                           |  18 +
 include/linux/huge_mm.h                       |  84 +-
 include/linux/hugetlb.h                       |  12 -
 include/linux/memcontrol.h                    |   5 +
 include/linux/mm.h                            |  42 +-
 include/linux/mm_types.h                      |  11 +-
 include/linux/mmu_notifier.h                  |  13 +
 include/linux/mmzone.h                        |   1 +
 include/linux/page-flags.h                    |  48 +
 include/linux/pagewalk.h                      |   4 +-
 include/linux/pgtable.h                       |  34 +
 include/linux/rmap.h                          |  10 +-
 include/linux/swap.h                          |   2 +
 include/linux/vm_event_item.h                 |   7 +
 include/uapi/asm-generic/mman-common.h        |  23 +
 include/uapi/linux/kernel-page-flags.h        |   1 +
 kernel/events/uprobes.c                       |   4 +-
 kernel/fork.c                                 |  10 +-
 mm/cma.c                                      | 119 +++
 mm/debug.c                                    |   6 +-
 mm/gup.c                                      |  60 +-
 mm/hmm.c                                      |  16 +-
 mm/huge_memory.c                              | 899 +++++++++++++++++-
 mm/hugetlb.c                                  | 117 +--
 mm/khugepaged.c                               |  16 +-
 mm/ksm.c                                      |   4 +-
 mm/madvise.c                                  |  76 +-
 mm/mapping_dirty_helpers.c                    |   6 +-
 mm/memcontrol.c                               |  43 +-
 mm/memory.c                                   |  28 +-
 mm/mempolicy.c                                |  29 +-
 mm/migrate.c                                  |  12 +-
 mm/mincore.c                                  |  10 +-
 mm/page_alloc.c                               |  53 +-
 mm/page_vma_mapped.c                          | 171 +++-
 mm/pagewalk.c                                 |  47 +-
 mm/pgtable-generic.c                          |  49 +-
 mm/ptdump.c                                   |   3 +-
 mm/rmap.c                                     | 300 ++++--
 mm/swap.c                                     |  30 +
 mm/swap_slots.c                               |   2 +
 mm/swapfile.c                                 |  11 +-
 mm/userfaultfd.c                              |   2 +-
 mm/util.c                                     |  22 +-
 mm/vmscan.c                                   |  33 +-
 mm/vmstat.c                                   |   8 +
 58 files changed, 2396 insertions(+), 460 deletions(-)

--
2.28.0



* [RFC PATCH v2 01/30] mm/pagewalk: use READ_ONCE when reading the PUD entry unlocked
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
@ 2020-09-28 17:53 ` Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 02/30] mm: pagewalk: use READ_ONCE when reading the PMD " Zi Yan
                   ` (29 subsequent siblings)
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel

From: Jason Gunthorpe <jgg@nvidia.com>

The pagewalker runs while only holding the mmap_sem for read. The pud can
be set asynchronously, while also holding the mmap_sem for read, e.g. from:

 handle_mm_fault()
  __handle_mm_fault()
   create_huge_pmd()
    dev_dax_huge_fault()
     __dev_dax_pud_fault()
      vmf_insert_pfn_pud()
       insert_pfn_pud()
        pud_lock()
        set_pud_at()

At least x86 sets the PUD using WRITE_ONCE(), so an unlocked read of
unstable data should be paired to use READ_ONCE().

For the pagewalker to work locklessly the PUD must work similarly to the
PMD: once the PUD entry becomes a pointer to a PMD, it must be stable, and
safe to pass to pmd_offset().

Passing the value from READ_ONCE into the callbacks prevents the callers
from seeing inconsistencies after they re-read, such as seeing pud_none().

If a callback does obtain the pud_lock then it should trigger ACTION_AGAIN
if a data race caused the original value to change.

Use the same pattern as gup_pmd_range() and pass in the address of the
local READ_ONCE stack variable to pmd_offset() to avoid reading it again.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 include/linux/pagewalk.h   |  2 +-
 mm/hmm.c                   | 16 +++++++---------
 mm/mapping_dirty_helpers.c |  6 ++----
 mm/pagewalk.c              | 28 ++++++++++++++++------------
 mm/ptdump.c                |  3 +--
 5 files changed, 27 insertions(+), 28 deletions(-)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index b1cb6b753abb..6caf28aadafb 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -39,7 +39,7 @@ struct mm_walk_ops {
 			 unsigned long next, struct mm_walk *walk);
 	int (*p4d_entry)(p4d_t *p4d, unsigned long addr,
 			 unsigned long next, struct mm_walk *walk);
-	int (*pud_entry)(pud_t *pud, unsigned long addr,
+	int (*pud_entry)(pud_t pud, pud_t *pudp, unsigned long addr,
 			 unsigned long next, struct mm_walk *walk);
 	int (*pmd_entry)(pmd_t *pmd, unsigned long addr,
 			 unsigned long next, struct mm_walk *walk);
diff --git a/mm/hmm.c b/mm/hmm.c
index 943cb2ba4442..419e9e50fd51 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -402,28 +402,26 @@ static inline unsigned long pud_to_hmm_pfn_flags(struct hmm_range *range,
 	       hmm_pfn_flags_order(PUD_SHIFT - PAGE_SHIFT);
 }
 
-static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
-		struct mm_walk *walk)
+static int hmm_vma_walk_pud(pud_t pud, pud_t *pudp, unsigned long start,
+			    unsigned long end, struct mm_walk *walk)
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
 	unsigned long addr = start;
-	pud_t pud;
 	int ret = 0;
 	spinlock_t *ptl = pud_trans_huge_lock(pudp, walk->vma);
 
 	if (!ptl)
 		return 0;
+	if (memcmp(pudp, &pud, sizeof(pud)) != 0) {
+		walk->action = ACTION_AGAIN;
+		spin_unlock(ptl);
+		return 0;
+	}
 
 	/* Normally we don't want to split the huge page */
 	walk->action = ACTION_CONTINUE;
 
-	pud = READ_ONCE(*pudp);
-	if (pud_none(pud)) {
-		spin_unlock(ptl);
-		return hmm_vma_walk_hole(start, end, -1, walk);
-	}
-
 	if (pud_huge(pud) && pud_devmap(pud)) {
 		unsigned long i, npages, pfn;
 		unsigned int required_fault;
diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c
index 2c7d03675903..9fc46ebef497 100644
--- a/mm/mapping_dirty_helpers.c
+++ b/mm/mapping_dirty_helpers.c
@@ -150,11 +150,9 @@ static int wp_clean_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long end,
  * causes dirty info loss. The pagefault handler should do
  * that if needed.
  */
-static int wp_clean_pud_entry(pud_t *pud, unsigned long addr, unsigned long end,
-			      struct mm_walk *walk)
+static int wp_clean_pud_entry(pud_t pudval, pud_t *pudp, unsigned long addr,
+			      unsigned long end, struct mm_walk *walk)
 {
-	pud_t pudval = READ_ONCE(*pud);
-
 	if (!pud_trans_unstable(&pudval))
 		return 0;
 
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e81640d9f177..15d1e423b4a3 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -58,7 +58,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	return err;
 }
 
-static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
+static int walk_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 			  struct mm_walk *walk)
 {
 	pmd_t *pmd;
@@ -67,7 +67,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	int err = 0;
 	int depth = real_depth(3);
 
-	pmd = pmd_offset(pud, addr);
+	pmd = pmd_offset(&pud, addr);
 	do {
 again:
 		next = pmd_addr_end(addr, end);
@@ -119,17 +119,19 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 			  struct mm_walk *walk)
 {
-	pud_t *pud;
+	pud_t *pudp;
+	pud_t pud;
 	unsigned long next;
 	const struct mm_walk_ops *ops = walk->ops;
 	int err = 0;
 	int depth = real_depth(2);
 
-	pud = pud_offset(p4d, addr);
+	pudp = pud_offset(p4d, addr);
 	do {
  again:
+		pud = READ_ONCE(*pudp);
 		next = pud_addr_end(addr, end);
-		if (pud_none(*pud) || (!walk->vma && !walk->no_vma)) {
+		if (pud_none(pud) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
 			if (err)
@@ -140,27 +142,29 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 		walk->action = ACTION_SUBTREE;
 
 		if (ops->pud_entry)
-			err = ops->pud_entry(pud, addr, next, walk);
+			err = ops->pud_entry(pud, pudp, addr, next, walk);
 		if (err)
 			break;
 
 		if (walk->action == ACTION_AGAIN)
 			goto again;
 
-		if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
+		if ((!walk->vma && (pud_leaf(pud) || !pud_present(pud))) ||
 		    walk->action == ACTION_CONTINUE ||
 		    !(ops->pmd_entry || ops->pte_entry))
 			continue;
 
-		if (walk->vma)
-			split_huge_pud(walk->vma, pud, addr);
-		if (pud_none(*pud))
-			goto again;
+		if (walk->vma) {
+			split_huge_pud(walk->vma, pudp, addr);
+			pud = READ_ONCE(*pudp);
+			if (pud_none(pud))
+				goto again;
+		}
 
 		err = walk_pmd_range(pud, addr, next, walk);
 		if (err)
 			break;
-	} while (pud++, addr = next, addr != end);
+	} while (pudp++, addr = next, addr != end);
 
 	return err;
 }
diff --git a/mm/ptdump.c b/mm/ptdump.c
index ba88ec43ff21..2055b940408e 100644
--- a/mm/ptdump.c
+++ b/mm/ptdump.c
@@ -65,11 +65,10 @@ static int ptdump_p4d_entry(p4d_t *p4d, unsigned long addr,
 	return 0;
 }
 
-static int ptdump_pud_entry(pud_t *pud, unsigned long addr,
+static int ptdump_pud_entry(pud_t val, pud_t *pudp, unsigned long addr,
 			    unsigned long next, struct mm_walk *walk)
 {
 	struct ptdump_state *st = walk->private;
-	pud_t val = READ_ONCE(*pud);
 
 #if CONFIG_PGTABLE_LEVELS > 2 && defined(CONFIG_KASAN)
 	if (pud_page(val) == virt_to_page(lm_alias(kasan_early_shadow_pmd)))
-- 
2.28.0



* [RFC PATCH v2 02/30] mm: pagewalk: use READ_ONCE when reading the PMD entry unlocked
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
  2020-09-28 17:53 ` [RFC PATCH v2 01/30] mm/pagewalk: use READ_ONCE when reading the PUD entry unlocked Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 03/30] mm: thp: use single linked list for THP page table page deposit Zi Yan
                   ` (28 subsequent siblings)
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

The pagewalker runs while only holding the mmap_sem for read. The pmd can
be set asynchronously, while also holding the mmap_sem for read.

This follows the same pattern as the previous commit,
"mm/pagewalk: use READ_ONCE when reading the PUD entry unlocked".

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 fs/proc/task_mmu.c       | 69 ++++++++++++++++++++++++++--------------
 include/linux/pagewalk.h |  2 +-
 mm/madvise.c             | 59 ++++++++++++++++++----------------
 mm/memcontrol.c          | 30 +++++++++++------
 mm/mempolicy.c           | 15 ++++++---
 mm/mincore.c             | 10 +++---
 mm/pagewalk.c            | 21 ++++++------
 7 files changed, 124 insertions(+), 82 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 069978777423..a21484b1414d 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -570,28 +570,33 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
 }
 #endif
 
-static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
-			   struct mm_walk *walk)
+static int smaps_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
+			unsigned long end, struct mm_walk *walk)
 {
 	struct vm_area_struct *vma = walk->vma;
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	ptl = pmd_trans_huge_lock(pmd, vma);
+	ptl = pmd_trans_huge_lock(pmdp, vma);
 	if (ptl) {
-		smaps_pmd_entry(pmd, addr, walk);
+		if (memcmp(pmdp, &pmd, sizeof(pmd)) != 0) {
+			walk->action = ACTION_AGAIN;
+			spin_unlock(ptl);
+			return 0;
+		}
+		smaps_pmd_entry(pmdp, addr, walk);
 		spin_unlock(ptl);
 		goto out;
 	}
 
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(&pmd))
 		goto out;
 	/*
 	 * The mmap_lock held all the way back in m_start() is what
 	 * keeps khugepaged out of here and from collapsing things
 	 * in here.
 	 */
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	pte = pte_offset_map_lock(vma->vm_mm, pmdp, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE)
 		smaps_pte_entry(pte, addr, walk);
 	pte_unmap_unlock(pte - 1, ptl);
@@ -1091,7 +1096,7 @@ static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
 }
 #endif
 
-static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
+static int clear_refs_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
 				unsigned long end, struct mm_walk *walk)
 {
 	struct clear_refs_private *cp = walk->private;
@@ -1100,20 +1105,25 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 	spinlock_t *ptl;
 	struct page *page;
 
-	ptl = pmd_trans_huge_lock(pmd, vma);
+	ptl = pmd_trans_huge_lock(pmdp, vma);
 	if (ptl) {
+		if (memcmp(pmdp, &pmd, sizeof(pmd)) != 0) {
+			walk->action = ACTION_AGAIN;
+			spin_unlock(ptl);
+			return 0;
+		}
 		if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
-			clear_soft_dirty_pmd(vma, addr, pmd);
+			clear_soft_dirty_pmd(vma, addr, pmdp);
 			goto out;
 		}
 
-		if (!pmd_present(*pmd))
+		if (!pmd_present(pmd))
 			goto out;
 
-		page = pmd_page(*pmd);
+		page = pmd_page(pmd);
 
 		/* Clear accessed and referenced bits. */
-		pmdp_test_and_clear_young(vma, addr, pmd);
+		pmdp_test_and_clear_young(vma, addr, pmdp);
 		test_and_clear_page_young(page);
 		ClearPageReferenced(page);
 out:
@@ -1121,10 +1131,10 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 		return 0;
 	}
 
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(&pmd))
 		return 0;
 
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	pte = pte_offset_map_lock(vma->vm_mm, pmdp, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;
 
@@ -1388,8 +1398,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 	return make_pme(frame, flags);
 }
 
-static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
-			     struct mm_walk *walk)
+static int pagemap_pmd_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
+			unsigned long end, struct mm_walk *walk)
 {
 	struct vm_area_struct *vma = walk->vma;
 	struct pagemapread *pm = walk->private;
@@ -1401,9 +1411,14 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 	ptl = pmd_trans_huge_lock(pmdp, vma);
 	if (ptl) {
 		u64 flags = 0, frame = 0;
-		pmd_t pmd = *pmdp;
 		struct page *page = NULL;
 
+		if (memcmp(pmdp, &pmd, sizeof(pmd)) != 0) {
+			walk->action = ACTION_AGAIN;
+			spin_unlock(ptl);
+			return 0;
+		}
+
 		if (vma->vm_flags & VM_SOFTDIRTY)
 			flags |= PM_SOFT_DIRTY;
 
@@ -1456,7 +1471,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 		return err;
 	}
 
-	if (pmd_trans_unstable(pmdp))
+	if (pmd_trans_unstable(&pmd))
 		return 0;
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
@@ -1768,7 +1783,7 @@ static struct page *can_gather_numa_stats_pmd(pmd_t pmd,
 }
 #endif
 
-static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
+static int gather_pte_stats(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
 		unsigned long end, struct mm_walk *walk)
 {
 	struct numa_maps *md = walk->private;
@@ -1778,22 +1793,28 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	pte_t *pte;
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	ptl = pmd_trans_huge_lock(pmd, vma);
+	ptl = pmd_trans_huge_lock(pmdp, vma);
 	if (ptl) {
 		struct page *page;
 
-		page = can_gather_numa_stats_pmd(*pmd, vma, addr);
+		if (memcmp(pmdp, &pmd, sizeof(pmd)) != 0) {
+			walk->action = ACTION_AGAIN;
+			spin_unlock(ptl);
+			return 0;
+		}
+
+		page = can_gather_numa_stats_pmd(pmd, vma, addr);
 		if (page)
-			gather_stats(page, md, pmd_dirty(*pmd),
+			gather_stats(page, md, pmd_dirty(pmd),
 				     HPAGE_PMD_SIZE/PAGE_SIZE);
 		spin_unlock(ptl);
 		return 0;
 	}
 
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(&pmd))
 		return 0;
 #endif
-	orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	orig_pte = pte = pte_offset_map_lock(walk->mm, pmdp, addr, &ptl);
 	do {
 		struct page *page = can_gather_numa_stats(*pte, vma, addr);
 		if (!page)
diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 6caf28aadafb..686b57e94a9f 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -41,7 +41,7 @@ struct mm_walk_ops {
 			 unsigned long next, struct mm_walk *walk);
 	int (*pud_entry)(pud_t pud, pud_t *pudp, unsigned long addr,
 			 unsigned long next, struct mm_walk *walk);
-	int (*pmd_entry)(pmd_t *pmd, unsigned long addr,
+	int (*pmd_entry)(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
 			 unsigned long next, struct mm_walk *walk);
 	int (*pte_entry)(pte_t *pte, unsigned long addr,
 			 unsigned long next, struct mm_walk *walk);
diff --git a/mm/madvise.c b/mm/madvise.c
index ae266dfede8a..16e7b8eadb13 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -183,14 +183,14 @@ static long madvise_behavior(struct vm_area_struct *vma,
 }
 
 #ifdef CONFIG_SWAP
-static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
+static int swapin_walk_pmd_entry(pmd_t pmd, pmd_t *pmdp, unsigned long start,
 	unsigned long end, struct mm_walk *walk)
 {
 	pte_t *orig_pte;
 	struct vm_area_struct *vma = walk->private;
 	unsigned long index;
 
-	if (pmd_none_or_trans_huge_or_clear_bad(pmd))
+	if (pmd_none_or_trans_huge_or_clear_bad(&pmd))
 		return 0;
 
 	for (index = start; index != end; index += PAGE_SIZE) {
@@ -199,7 +199,7 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 		struct page *page;
 		spinlock_t *ptl;
 
-		orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
+		orig_pte = pte_offset_map_lock(vma->vm_mm, pmdp, start, &ptl);
 		pte = *(orig_pte + ((index - start) / PAGE_SIZE));
 		pte_unmap_unlock(orig_pte, ptl);
 
@@ -304,7 +304,7 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	return 0;
 }
 
-static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
+static int madvise_cold_or_pageout_pte_range(pmd_t pmd, pmd_t *pmdp,
 				unsigned long addr, unsigned long end,
 				struct mm_walk *walk)
 {
@@ -322,26 +322,29 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		return -EINTR;
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (pmd_trans_huge(*pmd)) {
-		pmd_t orig_pmd;
+	if (pmd_trans_huge(pmd)) {
 		unsigned long next = pmd_addr_end(addr, end);
 
 		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
-		ptl = pmd_trans_huge_lock(pmd, vma);
+		ptl = pmd_trans_huge_lock(pmdp, vma);
 		if (!ptl)
 			return 0;
 
-		orig_pmd = *pmd;
-		if (is_huge_zero_pmd(orig_pmd))
+		if (memcmp(pmdp, &pmd, sizeof(pmd)) != 0) {
+			walk->action = ACTION_AGAIN;
+			goto huge_unlock;
+		}
+
+		if (is_huge_zero_pmd(pmd))
 			goto huge_unlock;
 
-		if (unlikely(!pmd_present(orig_pmd))) {
+		if (unlikely(!pmd_present(pmd))) {
 			VM_BUG_ON(thp_migration_supported() &&
-					!is_pmd_migration_entry(orig_pmd));
+					!is_pmd_migration_entry(pmd));
 			goto huge_unlock;
 		}
 
-		page = pmd_page(orig_pmd);
+		page = pmd_page(pmd);
 
 		/* Do not interfere with other mappings of this page */
 		if (page_mapcount(page) != 1)
@@ -361,12 +364,12 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			return 0;
 		}
 
-		if (pmd_young(orig_pmd)) {
-			pmdp_invalidate(vma, addr, pmd);
-			orig_pmd = pmd_mkold(orig_pmd);
+		if (pmd_young(pmd)) {
+			pmdp_invalidate(vma, addr, pmdp);
+			pmd = pmd_mkold(pmd);
 
-			set_pmd_at(mm, addr, pmd, orig_pmd);
-			tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+			set_pmd_at(mm, addr, pmdp, pmd);
+			tlb_remove_pmd_tlb_entry(tlb, pmdp, addr);
 		}
 
 		ClearPageReferenced(page);
@@ -388,11 +391,11 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	}
 
 regular_page:
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(&pmd))
 		return 0;
 #endif
 	tlb_change_page_size(tlb, PAGE_SIZE);
-	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmdp, addr, &ptl);
 	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
 	for (; addr < end; pte++, addr += PAGE_SIZE) {
@@ -424,12 +427,12 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			if (split_huge_page(page)) {
 				unlock_page(page);
 				put_page(page);
-				pte_offset_map_lock(mm, pmd, addr, &ptl);
+				pte_offset_map_lock(mm, pmdp, addr, &ptl);
 				break;
 			}
 			unlock_page(page);
 			put_page(page);
-			pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+			pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 			pte--;
 			addr -= PAGE_SIZE;
 			continue;
@@ -566,7 +569,7 @@ static long madvise_pageout(struct vm_area_struct *vma,
 	return 0;
 }
 
-static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
+static int madvise_free_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
 				unsigned long end, struct mm_walk *walk)
 
 {
@@ -580,15 +583,15 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	unsigned long next;
 
 	next = pmd_addr_end(addr, end);
-	if (pmd_trans_huge(*pmd))
-		if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
+	if (pmd_trans_huge(pmd))
+		if (madvise_free_huge_pmd(tlb, vma, pmdp, addr, next))
 			goto next;
 
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(&pmd))
 		return 0;
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
-	orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	orig_pte = pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
@@ -634,12 +637,12 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			if (split_huge_page(page)) {
 				unlock_page(page);
 				put_page(page);
-				pte_offset_map_lock(mm, pmd, addr, &ptl);
+				pte_offset_map_lock(mm, pmdp, addr, &ptl);
 				goto out;
 			}
 			unlock_page(page);
 			put_page(page);
-			pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+			pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 			pte--;
 			addr -= PAGE_SIZE;
 			continue;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9c4a0851348f..b28f620c1c5b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5827,7 +5827,7 @@ static inline enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
 }
 #endif
 
-static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
+static int mem_cgroup_count_precharge_pte_range(pmd_t pmd, pmd_t *pmdp,
 					unsigned long addr, unsigned long end,
 					struct mm_walk *walk)
 {
@@ -5835,22 +5835,27 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	ptl = pmd_trans_huge_lock(pmd, vma);
+	ptl = pmd_trans_huge_lock(pmdp, vma);
 	if (ptl) {
+		if (memcmp(pmdp, &pmd, sizeof(pmd)) != 0) {
+			walk->action = ACTION_AGAIN;
+			spin_unlock(ptl);
+			return 0;
+		}
 		/*
 		 * Note their can not be MC_TARGET_DEVICE for now as we do not
 		 * support transparent huge page with MEMORY_DEVICE_PRIVATE but
 		 * this might change.
 		 */
-		if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
+		if (get_mctgt_type_thp(vma, addr, pmd, NULL) == MC_TARGET_PAGE)
 			mc.precharge += HPAGE_PMD_NR;
 		spin_unlock(ptl);
 		return 0;
 	}
 
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(&pmd))
 		return 0;
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	pte = pte_offset_map_lock(vma->vm_mm, pmdp, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE)
 		if (get_mctgt_type(vma, addr, *pte, NULL))
 			mc.precharge++;	/* increment precharge temporarily */
@@ -6023,7 +6028,7 @@ static void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
 		mem_cgroup_clear_mc();
 }
 
-static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
+static int mem_cgroup_move_charge_pte_range(pmd_t pmd, pmd_t *pmdp,
 				unsigned long addr, unsigned long end,
 				struct mm_walk *walk)
 {
@@ -6035,13 +6040,18 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	union mc_target target;
 	struct page *page;
 
-	ptl = pmd_trans_huge_lock(pmd, vma);
+	ptl = pmd_trans_huge_lock(pmdp, vma);
 	if (ptl) {
+		if (memcmp(pmdp, &pmd, sizeof(pmd)) != 0) {
+			walk->action = ACTION_AGAIN;
+			spin_unlock(ptl);
+			return 0;
+		}
 		if (mc.precharge < HPAGE_PMD_NR) {
 			spin_unlock(ptl);
 			return 0;
 		}
-		target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
+		target_type = get_mctgt_type_thp(vma, addr, pmd, &target);
 		if (target_type == MC_TARGET_PAGE) {
 			page = target.page;
 			if (!isolate_lru_page(page)) {
@@ -6066,10 +6076,10 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 		return 0;
 	}
 
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(&pmd))
 		return 0;
 retry:
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	pte = pte_offset_map_lock(vma->vm_mm, pmdp, addr, &ptl);
 	for (; addr != end; addr += PAGE_SIZE) {
 		pte_t ptent = *(pte++);
 		bool device = false;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index eddbe4e56c73..731a7710395f 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -516,7 +516,7 @@ static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
  * -EIO - only MPOL_MF_STRICT was specified and an existing page was already
  *        on a node that does not follow the policy.
  */
-static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
+static int queue_pages_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
 			unsigned long end, struct mm_walk *walk)
 {
 	struct vm_area_struct *vma = walk->vma;
@@ -528,18 +528,23 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	ptl = pmd_trans_huge_lock(pmd, vma);
+	ptl = pmd_trans_huge_lock(pmdp, vma);
 	if (ptl) {
-		ret = queue_pages_pmd(pmd, ptl, addr, end, walk);
+		if (memcmp(pmdp, &pmd, sizeof(pmd)) != 0) {
+			walk->action = ACTION_AGAIN;
+			spin_unlock(ptl);
+			return 0;
+		}
+		ret = queue_pages_pmd(pmdp, ptl, addr, end, walk);
 		if (ret != 2)
 			return ret;
 	}
 	/* THP was split, fall through to pte walk */
 
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(&pmd))
 		return 0;
 
-	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	pte = pte_offset_map_lock(walk->mm, pmdp, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		if (!pte_present(*pte))
 			continue;
diff --git a/mm/mincore.c b/mm/mincore.c
index 02db1a834021..168661f32aaa 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -96,8 +96,8 @@ static int mincore_unmapped_range(unsigned long addr, unsigned long end,
 	return 0;
 }
 
-static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
-			struct mm_walk *walk)
+static int mincore_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
+			unsigned long end, struct mm_walk *walk)
 {
 	spinlock_t *ptl;
 	struct vm_area_struct *vma = walk->vma;
@@ -105,19 +105,19 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	unsigned char *vec = walk->private;
 	int nr = (end - addr) >> PAGE_SHIFT;
 
-	ptl = pmd_trans_huge_lock(pmd, vma);
+	ptl = pmd_trans_huge_lock(pmdp, vma);
 	if (ptl) {
 		memset(vec, 1, nr);
 		spin_unlock(ptl);
 		goto out;
 	}
 
-	if (pmd_trans_unstable(pmd)) {
+	if (pmd_trans_unstable(&pmd)) {
 		__mincore_unmapped_range(addr, end, vma, vec);
 		goto out;
 	}
 
-	ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	ptep = pte_offset_map_lock(walk->mm, pmdp, addr, &ptl);
 	for (; addr != end; ptep++, addr += PAGE_SIZE) {
 		pte_t pte = *ptep;
 
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 15d1e423b4a3..a3752c82a7b2 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -61,17 +61,19 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 static int walk_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 			  struct mm_walk *walk)
 {
-	pmd_t *pmd;
+	pmd_t *pmdp;
+	pmd_t pmd;
 	unsigned long next;
 	const struct mm_walk_ops *ops = walk->ops;
 	int err = 0;
 	int depth = real_depth(3);
 
-	pmd = pmd_offset(&pud, addr);
+	pmdp = pmd_offset(&pud, addr);
 	do {
 again:
+		pmd = READ_ONCE(*pmdp);
 		next = pmd_addr_end(addr, end);
-		if (pmd_none(*pmd) || (!walk->vma && !walk->no_vma)) {
+		if (pmd_none(pmd) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
 			if (err)
@@ -86,7 +88,7 @@ static int walk_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		 * needs to know about pmd_trans_huge() pmds
 		 */
 		if (ops->pmd_entry)
-			err = ops->pmd_entry(pmd, addr, next, walk);
+			err = ops->pmd_entry(pmd, pmdp, addr, next, walk);
 		if (err)
 			break;
 
@@ -97,21 +99,22 @@ static int walk_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		 * Check this here so we only break down trans_huge
 		 * pages when we _need_ to
 		 */
-		if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) ||
+		if ((!walk->vma && (pmd_leaf(pmd) || !pmd_present(pmd))) ||
 		    walk->action == ACTION_CONTINUE ||
 		    !(ops->pte_entry))
 			continue;
 
 		if (walk->vma) {
-			split_huge_pmd(walk->vma, pmd, addr);
-			if (pmd_trans_unstable(pmd))
+			split_huge_pmd(walk->vma, pmdp, addr);
+			pmd = READ_ONCE(*pmdp);
+			if (pmd_trans_unstable(&pmd))
 				goto again;
 		}
 
-		err = walk_pte_range(pmd, addr, next, walk);
+		err = walk_pte_range(pmdp, addr, next, walk);
 		if (err)
 			break;
-	} while (pmd++, addr = next, addr != end);
+	} while (pmdp++, addr = next, addr != end);
 
 	return err;
 }
-- 
2.28.0



* [RFC PATCH v2 03/30] mm: thp: use single linked list for THP page table page deposit.
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
  2020-09-28 17:53 ` [RFC PATCH v2 01/30] mm/pagewalk: use READ_ONCE when reading the PUD entry unlocked Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 02/30] mm: pagewalk: use READ_ONCE when reading the PMD " Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-28 19:34   ` Matthew Wilcox
  2020-09-28 17:54 ` [RFC PATCH v2 04/30] mm: add new helper functions to allocate one PMD page with 512 PTE pages Zi Yan
                   ` (27 subsequent siblings)
  30 siblings, 1 reply; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

The old design uses the doubly linked list page->lru to chain all
deposited page table pages when creating a THP and page->pmd_huge_pte
to point to the first page of the list. As the second pointer in
page->lru overlaps with page->pmd_huge_pte, the design prevents
multi-level page table page deposit, which is useful for PUD and
higher-level THPs.

The new design uses a singly linked list, where deposit_head points to
a list of deposited pages and deposit_node is used to deposit the page
itself onto another list. For example, this allows one PUD page to point
to a list of PMD pages, each of which points to a list of PTE pages, to
support PUD-level THP.
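
As an illustration, the two-level deposit this enables can be sketched as
follows (hypothetical variables; the PUD-level deposit helpers themselves
arrive in a later patch):

  /* Sketch: chain 512 PTE pages under one PMD page, then chain that
   * PMD page under a PUD page. This works because deposit_head and
   * deposit_node no longer overlap in struct page. */
  for (i = 0; i < 512; i++)
          llist_add(&pte_pages[i]->deposit_node, &pmd_page->deposit_head);
  llist_add(&pmd_page->deposit_node, &pud_page->deposit_head);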

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/mm.h       |  9 +++++----
 include/linux/mm_types.h |  8 +++++---
 kernel/fork.c            |  4 ++--
 mm/pgtable-generic.c     | 15 +++++----------
 4 files changed, 17 insertions(+), 19 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 17e712207d74..01b62da34794 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -10,6 +10,7 @@
 #include <linux/gfp.h>
 #include <linux/bug.h>
 #include <linux/list.h>
+#include <linux/llist.h>
 #include <linux/mmzone.h>
 #include <linux/rbtree.h>
 #include <linux/atomic.h>
@@ -2249,7 +2250,7 @@ static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
 static inline bool pmd_ptlock_init(struct page *page)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	page->pmd_huge_pte = NULL;
+	init_llist_head(&page->deposit_head);
 #endif
 	return ptlock_init(page);
 }
@@ -2257,12 +2258,12 @@ static inline bool pmd_ptlock_init(struct page *page)
 static inline void pmd_ptlock_free(struct page *page)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	VM_BUG_ON_PAGE(page->pmd_huge_pte, page);
+	VM_BUG_ON_PAGE(!llist_empty(&page->deposit_head), page);
 #endif
 	ptlock_free(page);
 }
 
-#define pmd_huge_pte(mm, pmd) (pmd_to_page(pmd)->pmd_huge_pte)
+#define huge_pmd_deposit_head(mm, pmd) (pmd_to_page(pmd)->deposit_head)
 
 #else
 
@@ -2274,7 +2275,7 @@ static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
 static inline bool pmd_ptlock_init(struct page *page) { return true; }
 static inline void pmd_ptlock_free(struct page *page) {}
 
-#define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)
+#define huge_pmd_deposit_head(mm, pmd) ((mm)->deposit_head_pmd)
 
 #endif
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 496c3ff97cce..be842926577a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -6,6 +6,7 @@
 
 #include <linux/auxvec.h>
 #include <linux/list.h>
+#include <linux/llist.h>
 #include <linux/spinlock.h>
 #include <linux/rbtree.h>
 #include <linux/rwsem.h>
@@ -143,8 +144,8 @@ struct page {
 			struct list_head deferred_list;
 		};
 		struct {	/* Page table pages */
-			unsigned long _pt_pad_1;	/* compound_head */
-			pgtable_t pmd_huge_pte; /* protected by page->ptl */
+			struct llist_head deposit_head; /* pgtable deposit list head */
+			struct llist_node deposit_node; /* pgtable deposit list node */
 			unsigned long _pt_pad_2;	/* mapping */
 			union {
 				struct mm_struct *pt_mm; /* x86 pgds only */
@@ -511,7 +512,8 @@ struct mm_struct {
 		struct mmu_notifier_subscriptions *notifier_subscriptions;
 #endif
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
-		pgtable_t pmd_huge_pte; /* protected by page_table_lock */
+		/* pgtable deposit list head, protected by page_table_lock */
+		struct llist_head deposit_head_pmd;
 #endif
 #ifdef CONFIG_NUMA_BALANCING
 		/*
diff --git a/kernel/fork.c b/kernel/fork.c
index 138cd6ca50da..9c8e880538de 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -661,7 +661,7 @@ static void check_mm(struct mm_struct *mm)
 				mm_pgtables_bytes(mm));
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
-	VM_BUG_ON_MM(mm->pmd_huge_pte, mm);
+	VM_BUG_ON_MM(!llist_empty(&mm->deposit_head_pmd), mm);
 #endif
 }
 
@@ -1022,7 +1022,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mmu_notifier_subscriptions_init(mm);
 	init_tlb_flush_pending(mm);
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
-	mm->pmd_huge_pte = NULL;
+	init_llist_head(&mm->deposit_head_pmd);
 #endif
 	mm_init_uprobes_state(mm);
 
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 9578db83e312..dbb0154165f1 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -164,11 +164,7 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 	assert_spin_locked(pmd_lockptr(mm, pmdp));
 
 	/* FIFO */
-	if (!pmd_huge_pte(mm, pmdp))
-		INIT_LIST_HEAD(&pgtable->lru);
-	else
-		list_add(&pgtable->lru, &pmd_huge_pte(mm, pmdp)->lru);
-	pmd_huge_pte(mm, pmdp) = pgtable;
+	llist_add(&pgtable->deposit_node, &huge_pmd_deposit_head(mm, pmdp));
 }
 #endif
 
@@ -180,12 +176,11 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 
 	assert_spin_locked(pmd_lockptr(mm, pmdp));
 
+	/* only withdraw from a non empty list */
+	VM_BUG_ON(llist_empty(&huge_pmd_deposit_head(mm, pmdp)));
 	/* FIFO */
-	pgtable = pmd_huge_pte(mm, pmdp);
-	pmd_huge_pte(mm, pmdp) = list_first_entry_or_null(&pgtable->lru,
-							  struct page, lru);
-	if (pmd_huge_pte(mm, pmdp))
-		list_del(&pgtable->lru);
+	pgtable = llist_entry(llist_del_first(&huge_pmd_deposit_head(mm, pmdp)),
+			struct page, deposit_node);
 	return pgtable;
 }
 #endif
-- 
2.28.0



* [RFC PATCH v2 04/30] mm: add new helper functions to allocate one PMD page with 512 PTE pages.
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
                   ` (2 preceding siblings ...)
  2020-09-28 17:54 ` [RFC PATCH v2 03/30] mm: thp: use single linked list for THP page table page deposit Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 05/30] mm: thp: add page table deposit/withdraw functions for PUD THP Zi Yan
                   ` (26 subsequent siblings)
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

This prepares for PUD THP support, which allocates 512 of such PMD pages
when creating a PUD THP. These page table pages will be withdrawn during
THP split.
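
A caller would pair the two helpers roughly as follows (a sketch; the real
fault path carries more error handling):

  /* Sketch: allocate one PMD page with 512 PTE pages deposited on it,
   * use it, and later free it (withdrawing and freeing the PTE pages). */
  pmd_t *pmd = pmd_alloc_one_page_with_ptes(mm, haddr);

  if (unlikely(!pmd))
          return VM_FAULT_OOM;
  /* ... map and populate the PUD THP ... */
  pmd_free_page_with_ptes(mm, pmd);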

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 arch/x86/include/asm/pgalloc.h | 60 ++++++++++++++++++++++++++++++++++
 arch/x86/mm/pgtable.c          | 25 ++++++++++++++
 include/linux/huge_mm.h        |  3 ++
 3 files changed, 88 insertions(+)

diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index 62ad61d6fefc..b24284522973 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -52,6 +52,19 @@ extern pgd_t *pgd_alloc(struct mm_struct *);
 extern void pgd_free(struct mm_struct *mm, pgd_t *pgd);
 
 extern pgtable_t pte_alloc_one(struct mm_struct *);
+extern pgtable_t pte_alloc_order(struct mm_struct *mm, unsigned long address,
+		int order);
+
+static inline void pte_free_order(struct mm_struct *mm, struct page *pte,
+		int order)
+{
+	int i;
+
+	for (i = 0; i < (1<<order); i++) {
+		pgtable_pte_page_dtor(&pte[i]);
+		__free_page(&pte[i]);
+	}
+}
 
 extern void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte);
 
@@ -87,6 +100,53 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
 #define pmd_pgtable(pmd) pmd_page(pmd)
 
 #if CONFIG_PGTABLE_LEVELS > 2
+static inline pmd_t *pmd_alloc_one_page_with_ptes(struct mm_struct *mm, unsigned long addr)
+{
+	pgtable_t pte_pgtables;
+	pmd_t *pmd;
+	spinlock_t *pmd_ptl;
+	int i;
+
+	pte_pgtables = pte_alloc_order(mm, addr,
+		HPAGE_PUD_ORDER - HPAGE_PMD_ORDER);
+	if (!pte_pgtables)
+		return NULL;
+
+	pmd = pmd_alloc_one(mm, addr);
+	if (unlikely(!pmd)) {
+		pte_free_order(mm, pte_pgtables,
+			HPAGE_PUD_ORDER - HPAGE_PMD_ORDER);
+		return NULL;
+	}
+	pmd_ptl = pmd_lock(mm, pmd);
+
+	for (i = 0; i < (1<<(HPAGE_PUD_ORDER - HPAGE_PMD_ORDER)); i++)
+		pgtable_trans_huge_deposit(mm, pmd, pte_pgtables + i);
+
+	spin_unlock(pmd_ptl);
+
+	return pmd;
+}
+
+static inline void pmd_free_page_with_ptes(struct mm_struct *mm, pmd_t *pmd)
+{
+	spinlock_t *pmd_ptl;
+	int i;
+
+	BUG_ON((unsigned long)pmd & (PAGE_SIZE-1));
+	pmd_ptl = pmd_lock(mm, pmd);
+
+	for (i = 0; i < (1<<(HPAGE_PUD_ORDER - HPAGE_PMD_ORDER)); i++) {
+		pgtable_t pte_pgtable;
+
+		pte_pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+		pte_free(mm, pte_pgtable);
+	}
+
+	spin_unlock(pmd_ptl);
+	pmd_free(mm, pmd);
+}
+
 extern void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd);
 
 static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd,
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index dfd82f51ba66..7be73aee6183 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -33,6 +33,31 @@ pgtable_t pte_alloc_one(struct mm_struct *mm)
 	return __pte_alloc_one(mm, __userpte_alloc_gfp);
 }
 
+pgtable_t pte_alloc_order(struct mm_struct *mm, unsigned long address, int order)
+{
+	struct page *pte;
+	int i;
+
+	pte = alloc_pages(__userpte_alloc_gfp, order);
+	if (!pte)
+		return NULL;
+	split_page(pte, order);
+	for (i = 1; i < (1 << order); i++)
+		set_page_private(pte + i, 0);
+
+	for (i = 0; i < (1<<order); i++) {
+		if (!pgtable_pte_page_ctor(&pte[i])) {
+			__free_page(&pte[i]);
+			while (--i >= 0) {
+				pgtable_pte_page_dtor(&pte[i]);
+				__free_page(&pte[i]);
+			}
+			return NULL;
+		}
+	}
+	return pte;
+}
+
 static int __init setup_userpte(char *arg)
 {
 	if (!arg)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 8a8bc46a2432..e9d228d4fc69 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -115,6 +115,9 @@ extern struct kobj_attribute shmem_enabled_attr;
 #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
 #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
 
+#define HPAGE_PUD_ORDER (HPAGE_PUD_SHIFT-PAGE_SHIFT)
+#define HPAGE_PUD_NR (1<<HPAGE_PUD_ORDER)
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define HPAGE_PMD_SHIFT PMD_SHIFT
 #define HPAGE_PMD_SIZE	((1UL) << HPAGE_PMD_SHIFT)
-- 
2.28.0



* [RFC PATCH v2 05/30] mm: thp: add page table deposit/withdraw functions for PUD THP.
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
                   ` (3 preceding siblings ...)
  2020-09-28 17:54 ` [RFC PATCH v2 04/30] mm: add new helper functions to allocate one PMD page with 512 PTE pages Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 06/30] mm: change thp_order and thp_nr as we will have not just PMD THPs Zi Yan
                   ` (25 subsequent siblings)
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

We deposit 512 PMD pages, each of which has 512 PTE pages deposited in
its ->deposit_head, onto mm->deposit_head_pud. They will be withdrawn
and used when a PUD THP is split into 512 PMD THPs. In this way, when
any of the 512 PMD THPs is split further, the existing code path can
withdraw its PTE pages.
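
For example, the PUD THP fault path could deposit its page tables roughly
like this (a sketch; pmds[] stands for 512 PMD pages obtained via
pmd_alloc_one_page_with_ptes() from the previous patch):

  /* Sketch: stash 512 PMD pages, each already carrying 512 deposited
   * PTE pages, on the mm-wide PUD deposit list under the PUD lock. */
  ptl = pud_lock(mm, pudp);
  for (i = 0; i < HPAGE_PUD_NR / HPAGE_PMD_NR; i++)
          pgtable_trans_huge_pud_deposit(mm, pudp, virt_to_page(pmds[i]));
  spin_unlock(ptl);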

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/mm.h       |  2 ++
 include/linux/mm_types.h |  3 +++
 include/linux/pgtable.h  |  3 +++
 kernel/fork.c            |  6 ++++++
 mm/pgtable-generic.c     | 23 +++++++++++++++++++++++
 5 files changed, 37 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 01b62da34794..8f54f06c8eb6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2321,6 +2321,8 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
 	return ptl;
 }
 
+#define huge_pud_deposit_head(mm, pud) ((mm)->deposit_head_pud)
+
 extern void __init pagecache_init(void);
 extern void __init free_area_init_memoryless_node(int nid);
 extern void free_initmem(void);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index be842926577a..5ff4dd6a3e32 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -515,6 +515,9 @@ struct mm_struct {
 		/* pgtable deposit list head, protected by page_table_lock */
 		struct llist_head deposit_head_pmd;
 #endif
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+		struct llist_head deposit_head_pud; /* protected by page_table_lock */
+#endif
 #ifdef CONFIG_NUMA_BALANCING
 		/*
 		 * numa_next_scan is the next time that the PTEs will be marked
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 177eab8e1c31..1f6d46465c54 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -465,10 +465,13 @@ static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
 #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
 extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 				       pgtable_t pgtable);
+extern void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
+				       pgtable_t pgtable);
 #endif
 
 #ifndef __HAVE_ARCH_PGTABLE_WITHDRAW
 extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
+extern pgtable_t pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp);
 #endif
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/kernel/fork.c b/kernel/fork.c
index 9c8e880538de..86fbeec751ef 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -663,6 +663,9 @@ static void check_mm(struct mm_struct *mm)
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	VM_BUG_ON_MM(!llist_empty(&mm->deposit_head_pmd), mm);
 #endif
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+	VM_BUG_ON_MM(!llist_empty(&mm->deposit_head_pud), mm);
+#endif
 }
 
 #define allocate_mm()	(kmem_cache_alloc(mm_cachep, GFP_KERNEL))
@@ -1023,6 +1026,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	init_tlb_flush_pending(mm);
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	init_llist_head(&mm->deposit_head_pmd);
+#endif
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+	init_llist_head(&mm->deposit_head_pud);
 #endif
 	mm_init_uprobes_state(mm);
 
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index dbb0154165f1..a014cf847067 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -166,6 +166,15 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 	/* FIFO */
 	llist_add(&pgtable->deposit_node, &huge_pmd_deposit_head(mm, pmdp));
 }
+
+void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
+				pgtable_t pgtable)
+{
+	assert_spin_locked(pud_lockptr(mm, pudp));
+
+	/* FIFO */
+	llist_add(&pgtable->deposit_node, &huge_pud_deposit_head(mm, pudp));
+}
 #endif
 
 #ifndef __HAVE_ARCH_PGTABLE_WITHDRAW
@@ -183,6 +192,20 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 			struct page, deposit_node);
 	return pgtable;
 }
+
+pgtable_t pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp)
+{
+	pgtable_t pgtable;
+
+	assert_spin_locked(pud_lockptr(mm, pudp));
+
+	/* only withdraw from a non empty list */
+	VM_BUG_ON(llist_empty(&huge_pud_deposit_head(mm, pudp)));
+	/* FIFO */
+	pgtable = llist_entry(llist_del_first(&huge_pud_deposit_head(mm, pudp)),
+			struct page, deposit_node);
+	return pgtable;
+}
 #endif
 
 #ifndef __HAVE_ARCH_PMDP_INVALIDATE
-- 
2.28.0



* [RFC PATCH v2 06/30] mm: change thp_order and thp_nr as we will have not just PMD THPs.
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
                   ` (4 preceding siblings ...)
  2020-09-28 17:54 ` [RFC PATCH v2 05/30] mm: thp: add page table deposit/withdraw functions for PUD THP Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 07/30] mm: thp: add anonymous PUD THP page fault support without enabling it Zi Yan
                   ` (24 subsequent siblings)
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

As PUD THP is going to be added in the following patches, thp_order and
thp_nr must be able to return HPAGE_PUD_ORDER and HPAGE_PUD_NR,
respectively, so they now read the stored compound order instead of
assuming PMD size.
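
Concretely, for a PUD THP head page on x86_64 (illustrative values):

  /* page[1].compound_order now stores the real order: */
  thp_order(page);    /* 18, i.e., HPAGE_PUD_ORDER = PUD_SHIFT - PAGE_SHIFT */
  thp_nr_pages(page); /* 262144 subpages, i.e., 1GB of 4KB pages */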

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/huge_mm.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e9d228d4fc69..addd206150e2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -279,7 +279,7 @@ static inline unsigned int thp_order(struct page *page)
 {
 	VM_BUG_ON_PGFLAGS(PageTail(page), page);
 	if (PageHead(page))
-		return HPAGE_PMD_ORDER;
+		return page[1].compound_order;
 	return 0;
 }
 
@@ -291,7 +291,7 @@ static inline int thp_nr_pages(struct page *page)
 {
 	VM_BUG_ON_PGFLAGS(PageTail(page), page);
 	if (PageHead(page))
-		return HPAGE_PMD_NR;
+		return (1<<page[1].compound_order);
 	return 1;
 }
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH v2 07/30] mm: thp: add anonymous PUD THP page fault support without enabling it.
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
                   ` (5 preceding siblings ...)
  2020-09-28 17:54 ` [RFC PATCH v2 06/30] mm: change thp_order and thp_nr as we will have more than just PMD THPs Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 08/30] mm: thp: add PUD THP support for copy_huge_pud Zi Yan
                   ` (23 subsequent siblings)
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

This adds PUD THP support for anonymous pages. Once the page fault path
is enabled, applications will be able to get PUD THPs during page faults
when their VMAs are larger than the PUD page size.

Unlike the PMD-level huge zero page, no huge zero PUD page is created
and shared among read-only zero PUD mappings. We do not want to reserve
1GB of physical memory for this use, assuming the case will be rare.

New PUD THP related events are added too.
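
For illustration, a minimal userspace sketch that could trigger such a
fault once the path is enabled (the 1GB-specific madvise value is only
added later in this series, so MADV_HUGEPAGE below is a stand-in):

#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#define PUD_SIZE (1UL << 30)

int main(void)
{
	/* over-allocate so a PUD-aligned, PUD-sized range exists */
	char *map = mmap(NULL, 2 * PUD_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *aligned;

	if (map == MAP_FAILED)
		return 1;
	aligned = (char *)(((unsigned long)map + PUD_SIZE - 1) &
			   ~(PUD_SIZE - 1));
	/* stand-in for the PUD THP madvise added later in the series */
	madvise(aligned, PUD_SIZE, MADV_HUGEPAGE);
	/* the first write fault can then map 1GB with a single PUD */
	memset(aligned, 1, PUD_SIZE);
	return 0;
}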

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 arch/x86/include/asm/pgtable.h |   2 +
 drivers/base/node.c            |   3 +
 fs/proc/meminfo.c              |   2 +
 include/linux/huge_mm.h        |   6 ++
 include/linux/mmzone.h         |   1 +
 include/linux/vm_event_item.h  |   3 +
 mm/huge_memory.c               | 105 +++++++++++++++++++++++++++++++++
 mm/page_alloc.c                |   3 +-
 mm/rmap.c                      |  24 ++++++--
 mm/vmstat.c                    |   4 ++
 10 files changed, 147 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a02c67291cfc..199de6be2f6d 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1141,6 +1141,8 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm, unsigned long
 	return native_pmdp_get_and_clear(pmdp);
 }
 
+#define mk_pud(page, pgprot)   pfn_pud(page_to_pfn(page), (pgprot))
+
 #define __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR
 static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
 					unsigned long addr, pud_t *pudp)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 9426b0f1f660..fe809c914be0 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -428,6 +428,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       "Node %d SUnreclaim:     %8lu kB\n"
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 		       "Node %d AnonHugePages:  %8lu kB\n"
+		       "Node %d AnonHugePUDPages: %8lu kB\n"
 		       "Node %d ShmemHugePages: %8lu kB\n"
 		       "Node %d ShmemPmdMapped: %8lu kB\n"
 		       "Node %d FileHugePages: %8lu kB\n"
@@ -457,6 +458,8 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       ,
 		       nid, K(node_page_state(pgdat, NR_ANON_THPS) *
 				       HPAGE_PMD_NR),
+			   nid, K(node_page_state(pgdat, NR_ANON_THPS_PUD) *
+				       HPAGE_PUD_NR),
 		       nid, K(node_page_state(pgdat, NR_SHMEM_THPS) *
 				       HPAGE_PMD_NR),
 		       nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED) *
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 887a5532e449..b60e0c241015 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -130,6 +130,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	show_val_kb(m, "AnonHugePages:  ",
 		    global_node_page_state(NR_ANON_THPS) * HPAGE_PMD_NR);
+	show_val_kb(m, "AnonHugePUDPages:  ",
+			global_node_page_state(NR_ANON_THPS_PUD) * HPAGE_PUD_NR);
 	show_val_kb(m, "ShmemHugePages: ",
 		    global_node_page_state(NR_SHMEM_THPS) * HPAGE_PMD_NR);
 	show_val_kb(m, "ShmemPmdMapped: ",
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index addd206150e2..7528652400e4 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -18,10 +18,15 @@ extern int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 extern void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud);
+extern int do_huge_pud_anonymous_page(struct vm_fault *vmf);
 #else
 static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
 {
 }
+static inline int do_huge_pud_anonymous_page(struct vm_fault *vmf)
+{
+	return VM_FAULT_FALLBACK;
+}
 #endif
 
 extern vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd);
@@ -323,6 +328,7 @@ struct page *mm_get_huge_zero_page(struct mm_struct *mm);
 void mm_put_huge_zero_page(struct mm_struct *mm);
 
 #define mk_huge_pmd(page, prot) pmd_mkhuge(mk_pmd(page, prot))
+#define mk_huge_pud(page, prot) pud_mkhuge(mk_pud(page, prot))
 
 static inline bool thp_migration_supported(void)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7e0ea3fe95ca..cbc768d364fd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -196,6 +196,7 @@ enum node_stat_item {
 	NR_FILE_THPS,
 	NR_FILE_PMDMAPPED,
 	NR_ANON_THPS,
+	NR_ANON_THPS_PUD,
 	NR_VMSCAN_WRITE,
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
 	NR_DIRTIED,		/* page dirtyings since bootup */
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 18e75974d4e3..416d9966fa3f 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -93,6 +93,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_DEFERRED_SPLIT_PAGE,
 		THP_SPLIT_PMD,
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+		THP_FAULT_ALLOC_PUD,
+		THP_FAULT_FALLBACK_PUD,
+		THP_FAULT_FALLBACK_PUD_CHARGE,
 		THP_SPLIT_PUD,
 #endif
 		THP_ZERO_PAGE_ALLOC,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b1c7dc8a6f96..20a3d393d451 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -933,6 +933,111 @@ vm_fault_t vmf_insert_pfn_pud_prot(struct vm_fault *vmf, pfn_t pfn,
 	return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(vmf_insert_pfn_pud_prot);
+
+static int __do_huge_pud_anonymous_page(struct vm_fault *vmf, struct page *page,
+		gfp_t gfp)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	pmd_t *pmd_pgtable;
+	unsigned long haddr = vmf->address & HPAGE_PUD_MASK;
+	int ret = 0;
+
+	VM_BUG_ON_PAGE(!PageCompound(page), page);
+
+	if (mem_cgroup_charge(page, vma->vm_mm, gfp)) {
+		put_page(page);
+		count_vm_event(THP_FAULT_FALLBACK_PUD);
+		count_vm_event(THP_FAULT_FALLBACK_PUD_CHARGE);
+		return VM_FAULT_FALLBACK;
+	}
+	cgroup_throttle_swaprate(page, gfp);
+
+	pmd_pgtable = pmd_alloc_one_page_with_ptes(vma->vm_mm, haddr);
+	if (unlikely(!pmd_pgtable)) {
+		ret = VM_FAULT_OOM;
+		goto release;
+	}
+
+	clear_huge_page(page, vmf->address, HPAGE_PUD_NR);
+	/*
+	 * The memory barrier inside __SetPageUptodate makes sure that
+	 * clear_huge_page writes become visible before the set_pmd_at()
+	 * write.
+	 */
+	__SetPageUptodate(page);
+
+	vmf->ptl = pud_lock(vma->vm_mm, vmf->pud);
+	if (unlikely(!pud_none(*vmf->pud))) {
+		goto unlock_release;
+	} else {
+		pud_t entry;
+		int i;
+
+		ret = check_stable_address_space(vma->vm_mm);
+		if (ret)
+			goto unlock_release;
+
+		/* Deliver the page fault to userland */
+		if (userfaultfd_missing(vma)) {
+			vm_fault_t ret2;
+
+			spin_unlock(vmf->ptl);
+			put_page(page);
+			pmd_free_page_with_ptes(vma->vm_mm, pmd_pgtable);
+			ret2 = handle_userfault(vmf, VM_UFFD_MISSING);
+			VM_BUG_ON(ret2 & VM_FAULT_FALLBACK);
+			return ret2;
+		}
+
+		entry = mk_huge_pud(page, vma->vm_page_prot);
+		entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
+		page_add_new_anon_rmap(page, vma, haddr, true);
+		lru_cache_add_inactive_or_unevictable(page, vma);
+		pgtable_trans_huge_pud_deposit(vma->vm_mm, vmf->pud,
+				virt_to_page(pmd_pgtable));
+		set_pud_at(vma->vm_mm, haddr, vmf->pud, entry);
+		add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PUD_NR);
+		mm_inc_nr_pmds(vma->vm_mm);
+		for (i = 0; i < (1<<(HPAGE_PUD_ORDER - HPAGE_PMD_ORDER)); i++)
+			mm_inc_nr_ptes(vma->vm_mm);
+		spin_unlock(vmf->ptl);
+		count_vm_event(THP_FAULT_ALLOC_PUD);
+	}
+
+	return 0;
+unlock_release:
+	spin_unlock(vmf->ptl);
+release:
+	if (pmd_pgtable)
+		pmd_free_page_with_ptes(vma->vm_mm, pmd_pgtable);
+	put_page(page);
+	return ret;
+
+}
+
+int do_huge_pud_anonymous_page(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	gfp_t gfp;
+	struct page *page;
+	unsigned long haddr = vmf->address & HPAGE_PUD_MASK;
+
+	if (haddr < vma->vm_start || haddr + HPAGE_PUD_SIZE > vma->vm_end)
+		return VM_FAULT_FALLBACK;
+	if (unlikely(anon_vma_prepare(vma)))
+		return VM_FAULT_OOM;
+	/* no khugepaged_enter, since PUD THP is not supported by khugepaged */
+
+	gfp = alloc_hugepage_direct_gfpmask(vma);
+	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PUD_ORDER);
+	if (unlikely(!page)) {
+		count_vm_event(THP_FAULT_FALLBACK_PUD);
+		return VM_FAULT_FALLBACK;
+	}
+	prep_transhuge_page(page);
+	return __do_huge_pud_anonymous_page(vmf, page, gfp);
+}
+
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
 static void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6b1b4a331792..29abeff09fcc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5434,7 +5434,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			K(node_page_state(pgdat, NR_SHMEM_THPS) * HPAGE_PMD_NR),
 			K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)
 					* HPAGE_PMD_NR),
-			K(node_page_state(pgdat, NR_ANON_THPS) * HPAGE_PMD_NR),
+			K(node_page_state(pgdat, NR_ANON_THPS) * HPAGE_PMD_NR +
+			  node_page_state(pgdat, NR_ANON_THPS_PUD) * HPAGE_PUD_NR),
 #endif
 			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
 			node_page_state(pgdat, NR_KERNEL_STACK_KB),
diff --git a/mm/rmap.c b/mm/rmap.c
index 1b84945d655c..5683f367a792 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -726,6 +726,7 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
+	pud_t pude;
 	pmd_t *pmd = NULL;
 	pmd_t pmde;
 
@@ -738,7 +739,10 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
 		goto out;
 
 	pud = pud_offset(p4d, address);
-	if (!pud_present(*pud))
+
+	pude = *pud;
+	barrier();
+	if (!pud_present(pude) || pud_trans_huge(pude))
 		goto out;
 
 	pmd = pmd_offset(pud, address);
@@ -1137,8 +1141,12 @@ void do_page_add_anon_rmap(struct page *page,
 		 * pte lock(a spinlock) is held, which implies preemption
 		 * disabled.
 		 */
-		if (compound)
-			__inc_lruvec_page_state(page, NR_ANON_THPS);
+		if (compound) {
+			if (nr == HPAGE_PMD_NR)
+				__inc_lruvec_page_state(page, NR_ANON_THPS);
+			else
+				__inc_lruvec_page_state(page, NR_ANON_THPS_PUD);
+		}
 		__mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
 	}
 
@@ -1180,7 +1188,10 @@ void page_add_new_anon_rmap(struct page *page,
 		if (hpage_pincount_available(page))
 			atomic_set(compound_pincount_ptr(page), 0);
 
-		__inc_lruvec_page_state(page, NR_ANON_THPS);
+		if (nr == HPAGE_PMD_NR)
+			__inc_lruvec_page_state(page, NR_ANON_THPS);
+		else
+			__inc_lruvec_page_state(page, NR_ANON_THPS_PUD);
 	} else {
 		/* Anon THP always mapped first with PMD */
 		VM_BUG_ON_PAGE(PageTransCompound(page), page);
@@ -1286,7 +1297,10 @@ static void page_remove_anon_compound_rmap(struct page *page)
 	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
 		return;
 
-	__dec_lruvec_page_state(page, NR_ANON_THPS);
+	if (thp_nr_pages(page) == HPAGE_PMD_NR)
+		__dec_lruvec_page_state(page, NR_ANON_THPS);
+	else
+		__dec_lruvec_page_state(page, NR_ANON_THPS_PUD);
 
 	if (TestClearPageDoubleMap(page)) {
 		/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 79e5cd0abd0e..a9e50ef6a40d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1209,6 +1209,7 @@ const char * const vmstat_text[] = {
 	"nr_file_hugepages",
 	"nr_file_pmdmapped",
 	"nr_anon_transparent_hugepages",
+	"nr_anon_transparent_pud_hugepages",
 	"nr_vmscan_write",
 	"nr_vmscan_immediate_reclaim",
 	"nr_dirtied",
@@ -1326,6 +1327,9 @@ const char * const vmstat_text[] = {
 	"thp_deferred_split_page",
 	"thp_split_pmd",
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+	"thp_fault_alloc_pud",
+	"thp_fault_fallback_pud",
+	"thp_fault_fallback_pud_charge",
 	"thp_split_pud",
 #endif
 	"thp_zero_page_alloc",
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH v2 08/30] mm: thp: add PUD THP support for copy_huge_pud.
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
                   ` (6 preceding siblings ...)
  2020-09-28 17:54 ` [RFC PATCH v2 07/30] mm: thp: add anonymous PUD THP page fault support without enabling it Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 09/30] mm: thp: add PUD THP support to zap_huge_pud Zi Yan
                   ` (22 subsequent siblings)
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

copy_huge_pud() needs to allocate one PMD page table page and 512 PTE
page table pages and deposit them when copying a PUD THP, similar to
what is done at PUD THP page fault time.
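
As a rough cost sketch (illustrative, assuming x86_64 page table
geometry; not part of the patch):

static int pgtable_pages_per_pud_thp(void)
{
	int pte_pages = 1 << (30 - 21);	/* 512 PTE page table pages */
	int pmd_pages = 1;		/* one PMD page table page */

	/* 513 4KB pages, i.e. ~2MB deposited per copied PUD THP */
	return pmd_pages + pte_pages;
}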

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c | 36 ++++++++++++++++++++++++++++--------
 1 file changed, 28 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 20a3d393d451..ea9fbedcda26 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1264,7 +1264,12 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 {
 	spinlock_t *dst_ptl, *src_ptl;
 	pud_t pud;
-	int ret;
+	pmd_t *pmd_pgtable = NULL;
+	int ret = -ENOMEM;
+
+	pmd_pgtable = pmd_alloc_one_page_with_ptes(vma->vm_mm, addr);
+	if (unlikely(!pmd_pgtable))
+		goto out;
 
 	dst_ptl = pud_lock(dst_mm, dst_pud);
 	src_ptl = pud_lockptr(src_mm, src_pud);
@@ -1272,16 +1277,30 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 
 	ret = -EAGAIN;
 	pud = *src_pud;
-	if (unlikely(!pud_trans_huge(pud) && !pud_devmap(pud)))
-		goto out_unlock;
 
 	/*
-	 * When page table lock is held, the huge zero pud should not be
-	 * under splitting since we don't split the page itself, only pud to
-	 * a page table.
+	 * only transparent huge pud page needs extra page table pages for
+	 * possible huge page split
 	 */
-	if (is_huge_zero_pud(pud)) {
-		/* No huge zero pud yet */
+	if (!pud_trans_huge(pud))
+		pmd_free_page_with_ptes(dst_mm, pmd_pgtable);
+
+	if (unlikely(!pud_trans_huge(pud) && !pud_devmap(pud)))
+		goto out_unlock;
+
+	if (pud_trans_huge(pud)) {
+		struct page *src_page;
+		int i;
+
+		src_page = pud_page(pud);
+		VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
+		get_page(src_page);
+		page_dup_rmap(src_page, true);
+		add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PUD_NR);
+		mm_inc_nr_pmds(dst_mm);
+		for (i = 0; i < (1<<(HPAGE_PUD_ORDER - HPAGE_PMD_ORDER)); i++)
+			mm_inc_nr_ptes(dst_mm);
+		pgtable_trans_huge_pud_deposit(dst_mm, dst_pud, virt_to_page(pmd_pgtable));
 	}
 
 	pudp_set_wrprotect(src_mm, addr, src_pud);
@@ -1292,6 +1311,7 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 out_unlock:
 	spin_unlock(src_ptl);
 	spin_unlock(dst_ptl);
+out:
 	return ret;
 }
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH v2 09/30] mm: thp: add PUD THP support to zap_huge_pud.
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
                   ` (7 preceding siblings ...)
  2020-09-28 17:54 ` [RFC PATCH v2 08/30] mm: thp: add PUD THP support for copy_huge_pud Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 10/30] fs: proc: add PUD THP kpageflag Zi Yan
                   ` (21 subsequent siblings)
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

The 513 preallocated page table pages (1 PMD and 512 PTE) need to be
freed when a PUD THP is removed. zap_pud_deposited_table() is added to
perform the action.
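
The freeing must mirror the deposit side exactly; a sketch of the
invariant (illustrative, not part of the patch):

/*
 * Per PUD THP, with 512 == 1 << (HPAGE_PUD_ORDER - HPAGE_PMD_ORDER)
 * on x86_64:
 *
 *   deposit  (fault/fork): mm_inc_nr_pmds() x 1, mm_inc_nr_ptes() x 512
 *   withdraw (zap):        mm_dec_nr_pmds() x 1, mm_dec_nr_ptes() x 512
 */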

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c | 48 +++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 45 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ea9fbedcda26..76069affebef 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2013,11 +2013,27 @@ spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma)
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+static inline void zap_pud_deposited_table(struct mm_struct *mm, pud_t *pud)
+{
+	pgtable_t pgtable;
+	int i;
+
+	pgtable = pgtable_trans_huge_pud_withdraw(mm, pud);
+	pmd_free_page_with_ptes(mm, (pmd_t *)page_address(pgtable));
+
+	mm_dec_nr_pmds(mm);
+	for (i = 0; i < (1<<(HPAGE_PUD_ORDER - HPAGE_PMD_ORDER)); i++)
+		mm_dec_nr_ptes(mm);
+}
+
 int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pud_t *pud, unsigned long addr)
 {
+	pud_t orig_pud;
 	spinlock_t *ptl;
 
+	tlb_change_page_size(tlb, HPAGE_PUD_SIZE);
+
 	ptl = __pud_trans_huge_lock(pud, vma);
 	if (!ptl)
 		return 0;
@@ -2027,14 +2043,40 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	 * pgtable_trans_huge_withdraw after finishing pudp related
 	 * operations.
 	 */
-	pudp_huge_get_and_clear_full(tlb->mm, addr, pud, tlb->fullmm);
+	orig_pud = pudp_huge_get_and_clear_full(tlb->mm, addr, pud,
+			tlb->fullmm);
 	tlb_remove_pud_tlb_entry(tlb, pud, addr);
 	if (vma_is_special_huge(vma)) {
 		spin_unlock(ptl);
 		/* No zero page support yet */
+	} else if (is_huge_zero_pud(orig_pud)) {
+		zap_pud_deposited_table(tlb->mm, pud);
+		spin_unlock(ptl);
+		tlb_remove_page_size(tlb, pud_page(orig_pud), HPAGE_PUD_SIZE);
 	} else {
-		/* No support for anonymous PUD pages yet */
-		BUG();
+		struct page *page = NULL;
+		int flush_needed = 1;
+
+		if (pud_present(orig_pud)) {
+			page = pud_page(orig_pud);
+			page_remove_rmap(page, true);
+			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
+			VM_BUG_ON_PAGE(!PageHead(page), page);
+		} else
+			WARN_ONCE(1, "Non-present huge pud without pud migration enabled!");
+
+		if (PageAnon(page)) {
+			zap_pud_deposited_table(tlb->mm, pud);
+			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PUD_NR);
+		} else {
+			if (arch_needs_pgtable_deposit())
+				zap_pud_deposited_table(tlb->mm, pud);
+			add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PUD_NR);
+		}
+
+		spin_unlock(ptl);
+		if (flush_needed)
+			tlb_remove_page_size(tlb, page, HPAGE_PUD_SIZE);
 	}
 	return 1;
 }
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH v2 10/30] fs: proc: add PUD THP kpageflag.
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
                   ` (8 preceding siblings ...)
  2020-09-28 17:54 ` [RFC PATCH v2 09/30] mm: thp: add PUD THP support to zap_huge_pud Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 11/30] mm: thp: handle the PUD THP reference bit Zi Yan
                   ` (20 subsequent siblings)
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

Bit 27 is used to identify PUD THP.
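
For illustration, a minimal userspace sketch that tests the new bit for
a given PFN via /proc/kpageflags (not part of the patch):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define KPF_PUD_THP 27

/* 1 if the PFN belongs to a PUD THP, 0 if not, -1 on error */
static int pfn_is_pud_thp(uint64_t pfn)
{
	uint64_t flags;
	int fd = open("/proc/kpageflags", O_RDONLY);

	if (fd < 0)
		return -1;
	/* /proc/kpageflags holds one u64 of flags per PFN */
	if (pread(fd, &flags, sizeof(flags), pfn * sizeof(flags)) !=
	    sizeof(flags)) {
		close(fd);
		return -1;
	}
	close(fd);
	return !!(flags & (1ULL << KPF_PUD_THP));
}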

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 fs/proc/page.c                         | 2 ++
 include/uapi/linux/kernel-page-flags.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/fs/proc/page.c b/fs/proc/page.c
index f3b39a7d2bf3..e4e2ad3612c9 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -161,6 +161,8 @@ u64 stable_page_flags(struct page *page)
 			u |= BIT_ULL(KPF_ZERO_PAGE);
 			u |= BIT_ULL(KPF_THP);
 		}
+		if (compound_order(head) == HPAGE_PUD_ORDER)
+			u |= BIT_ULL(KPF_PUD_THP);
 	} else if (is_zero_pfn(page_to_pfn(page)))
 		u |= BIT_ULL(KPF_ZERO_PAGE);
 
diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
index 6f2f2720f3ac..62c5fc70909b 100644
--- a/include/uapi/linux/kernel-page-flags.h
+++ b/include/uapi/linux/kernel-page-flags.h
@@ -36,5 +36,6 @@
 #define KPF_ZERO_PAGE		24
 #define KPF_IDLE		25
 #define KPF_PGTABLE		26
+#define KPF_PUD_THP		27
 
 #endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH v2 11/30] mm: thp: handle the PUD THP reference bit.
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
                   ` (9 preceding siblings ...)
  2020-09-28 17:54 ` [RFC PATCH v2 10/30] fs: proc: add PUD THP kpageflag Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 12/30] mm: rmap: add mapped/unmapped page order to anonymous page rmap functions Zi Yan
                   ` (19 subsequent siblings)
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

Add PUD-level TLB flush ops and teach page_vma_mapped_walk() about PUD
THPs.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 arch/x86/include/asm/pgtable.h |  3 +++
 arch/x86/mm/pgtable.c          | 13 +++++++++++++
 include/linux/mmu_notifier.h   | 13 +++++++++++++
 include/linux/pgtable.h        | 14 ++++++++++++++
 include/linux/rmap.h           |  1 +
 mm/page_vma_mapped.c           | 33 +++++++++++++++++++++++++++++----
 mm/rmap.c                      | 12 +++++++++---
 7 files changed, 82 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 199de6be2f6d..8bf7bfd71a46 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1127,6 +1127,9 @@ extern int pudp_test_and_clear_young(struct vm_area_struct *vma,
 extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
 				  unsigned long address, pmd_t *pmdp);
 
+#define __HAVE_ARCH_PUDP_CLEAR_YOUNG_FLUSH
+extern int pudp_clear_flush_young(struct vm_area_struct *vma,
+				  unsigned long address, pud_t *pudp);
 
 #define pmd_write pmd_write
 static inline int pmd_write(pmd_t pmd)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 7be73aee6183..e4a2dffcc418 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -633,6 +633,19 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
 
 	return young;
 }
+int pudp_clear_flush_young(struct vm_area_struct *vma,
+			   unsigned long address, pud_t *pudp)
+{
+	int young;
+
+	VM_BUG_ON(address & ~HPAGE_PUD_MASK);
+
+	young = pudp_test_and_clear_young(vma, address, pudp);
+	if (young)
+		flush_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
+
+	return young;
+}
 #endif
 
 /**
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index b8200782dede..4ffa179e654f 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -557,6 +557,19 @@ static inline void mmu_notifier_range_init_migrate(
 	__young;							\
 })
 
+#define pudp_clear_flush_young_notify(__vma, __address, __pudp)		\
+({									\
+	int __young;							\
+	struct vm_area_struct *___vma = __vma;				\
+	unsigned long ___address = __address;				\
+	__young = pudp_clear_flush_young(___vma, ___address, __pudp);	\
+	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
+						  ___address,		\
+						  ___address +		\
+							PUD_SIZE);	\
+	__young;							\
+})
+
 #define ptep_clear_young_notify(__vma, __address, __ptep)		\
 ({									\
 	int __young;							\
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 1f6d46465c54..bb163504fb01 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -243,6 +243,20 @@ static inline int pmdp_clear_flush_young(struct vm_area_struct *vma,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
+#ifndef __HAVE_ARCH_PUDP_CLEAR_YOUNG_FLUSH
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+extern int pudp_clear_flush_young(struct vm_area_struct *vma,
+				  unsigned long address, pud_t *pudp);
+#else
+static inline int pudp_clear_flush_young(struct vm_area_struct *vma,
+				  unsigned long address, pud_t *pudp)
+{
+	BUILD_BUG();
+	return 0;
+}
+#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD  */
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				       unsigned long address,
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 3a6adfa70fb0..0af61dd193d2 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -206,6 +206,7 @@ struct page_vma_mapped_walk {
 	struct page *page;
 	struct vm_area_struct *vma;
 	unsigned long address;
+	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 	spinlock_t *ptl;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 5e77b269c330..f88e845ad5e6 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -145,9 +145,12 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	struct page *page = pvmw->page;
 	pgd_t *pgd;
 	p4d_t *p4d;
-	pud_t *pud;
+	pud_t pude;
 	pmd_t pmde;
 
+	if (!pvmw->pte && !pvmw->pmd && pvmw->pud)
+		return not_found(pvmw);
+
 	/* The only possible pmd mapping has been handled on last iteration */
 	if (pvmw->pmd && !pvmw->pte)
 		return not_found(pvmw);
@@ -174,10 +177,32 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	p4d = p4d_offset(pgd, pvmw->address);
 	if (!p4d_present(*p4d))
 		return false;
-	pud = pud_offset(p4d, pvmw->address);
-	if (!pud_present(*pud))
+	pvmw->pud = pud_offset(p4d, pvmw->address);
+
+	/*
+	 * Make sure the pud value isn't cached in a register by the
+	 * compiler and used as a stale value after we've observed a
+	 * subsequent update.
+	 */
+	pude = READ_ONCE(*pvmw->pud);
+	if (pud_trans_huge(pude)) {
+		pvmw->ptl = pud_lock(mm, pvmw->pud);
+		if (likely(pud_trans_huge(*pvmw->pud))) {
+			if (pvmw->flags & PVMW_MIGRATION)
+				return not_found(pvmw);
+			if (pud_page(*pvmw->pud) != page)
+				return not_found(pvmw);
+			return true;
+		} else if (!pud_present(*pvmw->pud))
+			return not_found(pvmw);
+
+		/* THP pud was split under us: handle on pmd level */
+		spin_unlock(pvmw->ptl);
+		pvmw->ptl = NULL;
+	} else if (!pud_present(pude))
 		return false;
-	pvmw->pmd = pmd_offset(pud, pvmw->address);
+
+	pvmw->pmd = pmd_offset(pvmw->pud, pvmw->address);
 	/*
 	 * Make sure the pmd value isn't cached in a register by the
 	 * compiler and used as a stale value after we've observed a
diff --git a/mm/rmap.c b/mm/rmap.c
index 5683f367a792..629f8fe7ffac 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -803,9 +803,15 @@ static bool page_referenced_one(struct page *page, struct vm_area_struct *vma,
 					referenced++;
 			}
 		} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
-			if (pmdp_clear_flush_young_notify(vma, address,
-						pvmw.pmd))
-				referenced++;
+			if (pvmw.pmd) {
+				if (pmdp_clear_flush_young_notify(vma, address,
+							pvmw.pmd))
+					referenced++;
+			} else if (pvmw.pud) {
+				if (pudp_clear_flush_young_notify(vma, address,
+							pvmw.pud))
+					referenced++;
+			}
 		} else {
 			/* unexpected pmd-mapped page? */
 			WARN_ON_ONCE(1);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH v2 12/30] mm: rmap: add mapped/unmapped page order to anonymous page rmap functions.
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
                   ` (10 preceding siblings ...)
  2020-09-28 17:54 ` [RFC PATCH v2 11/30] mm: thp: handle the PUD THP reference bit Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 13/30] mm: rmap: add map_order to page_remove_anon_compound_rmap Zi Yan
                   ` (18 subsequent siblings)
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

page_add_anon_rmap, do_page_add_anon_rmap, page_add_new_anon_rmap, and
page_remove_rmap are changed to take the mapped/unmapped page order as a
parameter. This prepares for PMD-mapped PUD THP: a PUD THP can be mapped
in three different ways (PTEs, PMDs, or PUDs), so the original boolean
parameter can no longer record at which level a mapping was added or
removed.
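
For illustration, the old boolean maps onto the new parameter roughly as
below; this hypothetical shim is only valid while every compound mapping
is at the page's own order, which is still the case at this point in the
series:

static inline int rmap_map_order(struct page *page, bool compound)
{
	/* 0 means PTE-mapped; otherwise the order of the mapping */
	return compound ? compound_order(compound_head(page)) : 0;
}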

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/rmap.h    |  8 ++++----
 kernel/events/uprobes.c |  4 ++--
 mm/huge_memory.c        | 16 ++++++++--------
 mm/hugetlb.c            |  4 ++--
 mm/khugepaged.c         |  6 +++---
 mm/ksm.c                |  4 ++--
 mm/memory.c             | 16 ++++++++--------
 mm/migrate.c            | 10 +++++-----
 mm/rmap.c               | 22 +++++++++++++---------
 mm/swapfile.c           |  4 ++--
 mm/userfaultfd.c        |  2 +-
 11 files changed, 50 insertions(+), 46 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 0af61dd193d2..1244549f3eaf 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -171,13 +171,13 @@ struct anon_vma *page_get_anon_vma(struct page *page);
  */
 void page_move_anon_rmap(struct page *, struct vm_area_struct *);
 void page_add_anon_rmap(struct page *, struct vm_area_struct *,
-		unsigned long, bool);
+		unsigned long, int);
 void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
-			   unsigned long, int);
+			   unsigned long, int, int);
 void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
-		unsigned long, bool);
+		unsigned long, int);
 void page_add_file_rmap(struct page *, bool);
-void page_remove_rmap(struct page *, bool);
+void page_remove_rmap(struct page *, int);
 
 void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
 			    unsigned long);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 0e18aaf23a7b..21b85bac881d 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -183,7 +183,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	if (new_page) {
 		get_page(new_page);
-		page_add_new_anon_rmap(new_page, vma, addr, false);
+		page_add_new_anon_rmap(new_page, vma, addr, 0);
 		lru_cache_add_inactive_or_unevictable(new_page, vma);
 	} else
 		/* no new page, just dec_mm_counter for old_page */
@@ -200,7 +200,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 		set_pte_at_notify(mm, addr, pvmw.pte,
 				  mk_pte(new_page, vma->vm_page_prot));
 
-	page_remove_rmap(old_page, false);
+	page_remove_rmap(old_page, 0);
 	if (!page_mapped(old_page))
 		try_to_free_swap(old_page);
 	page_vma_mapped_walk_done(&pvmw);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 76069affebef..6716c5286494 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -618,7 +618,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
-		page_add_new_anon_rmap(page, vma, haddr, true);
+		page_add_new_anon_rmap(page, vma, haddr, HPAGE_PMD_ORDER);
 		lru_cache_add_inactive_or_unevictable(page, vma);
 		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
 		set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
@@ -991,7 +991,7 @@ static int __do_huge_pud_anonymous_page(struct vm_fault *vmf, struct page *page,
 
 		entry = mk_huge_pud(page, vma->vm_page_prot);
 		entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
-		page_add_new_anon_rmap(page, vma, haddr, true);
+		page_add_new_anon_rmap(page, vma, haddr, HPAGE_PUD_ORDER);
 		lru_cache_add_inactive_or_unevictable(page, vma);
 		pgtable_trans_huge_pud_deposit(vma->vm_mm, vmf->pud,
 				virt_to_page(pmd_pgtable));
@@ -1773,7 +1773,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 
 		if (pmd_present(orig_pmd)) {
 			page = pmd_page(orig_pmd);
-			page_remove_rmap(page, true);
+			page_remove_rmap(page, HPAGE_PMD_ORDER);
 			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
 			VM_BUG_ON_PAGE(!PageHead(page), page);
 		} else if (thp_migration_supported()) {
@@ -2059,7 +2059,7 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 
 		if (pud_present(orig_pud)) {
 			page = pud_page(orig_pud);
-			page_remove_rmap(page, true);
+			page_remove_rmap(page, HPAGE_PUD_ORDER);
 			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
 			VM_BUG_ON_PAGE(!PageHead(page), page);
 		} else
@@ -2187,7 +2187,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			set_page_dirty(page);
 		if (!PageReferenced(page) && pmd_young(_pmd))
 			SetPageReferenced(page);
-		page_remove_rmap(page, true);
+		page_remove_rmap(page, HPAGE_PMD_ORDER);
 		put_page(page);
 		add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR);
 		return;
@@ -2319,7 +2319,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 
 	if (freeze) {
 		for (i = 0; i < HPAGE_PMD_NR; i++) {
-			page_remove_rmap(page + i, false);
+			page_remove_rmap(page + i, 0);
 			put_page(page + i);
 		}
 	}
@@ -3089,7 +3089,7 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 	if (pmd_soft_dirty(pmdval))
 		pmdswp = pmd_swp_mksoft_dirty(pmdswp);
 	set_pmd_at(mm, address, pvmw->pmd, pmdswp);
-	page_remove_rmap(page, true);
+	page_remove_rmap(page, HPAGE_PMD_ORDER);
 	put_page(page);
 }
 
@@ -3115,7 +3115,7 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 
 	flush_cache_range(vma, mmun_start, mmun_start + HPAGE_PMD_SIZE);
 	if (PageAnon(new))
-		page_add_anon_rmap(new, vma, mmun_start, true);
+		page_add_anon_rmap(new, vma, mmun_start, HPAGE_PMD_ORDER);
 	else
 		page_add_file_rmap(new, true);
 	set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 61469fd3ad92..25674d7b1e5f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4007,7 +4007,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			set_page_dirty(page);
 
 		hugetlb_count_sub(pages_per_huge_page(h), mm);
-		page_remove_rmap(page, true);
+		page_remove_rmap(page, huge_page_order(h));
 
 		spin_unlock(ptl);
 		tlb_remove_page_size(tlb, page, huge_page_size(h));
@@ -4232,7 +4232,7 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		mmu_notifier_invalidate_range(mm, range.start, range.end);
 		set_huge_pte_at(mm, haddr, ptep,
 				make_huge_pte(vma, new_page, 1));
-		page_remove_rmap(old_page, true);
+		page_remove_rmap(old_page, huge_page_order(h));
 		hugepage_add_new_anon_rmap(new_page, vma, haddr);
 		set_page_huge_active(new_page);
 		/* Make the old page be freed below */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index f1d5f6dde47c..636a0f32b09e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -765,7 +765,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
 			 * superfluous.
 			 */
 			pte_clear(vma->vm_mm, address, _pte);
-			page_remove_rmap(src_page, false);
+			page_remove_rmap(src_page, 0);
 			spin_unlock(ptl);
 			free_page_and_swap_cache(src_page);
 		}
@@ -1175,7 +1175,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
-	page_add_new_anon_rmap(new_page, vma, address, true);
+	page_add_new_anon_rmap(new_page, vma, address, HPAGE_PMD_ORDER);
 	lru_cache_add_inactive_or_unevictable(new_page, vma);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, address, pmd, _pmd);
@@ -1478,7 +1478,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 		if (pte_none(*pte))
 			continue;
 		page = vm_normal_page(vma, addr, *pte);
-		page_remove_rmap(page, false);
+		page_remove_rmap(page, 0);
 	}
 
 	pte_unmap_unlock(start_pte, ptl);
diff --git a/mm/ksm.c b/mm/ksm.c
index 9afccc36dbd2..f32bdfe768b4 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1153,7 +1153,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	 */
 	if (!is_zero_pfn(page_to_pfn(kpage))) {
 		get_page(kpage);
-		page_add_anon_rmap(kpage, vma, addr, false);
+		page_add_anon_rmap(kpage, vma, addr, 0);
 		newpte = mk_pte(kpage, vma->vm_page_prot);
 	} else {
 		newpte = pte_mkspecial(pfn_pte(page_to_pfn(kpage),
@@ -1177,7 +1177,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	ptep_clear_flush(vma, addr, ptep);
 	set_pte_at_notify(mm, addr, ptep, newpte);
 
-	page_remove_rmap(page, false);
+	page_remove_rmap(page, 0);
 	if (!page_mapped(page))
 		try_to_free_swap(page);
 	put_page(page);
diff --git a/mm/memory.c b/mm/memory.c
index 05789aa4af12..37e206a7d213 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1090,7 +1090,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 					mark_page_accessed(page);
 			}
 			rss[mm_counter(page)]--;
-			page_remove_rmap(page, false);
+			page_remove_rmap(page, 0);
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
 			if (unlikely(__tlb_remove_page(tlb, page))) {
@@ -1118,7 +1118,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 			rss[mm_counter(page)]--;
-			page_remove_rmap(page, false);
+			page_remove_rmap(page, 0);
 			put_page(page);
 			continue;
 		}
@@ -2726,7 +2726,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		 * thread doing COW.
 		 */
 		ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
-		page_add_new_anon_rmap(new_page, vma, vmf->address, false);
+		page_add_new_anon_rmap(new_page, vma, vmf->address, 0);
 		lru_cache_add_inactive_or_unevictable(new_page, vma);
 		/*
 		 * We call the notify macro here because, when using secondary
@@ -2758,7 +2758,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 			 * mapcount is visible. So transitively, TLBs to
 			 * old page will be flushed before it can be reused.
 			 */
-			page_remove_rmap(old_page, false);
+			page_remove_rmap(old_page, 0);
 		}
 
 		/* Free the old page.. */
@@ -3249,10 +3249,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 
 	/* ksm created a completely new copy */
 	if (unlikely(page != swapcache && swapcache)) {
-		page_add_new_anon_rmap(page, vma, vmf->address, false);
+		page_add_new_anon_rmap(page, vma, vmf->address, 0);
 		lru_cache_add_inactive_or_unevictable(page, vma);
 	} else {
-		do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
+		do_page_add_anon_rmap(page, vma, vmf->address, exclusive, 0);
 	}
 
 	swap_free(entry);
@@ -3396,7 +3396,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	}
 
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-	page_add_new_anon_rmap(page, vma, vmf->address, false);
+	page_add_new_anon_rmap(page, vma, vmf->address, 0);
 	lru_cache_add_inactive_or_unevictable(page, vma);
 setpte:
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
@@ -3655,7 +3655,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
 	/* copy-on-write page */
 	if (write && !(vma->vm_flags & VM_SHARED)) {
 		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-		page_add_new_anon_rmap(page, vma, vmf->address, false);
+		page_add_new_anon_rmap(page, vma, vmf->address, 0);
 		lru_cache_add_inactive_or_unevictable(page, vma);
 	} else {
 		inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
diff --git a/mm/migrate.c b/mm/migrate.c
index 3ab965f83029..a7320e9d859c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -270,7 +270,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 			set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
 
 			if (PageAnon(new))
-				page_add_anon_rmap(new, vma, pvmw.address, false);
+				page_add_anon_rmap(new, vma, pvmw.address, 0);
 			else
 				page_add_file_rmap(new, false);
 		}
@@ -2194,7 +2194,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	 * new page and page_add_new_anon_rmap guarantee the copy is
 	 * visible before the pagetable update.
 	 */
-	page_add_anon_rmap(new_page, vma, start, true);
+	page_add_anon_rmap(new_page, vma, start, HPAGE_PMD_ORDER);
 	/*
 	 * At this point the pmd is numa/protnone (i.e. non present) and the TLB
 	 * has already been flushed globally.  So no TLB can be currently
@@ -2211,7 +2211,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 
 	page_ref_unfreeze(page, 2);
 	mlock_migrate_page(new_page, page);
-	page_remove_rmap(page, true);
+	page_remove_rmap(page, HPAGE_PMD_ORDER);
 	set_page_owner_migrate_reason(new_page, MR_NUMA_MISPLACED);
 
 	spin_unlock(ptl);
@@ -2455,7 +2455,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			 * drop page refcount. Page won't be freed, as we took
 			 * a reference just above.
 			 */
-			page_remove_rmap(page, false);
+			page_remove_rmap(page, 0);
 			put_page(page);
 
 			if (pte_present(pte))
@@ -2940,7 +2940,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 		goto unlock_abort;
 
 	inc_mm_counter(mm, MM_ANONPAGES);
-	page_add_new_anon_rmap(page, vma, addr, false);
+	page_add_new_anon_rmap(page, vma, addr, 0);
 	if (!is_zone_device_page(page))
 		lru_cache_add_inactive_or_unevictable(page, vma);
 	get_page(page);
diff --git a/mm/rmap.c b/mm/rmap.c
index 629f8fe7ffac..0d922e5fb38c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1100,7 +1100,7 @@ static void __page_check_anon_rmap(struct page *page,
  * @page:	the page to add the mapping to
  * @vma:	the vm area in which the mapping is added
  * @address:	the user virtual address mapped
- * @compound:	charge the page as compound or small page
+ * @map_order:	the order at which the page is mapped
  *
  * The caller needs to hold the pte lock, and the page must be locked in
  * the anon_vma case: to serialize mapping,index checking after setting,
@@ -1108,9 +1108,10 @@ static void __page_check_anon_rmap(struct page *page,
  * (but PageKsm is never downgraded to PageAnon).
  */
 void page_add_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address, bool compound)
+	struct vm_area_struct *vma, unsigned long address, int map_order)
 {
-	do_page_add_anon_rmap(page, vma, address, compound ? RMAP_COMPOUND : 0);
+	do_page_add_anon_rmap(page, vma, address,
+			      map_order > 0 ? RMAP_COMPOUND : 0, map_order);
 }
 
 /*
@@ -1119,7 +1120,8 @@ void page_add_anon_rmap(struct page *page,
  * Everybody else should continue to use page_add_anon_rmap above.
  */
 void do_page_add_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address, int flags)
+	struct vm_area_struct *vma, unsigned long address, int flags,
+	int map_order)
 {
 	bool compound = flags & RMAP_COMPOUND;
 	bool first;
@@ -1174,15 +1176,16 @@ void do_page_add_anon_rmap(struct page *page,
  * @page:	the page to add the mapping to
  * @vma:	the vm area in which the mapping is added
  * @address:	the user virtual address mapped
- * @compound:	charge the page as compound or small page
+ * @map_order:	the order at which the page is mapped
  *
  * Same as page_add_anon_rmap but must only be called on *new* pages.
  * This means the inc-and-test can be bypassed.
  * Page does not have to be locked.
  */
 void page_add_new_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address, bool compound)
+	struct vm_area_struct *vma, unsigned long address, int map_order)
 {
+	bool compound = map_order > 0;
 	int nr = compound ? thp_nr_pages(page) : 1;
 
 	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
@@ -1339,12 +1342,13 @@ static void page_remove_anon_compound_rmap(struct page *page)
 /**
  * page_remove_rmap - take down pte mapping from a page
  * @page:	page to remove mapping from
- * @compound:	uncharge the page as compound or small page
+ * @map_order:	the order at which the page was mapped
  *
  * The caller needs to hold the pte lock.
  */
-void page_remove_rmap(struct page *page, bool compound)
+void page_remove_rmap(struct page *page, int map_order)
 {
+	bool compound = map_order > 0;
 	lock_page_memcg(page);
 
 	if (!PageAnon(page)) {
@@ -1734,7 +1738,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		 *
 		 * See Documentation/vm/mmu_notifier.rst
 		 */
-		page_remove_rmap(subpage, PageHuge(page));
+		page_remove_rmap(subpage, PageHuge(page) ? compound_order(page) : 0);
 		put_page(page);
 	}
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 20012c0c0252..495ecdbd7859 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1919,9 +1919,9 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	set_pte_at(vma->vm_mm, addr, pte,
 		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	if (page == swapcache) {
-		page_add_anon_rmap(page, vma, addr, false);
+		page_add_anon_rmap(page, vma, addr, 0);
 	} else { /* ksm created a completely new copy */
-		page_add_new_anon_rmap(page, vma, addr, false);
+		page_add_new_anon_rmap(page, vma, addr, 0);
 		lru_cache_add_inactive_or_unevictable(page, vma);
 	}
 	swap_free(entry);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 9a3d451402d7..4979e64d7e47 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -122,7 +122,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 		goto out_release_uncharge_unlock;
 
 	inc_mm_counter(dst_mm, MM_ANONPAGES);
-	page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
+	page_add_new_anon_rmap(page, dst_vma, dst_addr, 0);
 	lru_cache_add_inactive_or_unevictable(page, dst_vma);
 
 	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH v2 13/30] mm: rmap: add map_order to page_remove_anon_compound_rmap.
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
                   ` (11 preceding siblings ...)
  2020-09-28 17:54 ` [RFC PATCH v2 12/30] mm: rmap: add mapped/unmapped page order to anonymous page rmap functions Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 14/30] mm: thp: add PUD THP split_huge_pud_page() function Zi Yan
                   ` (17 subsequent siblings)
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

When PMD-mapped PUD THP is enabled by the upcoming commits, we can
unmap a PMD mapping of a PUD THP, which should be counted as
NR_ANON_THPS rather than NR_ANON_THPS_PUD. The added map_order parameter
tells the two situations apart.
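
For illustration (not part of the patch), a PUD THP mapped once with a
PUD and once with PMDs must decrement different counters on unmap:

/*
 *   page_remove_anon_compound_rmap(page, HPAGE_PUD_ORDER)
 *       -> NR_ANON_THPS_PUD--
 *   page_remove_anon_compound_rmap(page, HPAGE_PMD_ORDER)
 *       -> NR_ANON_THPS--
 *
 * thp_nr_pages(page) is HPAGE_PUD_NR in both cases, so the page size
 * alone cannot tell the two apart; map_order can.
 */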

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/rmap.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 0d922e5fb38c..7fc0bf07b9bc 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1292,7 +1292,7 @@ static void page_remove_file_rmap(struct page *page, bool compound)
 		clear_page_mlock(page);
 }
 
-static void page_remove_anon_compound_rmap(struct page *page)
+static void page_remove_anon_compound_rmap(struct page *page, int map_order)
 {
 	int i, nr;
 
@@ -1306,7 +1306,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
 	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
 		return;
 
-	if (thp_nr_pages(page) == HPAGE_PMD_NR)
+	if (map_order == HPAGE_PMD_ORDER)
 		__dec_lruvec_page_state(page, NR_ANON_THPS);
 	else
 		__dec_lruvec_page_state(page, NR_ANON_THPS_PUD);
@@ -1357,7 +1357,7 @@ void page_remove_rmap(struct page *page, int map_order)
 	}
 
 	if (compound) {
-		page_remove_anon_compound_rmap(page);
+		page_remove_anon_compound_rmap(page, map_order);
 		goto out;
 	}
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH v2 14/30] mm: thp: add PUD THP split_huge_pud_page() function.
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
                   ` (12 preceding siblings ...)
  2020-09-28 17:54 ` [RFC PATCH v2 13/30] mm: rmap: add map_order to page_remove_anon_compound_rmap Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 15/30] mm: thp: add PUD THP to deferred split list when PUD mapping is gone Zi Yan
                   ` (16 subsequent siblings)
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

It mimics the PMD-level THP split. In addition, to support PMD-mapped
PUD THP, PMDPageInPUD() is added to identify the first page in each
PMD-size-aligned subpage range of a PUD THP. For example, on x86_64,
page[0], page[512], page[1024], ... are regarded as PMDPageInPUD pages.

For the mapcount of PMD-mapped PUD THPs, sub_compound_mapcount() is
added, which uses (PMDPageInPUD + 3).compound_mapcount as the mapcount,
since each base page's _mapcount is used for PTE mappings, the first
tail page's compound_mapcount is already in use, and the second tail
page's compound_mapcount overlaps with the in-use deferred_list.

PagePUDDoubleMap() is added to indicate a PUD THP mapped with both PUDs
and PMDs. PageDoubleMap() retains its original meaning, indicating a
THP mapped with both PMDs and PTEs.
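
For reference, the resulting metadata layout inside one PUD THP on
x86_64 (an illustrative summary of the above):

/*
 *   page[0], page[512], page[1024], ...   PMDPageInPUD pages
 *   page[1].compound_mapcount             PUD-level mapcount
 *   page[2] flags                         PagePUDDoubleMap bit
 *   page[512*N + 3].compound_mapcount     PMD-level mapcount of the
 *                                         N-th PMD range
 *                                         (sub_compound_mapcount())
 *   base page _mapcount                   PTE mappings, as before
 */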

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 arch/x86/include/asm/pgalloc.h |   9 +
 arch/x86/include/asm/pgtable.h |  21 ++
 include/linux/huge_mm.h        |  31 +-
 include/linux/memcontrol.h     |   5 +
 include/linux/mm.h             |  25 +-
 include/linux/page-flags.h     |  48 +++
 include/linux/pgtable.h        |  17 ++
 include/linux/rmap.h           |   1 +
 include/linux/swap.h           |   2 +
 include/linux/vm_event_item.h  |   4 +
 mm/huge_memory.c               | 525 +++++++++++++++++++++++++++++++--
 mm/memcontrol.c                |  13 +
 mm/memory.c                    |   2 +-
 mm/page_alloc.c                |  21 +-
 mm/pagewalk.c                  |   2 +-
 mm/pgtable-generic.c           |  11 +
 mm/rmap.c                      |  93 +++++-
 mm/swap.c                      |  30 ++
 mm/util.c                      |  22 +-
 mm/vmstat.c                    |   4 +
 20 files changed, 832 insertions(+), 54 deletions(-)

diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index b24284522973..f6926725c379 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -99,6 +99,15 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
 
 #define pmd_pgtable(pmd) pmd_page(pmd)
 
+static inline void pud_populate_with_pgtable(struct mm_struct *mm, pud_t *pud,
+				struct page *pte)
+{
+	unsigned long pfn = page_to_pfn(pte);
+
+	paravirt_alloc_pmd(mm, pfn);
+	set_pud(pud, __pud(((pteval_t)pfn << PAGE_SHIFT) | _PAGE_TABLE));
+}
+
 #if CONFIG_PGTABLE_LEVELS > 2
 static inline pmd_t *pmd_alloc_one_page_with_ptes(struct mm_struct *mm, unsigned long addr)
 {
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 8bf7bfd71a46..575c349e08b2 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -630,6 +630,12 @@ static inline pmd_t pmd_mkinvalid(pmd_t pmd)
 		      __pgprot(pmd_flags(pmd) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
 }
 
+static inline pud_t pud_mknotpresent(pud_t pud)
+{
+	return pfn_pud(pud_pfn(pud),
+	      __pgprot(pud_flags(pud) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
+}
+
 static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
 
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
@@ -1246,6 +1252,21 @@ static inline p4d_t *user_to_kernel_p4dp(p4d_t *p4dp)
 }
 #endif /* CONFIG_PAGE_TABLE_ISOLATION */
 
+#ifndef pudp_establish
+#define pudp_establish pudp_establish
+static inline pud_t pudp_establish(struct vm_area_struct *vma,
+		unsigned long address, pud_t *pudp, pud_t pud)
+{
+	if (IS_ENABLED(CONFIG_SMP)) {
+		return xchg(pudp, pud);
+	} else {
+		pud_t old = *pudp;
+		*pudp = pud;
+		return old;
+	}
+}
+#endif
+
 /*
  * clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
  *
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7528652400e4..e5c68e680907 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -222,17 +222,27 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
 		bool freeze, struct page *page);
 
+bool can_split_huge_pud_page(struct page *page, int *pextra_pins);
+int split_huge_pud_page_to_list(struct page *page, struct list_head *list);
+static inline int split_huge_pud_page(struct page *page)
+{
+	return split_huge_pud_page_to_list(page, NULL);
+}
 void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
-		unsigned long address);
+		unsigned long address, bool freeze, struct page *page);
 
 #define split_huge_pud(__vma, __pud, __address)				\
 	do {								\
 		pud_t *____pud = (__pud);				\
 		if (pud_trans_huge(*____pud)				\
 					|| pud_devmap(*____pud))	\
-			__split_huge_pud(__vma, __pud, __address);	\
+			__split_huge_pud(__vma, __pud, __address,	\
+						false, NULL);		\
 	}  while (0)
 
+void split_huge_pud_address(struct vm_area_struct *vma, unsigned long address,
+		bool freeze, struct page *page);
+
 extern int hugepage_madvise(struct vm_area_struct *vma,
 			    unsigned long *vm_flags, int advice);
 extern void vma_adjust_trans_huge(struct vm_area_struct *vma,
@@ -422,8 +432,25 @@ static inline void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 static inline void split_huge_pmd_address(struct vm_area_struct *vma,
 		unsigned long address, bool freeze, struct page *page) {}
 
+static inline bool
+can_split_huge_pud_page(struct page *page, int *pextra_pins)
+{
+	BUILD_BUG();
+	return false;
+}
+static inline int
+split_huge_pud_page_to_list(struct page *page, struct list_head *list)
+{
+	return 0;
+}
+static inline int split_huge_pud_page(struct page *page)
+{
+	return 0;
+}
 #define split_huge_pud(__vma, __pmd, __address)	\
 	do { } while (0)
+static inline void split_huge_pud_address(struct vm_area_struct *vma,
+		unsigned long address, bool freeze, struct page *page) {}
 
 static inline int hugepage_madvise(struct vm_area_struct *vma,
 				   unsigned long *vm_flags, int advice)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e391e3c56de5..a7622510d43d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -932,6 +932,7 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void mem_cgroup_split_huge_fixup(struct page *head);
+void mem_cgroup_split_huge_pud_fixup(struct page *head);
 #endif
 
 #else /* CONFIG_MEMCG */
@@ -1264,6 +1265,10 @@ static inline void mem_cgroup_split_huge_fixup(struct page *head)
 {
 }
 
+static inline void mem_cgroup_split_huge_pud_fixup(struct page *head)
+{
+}
+
 static inline void count_memcg_events(struct mem_cgroup *memcg,
 				      enum vm_event_item idx,
 				      unsigned long count)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8f54f06c8eb6..51b75ffa6a6c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -801,6 +801,24 @@ static inline int compound_mapcount(struct page *page)
 	return head_compound_mapcount(page);
 }
 
+static inline unsigned int compound_order(struct page *page);
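+/*
+ * For a PUD THP, the PMD-level compound mapcount of each PMD-sized subpage
+ * group is kept in page[2 + sub_level] relative to the first subpage of the
+ * group; only sub_level == 1 is used for now.
+ */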
+static inline atomic_t *sub_compound_mapcount_ptr(struct page *page, int sub_level)
+{
+	struct page *head = compound_head(page);
+
+	VM_BUG_ON_PAGE(!PageCompound(page), page);
+	VM_BUG_ON_PAGE(compound_order(head) != HPAGE_PUD_ORDER, page);
+	VM_BUG_ON_PAGE((page - head) % HPAGE_PMD_NR, page);
+	VM_BUG_ON_PAGE(sub_level != 1, page);
+	return &page[2 + sub_level].compound_mapcount;
+}
+
+/* Only works for PUD pages */
+static inline int sub_compound_mapcount(struct page *page)
+{
+	return atomic_read(sub_compound_mapcount_ptr(page, 1)) + 1;
+}
+
 /*
  * The atomic page->_mapcount, starts from -1: so that transitions
  * both from it and to it can be tracked, using atomic_inc_and_test
@@ -893,13 +911,6 @@ static inline void destroy_compound_page(struct page *page)
 	compound_page_dtors[page[1].compound_dtor](page);
 }
 
-static inline unsigned int compound_order(struct page *page)
-{
-	if (!PageHead(page))
-		return 0;
-	return page[1].compound_order;
-}
-
 static inline bool hpage_pincount_available(struct page *page)
 {
 	/*
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index fbbb841a9346..f1bfb02622cf 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -235,6 +235,9 @@ static inline void page_init_poison(struct page *page, size_t size)
  *
  * PF_SECOND:
  *     the page flag is stored in the first tail page.
+ *
+ * PF_THIRD:
+ *     the page flag is stored in the second tail page.
  */
 #define PF_POISONED_CHECK(page) ({					\
 		VM_BUG_ON_PGFLAGS(PagePoisoned(page), page);		\
@@ -253,6 +256,9 @@ static inline void page_init_poison(struct page *page, size_t size)
 #define PF_SECOND(page, enforce) ({					\
 		VM_BUG_ON_PGFLAGS(!PageHead(page), page);		\
 		PF_POISONED_CHECK(&page[1]); })
+#define PF_THIRD(page, enforce) ({					\
+		VM_BUG_ON_PGFLAGS(!PageHead(page), page);		\
+		PF_POISONED_CHECK(&page[2]); })
 
 /*
  * Macros to create function definitions for page flags
@@ -674,6 +680,30 @@ static inline int PageTransTail(struct page *page)
 	return PageTail(page);
 }
 
+#define HPAGE_PMD_SHIFT PMD_SHIFT
+#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
+#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
+
+#define HPAGE_PUD_SHIFT PUD_SHIFT
+#define HPAGE_PUD_ORDER (HPAGE_PUD_SHIFT-PAGE_SHIFT)
+#define HPAGE_PUD_NR (1<<HPAGE_PUD_ORDER)
+
+static inline unsigned int compound_order(struct page *page)
+{
+	if (!PageHead(page))
+		return 0;
+	return page[1].compound_order;
+}
+
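+/* True if @page is the first subpage of a PMD-sized group inside a PUD THP. */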
+static inline int PMDPageInPUD(struct page *page)
+{
+	struct page *head = compound_head(page);
+
+	return (PageCompound(page) && compound_order(head) == HPAGE_PUD_ORDER &&
+		((page - head) % HPAGE_PMD_NR == 0));
+}
+
 /*
  * PageDoubleMap indicates that the compound page is mapped with PTEs as well
  * as PMDs.
@@ -689,13 +719,31 @@ static inline int PageTransTail(struct page *page)
  */
 PAGEFLAG(DoubleMap, double_map, PF_SECOND)
 	TESTSCFLAG(DoubleMap, double_map, PF_SECOND)
+/*
+ * PagePUDDoubleMap indicates that the compound page is mapped with PMDs as well
+ * as PUDs.
+ *
+ * This is required for optimization of rmap operations for THP: we can postpone
+ * per small page mapcount accounting (and its overhead from atomic operations)
+ * until the first PUD split.
+ *
+ * For the page PagePUDDoubleMap means ->_mapcount in all sub-PMD pages is
+ * offset up by one. This reference will go away with last sub_compound_mapcount.
+ *
+ * See also __split_huge_pud_locked() and page_remove_anon_compound_rmap().
+ */
+PAGEFLAG(PUDDoubleMap, double_map, PF_THIRD)
+	TESTSCFLAG(PUDDoubleMap, double_map, PF_THIRD)
 #else
 TESTPAGEFLAG_FALSE(TransHuge)
 TESTPAGEFLAG_FALSE(TransCompound)
 TESTPAGEFLAG_FALSE(TransCompoundMap)
 TESTPAGEFLAG_FALSE(TransTail)
+TESTPAGEFLAG_FALSE(PMDPageInPUD)
 PAGEFLAG_FALSE(DoubleMap)
 	TESTSCFLAG_FALSE(DoubleMap)
+PAGEFLAG_FALSE(PUDDoubleMap)
+	TESTSETFLAG_FALSE(PUDDoubleMap)
 #endif
 
 /*
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index bb163504fb01..02279a97e170 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -508,6 +508,11 @@ extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 			    pmd_t *pmdp);
 #endif
 
+#ifndef __HAVE_ARCH_PUDP_INVALIDATE
+extern pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+			    pud_t *pudp);
+#endif
+
 #ifndef __HAVE_ARCH_PTE_SAME
 static inline int pte_same(pte_t pte_a, pte_t pte_b)
 {
@@ -1161,6 +1166,18 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 }
 #endif
 
+#ifndef pud_read_atomic
+static inline pud_t pud_read_atomic(pud_t *pudp)
+{
+	/*
+	 * Depend on compiler for an atomic pud read. NOTE: this is
+	 * only going to work if the pudval_t isn't larger than
+	 * an unsigned long.
+	 */
+	return *pudp;
+}
+#endif
+
 #ifndef arch_needs_pgtable_deposit
 #define arch_needs_pgtable_deposit() (false)
 #endif
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 1244549f3eaf..0680b9fff2b3 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -99,6 +99,7 @@ enum ttu_flags {
 	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
 					 * caller holds it */
 	TTU_SPLIT_FREEZE	= 0x100,		/* freeze pte under splitting thp */
+	TTU_SPLIT_HUGE_PUD	= 0x200,		/* split huge PUD if any */
 };
 
 #ifdef CONFIG_MMU
diff --git a/include/linux/swap.h b/include/linux/swap.h
index f32804e2fad5..dee400a56e84 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -340,6 +340,8 @@ extern void lru_note_cost_page(struct page *);
 extern void lru_cache_add(struct page *);
 extern void lru_add_page_tail(struct page *page, struct page *page_tail,
 			 struct lruvec *lruvec, struct list_head *head);
+extern void lru_add_pud_page_tail(struct page *page, struct page *page_tail,
+			 struct lruvec *lruvec, struct list_head *head);
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 416d9966fa3f..cf2b5632b96c 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -97,6 +97,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_FAULT_FALLBACK_PUD,
 		THP_FAULT_FALLBACK_PUD_CHARGE,
 		THP_SPLIT_PUD,
+		THP_SPLIT_PUD_PAGE,
+		THP_SPLIT_PUD_PAGE_FAILED,
+		THP_ZERO_PUD_PAGE_ALLOC,
+		THP_ZERO_PUD_PAGE_ALLOC_FAILED,
 #endif
 		THP_ZERO_PAGE_ALLOC,
 		THP_ZERO_PAGE_ALLOC_FAILED,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6716c5286494..4a899e856088 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1775,7 +1775,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			page = pmd_page(orig_pmd);
 			page_remove_rmap(page, HPAGE_PMD_ORDER);
 			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
-			VM_BUG_ON_PAGE(!PageHead(page), page);
+			VM_BUG_ON_PAGE(!PageHead(page) && !PMDPageInPUD(page), page);
 		} else if (thp_migration_supported()) {
 			swp_entry_t entry;
 
@@ -2082,8 +2082,16 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 }
 
 static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
-		unsigned long haddr)
+		unsigned long haddr, bool freeze)
 {
+	struct mm_struct *mm = vma->vm_mm;
+	struct page *page;
+	pgtable_t pgtable;
+	pud_t _pud, old_pud;
+	bool young, write, dirty, soft_dirty;
+	unsigned long addr;
+	int i;
+
 	VM_BUG_ON(haddr & ~HPAGE_PUD_MASK);
 	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
 	VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PUD_SIZE, vma);
@@ -2091,23 +2099,141 @@ static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
 
 	count_vm_event(THP_SPLIT_PUD);
 
-	pudp_huge_clear_flush_notify(vma, haddr, pud);
+	if (!vma_is_anonymous(vma)) {
+		_pud = pudp_huge_clear_flush_notify(vma, haddr, pud);
+		/*
+		 * We are going to unmap this huge page. So
+		 * just go ahead and zap it
+		 */
+		if (arch_needs_pgtable_deposit())
+			zap_pud_deposited_table(mm, pud);
+		if (vma_is_dax(vma))
+			return;
+		page = pud_page(_pud);
+		if (!PageReferenced(page) && pud_young(_pud))
+			SetPageReferenced(page);
+		page_remove_rmap(page, HPAGE_PUD_ORDER);
+		put_page(page);
+		add_mm_counter(mm, MM_FILEPAGES, -HPAGE_PUD_NR);
+		return;
+	}
+
+	/* See the comment above pmdp_invalidate() in __split_huge_pmd_locked() */
+	old_pud = pudp_invalidate(vma, haddr, pud);
+
+	page = pud_page(old_pud);
+	VM_BUG_ON_PAGE(!page_count(page), page);
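+	/* One reference per new PMD mapping; the PUD mapping already held one. */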
+	page_ref_add(page, (1<<(HPAGE_PUD_ORDER-HPAGE_PMD_ORDER)) - 1);
+	if (pud_dirty(old_pud))
+		SetPageDirty(page);
+	write = pud_write(old_pud);
+	young = pud_young(old_pud);
+	dirty = pud_dirty(old_pud);
+	soft_dirty = pud_soft_dirty(old_pud);
+
+	pgtable = pgtable_trans_huge_pud_withdraw(mm, pud);
+	pud_populate_with_pgtable(mm, &_pud, pgtable);
+
+	for (i = 0, addr = haddr; i < HPAGE_PUD_NR;
+		 i += HPAGE_PMD_NR, addr += PMD_SIZE) {
+		pmd_t entry, *pmd;
+		/*
+		 * Note that NUMA hinting access restrictions are not
+		 * transferred to avoid any possibility of altering
+		 * permissions across VMAs.
+		 */
+		if (freeze) {
+			swp_entry_t swp_entry;
+
+			swp_entry = make_migration_entry(page + i, write);
+			entry = swp_entry_to_pmd(swp_entry);
+			if (soft_dirty)
+				entry = pmd_swp_mksoft_dirty(entry);
+		} else {
+			entry = mk_huge_pmd(page + i, READ_ONCE(vma->vm_page_prot));
+			entry = maybe_pmd_mkwrite(entry, vma);
+			if (!write)
+				entry = pmd_wrprotect(entry);
+			if (!young)
+				entry = pmd_mkold(entry);
+			if (soft_dirty)
+				entry = pmd_mksoft_dirty(entry);
+		}
+		pmd = pmd_offset(&_pud, addr);
+		VM_BUG_ON(!pmd_none(*pmd));
+		set_pmd_at(mm, addr, pmd, entry);
+		/* distinguish between pud compound_mapcount and pmd compound_mapcount */
+		if (atomic_inc_and_test(sub_compound_mapcount_ptr(&page[i], 1))) {
+			/* first pmd-mapped pud page */
+			lock_page_memcg(page);
+			__inc_lruvec_page_state(page, NR_ANON_THPS);
+			unlock_page_memcg(page);
+		}
+	}
+
+	/*
+	 * Set PagePUDDoubleMap before dropping compound_mapcount to avoid
+	 * false-negative page_mapped().
+	 */
+	if (compound_mapcount(page) > 1 && !TestSetPagePUDDoubleMap(page)) {
+		/* distinguish between pud compound_mapcount and pmd compound_mapcount */
+		for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR)
+			atomic_inc(sub_compound_mapcount_ptr(&page[i], 1));
+	}
+
+	lock_page_memcg(page);
+	if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
+		/* Last compound_mapcount is gone. */
+		__dec_lruvec_page_state(page, NR_ANON_THPS_PUD);
+		if (TestClearPagePUDDoubleMap(page)) {
+			/* No need in mapcount reference anymore */
+			/* distinguish between pud compound_mapcount and pmd compound_mapcount */
+			for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR)
+				atomic_dec(sub_compound_mapcount_ptr(&page[i], 1));
+		}
+	}
+	unlock_page_memcg(page);
+
+	smp_wmb(); /* make pte visible before pmd */
+	pud_populate_with_pgtable(mm, pud, pgtable);
+
+	if (freeze) {
+		for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR) {
+			page_remove_rmap(page + i, HPAGE_PMD_ORDER);
+			put_page(page + i);
+		}
+	}
 }
 
 void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
-		unsigned long address)
+		unsigned long address, bool freeze, struct page *page)
 {
 	spinlock_t *ptl;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long haddr = address & HPAGE_PUD_MASK;
 	struct mmu_notifier_range range;
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
 				address & HPAGE_PUD_MASK,
 				(address & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
-	ptl = pud_lock(vma->vm_mm, pud);
-	if (unlikely(!pud_trans_huge(*pud) && !pud_devmap(*pud)))
+	ptl = pud_lock(mm, pud);
+
+	/*
+	 * If the caller asks to set up migration entries, we need a page to
+	 * check the pud against. Otherwise we can end up replacing the wrong
+	 * page.
+	 */
+	VM_BUG_ON(freeze && !page);
+	if (page && page != pud_page(*pud))
+		goto out;
+
+	if (pud_trans_huge(*pud)) {
+		page = pud_page(*pud);
+		if (PageMlocked(page))
+			clear_page_mlock(page);
+	} else if (unlikely(!pud_devmap(*pud)))
 		goto out;
-	__split_huge_pud_locked(vma, pud, range.start);
+	__split_huge_pud_locked(vma, pud, haddr, freeze);
 
 out:
 	spin_unlock(ptl);
@@ -2117,6 +2243,280 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 	 */
 	mmu_notifier_invalidate_range_only_end(&range);
 }
+
+void split_huge_pud_address(struct vm_area_struct *vma, unsigned long address,
+		bool freeze, struct page *page)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+
+	pgd = pgd_offset(vma->vm_mm, address);
+	if (!pgd_present(*pgd))
+		return;
+
+	p4d = p4d_offset(pgd, address);
+	if (!p4d_present(*p4d))
+		return;
+
+	pud = pud_offset(p4d, address);
+
+	__split_huge_pud(vma, pud, address, freeze, page);
+}
+
+static void unmap_pud_page(struct page *page)
+{
+	enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
+		TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PUD;
+	bool unmap_success;
+
+	VM_BUG_ON_PAGE(!PageHead(page), page);
+
+	if (PageAnon(page))
+		ttu_flags |= TTU_SPLIT_FREEZE;
+
+	unmap_success = try_to_unmap(page, ttu_flags);
+	VM_BUG_ON_PAGE(!unmap_success, page);
+}
+
+static void remap_pud_page(struct page *page)
+{
+	int i;
+
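+	/*
+	 * @page is PUD-order when remapping after a failed split, or the
+	 * PMD-order head after __split_huge_pud_page(), in which case each
+	 * former subpage group is remapped separately.
+	 */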
+	VM_BUG_ON(!PageTransHuge(page));
+	if (compound_order(page) == HPAGE_PUD_ORDER) {
+		remove_migration_ptes(page, page, true);
+	} else if (compound_order(page) == HPAGE_PMD_ORDER) {
+		for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR)
+			remove_migration_ptes(page + i, page + i, true);
+	} else
+		VM_BUG_ON_PAGE(1, page);
+}
+
+static void __split_huge_pud_page_tail(struct page *head, int tail,
+		struct lruvec *lruvec, struct list_head *list)
+{
+	struct page *page_tail = head + tail;
+
+	VM_BUG_ON_PAGE(page_ref_count(page_tail) != 0, page_tail);
+
+	/*
+	 * Clone page flags before unfreezing refcount.
+	 *
+	 * After successful get_page_unless_zero() might follow flags change,
+	 * for example lock_page() which set PG_waiters.
+	 */
+
+	page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+	page_tail->flags |= (head->flags &
+			((1L << PG_referenced) |
+			 (1L << PG_swapbacked) |
+			 (1L << PG_swapcache) |
+			 (1L << PG_mlocked) |
+			 (1L << PG_uptodate) |
+			 (1L << PG_active) |
+			 (1L << PG_locked) |
+			 (1L << PG_unevictable) |
+			 (1L << PG_dirty) |
+			 /* preserve THP */
+			 (1L << PG_head)));
+
+	/* ->mapping in first tail page is compound_mapcount */
+	VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
+			page_tail);
+	page_tail->mapping = head->mapping;
+	page_tail->index = head->index + tail;
+
+	/* Page flags also must be visible before we make the page PMD-compound. */
+	smp_wmb();
+
+	clear_compound_head(page_tail);
+	prep_compound_page(page_tail, HPAGE_PMD_ORDER);
+	prep_transhuge_page(page_tail);
+
+	/* Finally unfreeze refcount. Additional reference from page cache. */
+	page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) ||
+					  PageSwapCache(head)));
+
+	if (page_is_young(head))
+		set_page_young(page_tail);
+	if (page_is_idle(head))
+		set_page_idle(page_tail);
+
+	page_cpupid_xchg_last(page_tail, page_cpupid_last(head));
+	lru_add_pud_page_tail(head, page_tail, lruvec, list);
+}
+
+static void __split_huge_pud_page(struct page *page, struct list_head *list,
+		unsigned long flags)
+{
+	struct page *head = compound_head(page);
+	pg_data_t *pgdat = page_pgdat(head);
+	struct lruvec *lruvec;
+	int i;
+
+	lruvec = mem_cgroup_page_lruvec(head, pgdat);
+
+	/* complete memcg work before adding pages to the LRU */
+	mem_cgroup_split_huge_pud_fixup(head);
+
+	/* no file-backed page support yet */
+	VM_BUG_ON(!PageAnon(page));
+
+	for (i = HPAGE_PUD_NR - HPAGE_PMD_NR; i >= 1; i -= HPAGE_PMD_NR)
+		__split_huge_pud_page_tail(head, i, lruvec, list);
+
+	/* reset head page order */
+	prep_compound_page(head, HPAGE_PMD_ORDER);
+	prep_transhuge_page(head);
+
+	page_ref_inc(head);
+
+	spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+
+	remap_pud_page(head);
+
+	for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR) {
+		struct page *subpage = head + i;
+
+		if (subpage == page)
+			continue;
+		unlock_page(subpage);
+
+		/*
+		 * Subpages may be freed if there wasn't any mapping
+		 * like if add_to_swap() is running on a lru page that
+		 * had its mapping zapped. And freeing these pages
+		 * requires taking the lru_lock so we do the put_page
+		 * of the tail pages after the split is complete.
+		 */
+		put_page(subpage);
+	}
+}
+
+/* Racy check whether the huge page can be split */
+bool can_split_huge_pud_page(struct page *page, int *pextra_pins)
+{
+	int extra_pins;
+
+	VM_BUG_ON(!PageAnon(page));
+
+	extra_pins = PageSwapCache(page) ? HPAGE_PUD_NR : 0;
+
+	if (pextra_pins)
+		*pextra_pins = extra_pins;
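+	/* The "- 1" is the caller's own pin; any other extra pin blocks the split. */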
+	return total_mapcount(page) == page_count(page) - extra_pins - 1;
+}
+
+/*
+ * This function splits a PUD huge page into PMD huge pages. @page can point
+ * to any subpage of the huge page to split. The split does not change the
+ * position of @page.
+ *
+ * The caller must hold a pin on @page (and only that pin), otherwise the
+ * split fails with -EBUSY. The huge page must be locked.
+ *
+ * If @list is null, tail pages will be added to the LRU list, otherwise to
+ * @list.
+ *
+ * Both the head page and the tail pages will inherit mapping, flags, and so
+ * on from the huge page.
+ *
+ * The GUP pin and PG_locked are transferred to @page. The rest of the
+ * subpages can be freed if they are not mapped.
+ *
+ * Returns 0 if the huge page is split successfully.
+ * Returns -EBUSY if the page is pinned or if the anon_vma disappeared from
+ * under us.
+ */
+int split_huge_pud_page_to_list(struct page *page, struct list_head *list)
+{
+	struct page *head = compound_head(page);
+	struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
+	struct deferred_split *ds_queue = get_deferred_split_queue(head);
+	struct anon_vma *anon_vma = NULL;
+	struct address_space *mapping = NULL;
+	int count, mapcount, extra_pins, ret;
+	bool mlocked;
+	unsigned long flags;
+
+	VM_BUG_ON_PAGE(is_huge_zero_page(page), page);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(!PageCompound(page), page);
+	VM_BUG_ON_PAGE(!PageAnon(page), page);
+
+	if (PageWriteback(page))
+		return -EBUSY;
+
+	/*
+	 * The caller does not necessarily hold an mmap_sem that would
+	 * prevent the anon_vma disappearing, so we first take a
+	 * reference to it and then lock the anon_vma for write. This
+	 * is similar to page_lock_anon_vma_read except the write lock
+	 * is taken to serialise against parallel split or collapse
+	 * operations.
+	 */
+	anon_vma = page_get_anon_vma(head);
+	if (!anon_vma) {
+		ret = -EBUSY;
+		goto out;
+	}
+	mapping = NULL;
+	anon_vma_lock_write(anon_vma);
+	/*
+	 * Racy check if we can split the page, before unmap_pud_page() will
+	 * split PUDs
+	 */
+	if (!can_split_huge_pud_page(head, &extra_pins)) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
+	mlocked = PageMlocked(page);
+	unmap_pud_page(head);
+	VM_BUG_ON_PAGE(compound_mapcount(head), head);
+
+	/* Make sure the page is not on per-CPU pagevec as it takes pin */
+	if (mlocked)
+		lru_add_drain();
+
+	/* prevent PageLRU to go away from under us, and freeze lru stats */
+	spin_lock_irqsave(&pgdata->lru_lock, flags);
+
+	/* Prevent deferred_split_scan() touching ->_refcount */
+	spin_lock(&ds_queue->split_queue_lock);
+	count = page_count(head);
+	mapcount = total_mapcount(head);
+	if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
+		if (!list_empty(page_deferred_list(head))) {
+			ds_queue->split_queue_len--;
+			list_del(page_deferred_list(head));
+		}
+		if (mapping)
+			__dec_node_page_state(page, NR_SHMEM_THPS);
+		spin_unlock(&ds_queue->split_queue_lock);
+		__split_huge_pud_page(page, list, flags);
+		ret = 0;
+	} else {
+		if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
+			pr_alert("total_mapcount: %u, page_count(): %u\n",
+					mapcount, count);
+			if (PageTail(page))
+				dump_page(head, NULL);
+			dump_page(page, "total_mapcount(head) > 0");
+		}
+		spin_unlock(&ds_queue->split_queue_lock);
+		spin_unlock_irqrestore(&pgdata->lru_lock, flags);
+		remap_pud_page(head);
+		ret = -EBUSY;
+	}
+
+out_unlock:
+	if (anon_vma) {
+		anon_vma_unlock_write(anon_vma);
+		put_anon_vma(anon_vma);
+	}
+out:
+	count_vm_event(!ret ? THP_SPLIT_PUD_PAGE : THP_SPLIT_PUD_PAGE_FAILED);
+	return ret;
+}
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
 static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
@@ -2157,7 +2557,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long haddr, bool freeze)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	struct page *page;
+	struct page *page, *head;
 	pgtable_t pgtable;
 	pmd_t old_pmd, _pmd;
 	bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
@@ -2246,7 +2646,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		uffd_wp = pmd_uffd_wp(old_pmd);
 	}
 	VM_BUG_ON_PAGE(!page_count(page), page);
-	page_ref_add(page, HPAGE_PMD_NR - 1);
+	head = compound_head(page);
+	page_ref_add(head, HPAGE_PMD_NR - 1);
 
 	/*
 	 * Withdraw the table only after we mark the pmd entry invalid.
@@ -2294,15 +2695,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		/*
 		 * Set PG_double_map before dropping compound_mapcount to avoid
 		 * false-negative page_mapped().
+		 * Don't set it if the PUD page is mapped at PUD level, since
+		 * page_mapped() is true in that case.
 		 */
-		if (compound_mapcount(page) > 1 &&
-		    !TestSetPageDoubleMap(page)) {
+		if (((PMDPageInPUD(page) &&
+			sub_compound_mapcount(page) >
+				(1 + PagePUDDoubleMap(compound_head(page)))) ||
+		    (!PMDPageInPUD(page) &&
+			compound_mapcount(page) > 1))
+			&& !TestSetPageDoubleMap(page)) {
 			for (i = 0; i < HPAGE_PMD_NR; i++)
 				atomic_inc(&page[i]._mapcount);
 		}
 
 		lock_page_memcg(page);
-		if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
+
+		if ((PMDPageInPUD(page) &&
+			atomic_add_negative(-1, sub_compound_mapcount_ptr(page, 1))) ||
+		    (!PMDPageInPUD(page) &&
+			atomic_add_negative(-1, compound_mapcount_ptr(page)))) {
 			/* Last compound_mapcount is gone. */
 			__dec_lruvec_page_state(page, NR_ANON_THPS);
 			if (TestClearPageDoubleMap(page)) {
@@ -2430,6 +2841,11 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
 	 * previously contain an hugepage: check if we need to split
 	 * an huge pmd.
 	 */
+	if (start & ~HPAGE_PUD_MASK &&
+	    (start & HPAGE_PUD_MASK) >= vma->vm_start &&
+	    (start & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE <= vma->vm_end)
+		split_huge_pud_address(vma, start, false, NULL);
+
 	if (start & ~HPAGE_PMD_MASK &&
 	    (start & HPAGE_PMD_MASK) >= vma->vm_start &&
 	    (start & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= vma->vm_end)
@@ -2440,6 +2856,11 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
 	 * previously contain an hugepage: check if we need to split
 	 * an huge pmd.
 	 */
+	if (end & ~HPAGE_PUD_MASK &&
+	    (end & HPAGE_PUD_MASK) >= vma->vm_start &&
+	    (end & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE <= vma->vm_end)
+		split_huge_pud_address(vma, end, false, NULL);
+
 	if (end & ~HPAGE_PMD_MASK &&
 	    (end & HPAGE_PMD_MASK) >= vma->vm_start &&
 	    (end & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= vma->vm_end)
@@ -2454,6 +2875,11 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
 		struct vm_area_struct *next = vma->vm_next;
 		unsigned long nstart = next->vm_start;
 		nstart += adjust_next;
+		if (nstart & ~HPAGE_PUD_MASK &&
+		    (nstart & HPAGE_PUD_MASK) >= next->vm_start &&
+		    (nstart & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE <= next->vm_end)
+			split_huge_pud_address(next, nstart, false, NULL);
+
 		if (nstart & ~HPAGE_PMD_MASK &&
 		    (nstart & HPAGE_PMD_MASK) >= next->vm_start &&
 		    (nstart & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= next->vm_end)
@@ -2645,12 +3071,23 @@ int total_mapcount(struct page *page)
 	if (PageHuge(page))
 		return compound;
 	ret = compound;
-	for (i = 0; i < nr; i++)
-		ret += atomic_read(&page[i]._mapcount) + 1;
+	/* For PMD, read every base page; for PUD, also read each sub_compound_mapcount() */
+	if (compound_order(page) == HPAGE_PMD_ORDER) {
+		for (i = 0; i < nr; i++)
+			ret += atomic_read(&page[i]._mapcount) + 1;
+	} else if (compound_order(page) == HPAGE_PUD_ORDER) {
+		for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR)
+			ret += sub_compound_mapcount(&page[i]);
+		for (i = 0; i < nr; i++)
+			ret += atomic_read(&page[i]._mapcount) + 1;
+		/* both PUD and PMD have HPAGE_PMD_NR sub pages */
+		nr = HPAGE_PMD_NR;
+	} else
+		VM_BUG_ON_PAGE(1, page);
 	/* File pages has compound_mapcount included in _mapcount */
 	if (!PageAnon(page))
 		return ret - compound * nr;
-	if (PageDoubleMap(page))
+	if (PagePUDDoubleMap(page) || PageDoubleMap(page))
 		ret -= nr;
 	return ret;
 }
@@ -2681,7 +3118,7 @@ int total_mapcount(struct page *page)
  */
 int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
 {
-	int i, ret, _total_mapcount, mapcount;
+	int i, ret, _total_mapcount, mapcount, nr;
 
 	/* hugetlbfs shouldn't call it */
 	VM_BUG_ON_PAGE(PageHuge(page), page);
@@ -2696,14 +3133,41 @@ int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
 	page = compound_head(page);
 
 	_total_mapcount = ret = 0;
-	for (i = 0; i < thp_nr_pages(page); i++) {
-		mapcount = atomic_read(&page[i]._mapcount) + 1;
-		ret = max(ret, mapcount);
-		_total_mapcount += mapcount;
-	}
-	if (PageDoubleMap(page)) {
+	nr = thp_nr_pages(page);
+	/* For PMD, read every base page; for PUD, also read each sub_compound_mapcount() */
+	if (compound_order(page) == HPAGE_PMD_ORDER) {
+		for (i = 0; i < nr; i++) {
+			mapcount = atomic_read(&page[i]._mapcount) + 1;
+			ret = max(ret, mapcount);
+			_total_mapcount += mapcount;
+		}
+	} else if (compound_order(page) == HPAGE_PUD_ORDER) {
+		for (i = 0; i < nr; i += HPAGE_PMD_NR) {
+			int j;
+
+			mapcount = sub_compound_mapcount(&page[i]);
+			ret = max(ret, mapcount);
+			_total_mapcount += mapcount;
+
+			/* the page may additionally be mapped at base page size */
+			for (j = 0; j < HPAGE_PMD_NR; j++) {
+				mapcount = atomic_read(&page[i + j]._mapcount) + 1;
+				ret = max(ret, mapcount);
+				_total_mapcount += mapcount;
+			}
+
+			if (PageDoubleMap(&page[i])) {
+				ret -= 1;
+				_total_mapcount -= HPAGE_PMD_NR;
+			}
+		}
+		/* both PUD and PMD have HPAGE_PMD_NR sub pages */
+		nr = HPAGE_PMD_NR;
+	} else
+		VM_BUG_ON_PAGE(1, page);
+	if (PageDoubleMap(page) || PagePUDDoubleMap(page)) {
 		ret -= 1;
-		_total_mapcount -= thp_nr_pages(page);
+		_total_mapcount -= nr;
 	}
 	mapcount = compound_mapcount(page);
 	ret += mapcount;
@@ -2948,6 +3412,9 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
 	return READ_ONCE(ds_queue->split_queue_len);
 }
 
+#define deferred_list_entry(x) (compound_head(list_entry((void *)x, \
+					struct page, mapping)))
+
 static unsigned long deferred_split_scan(struct shrinker *shrink,
 		struct shrink_control *sc)
 {
@@ -2981,12 +3448,18 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 
 	list_for_each_safe(pos, next, &list) {
-		page = list_entry((void *)pos, struct page, mapping);
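+		/* the deferred-list node lives in a tail page; map it back to its head */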
+		page = deferred_list_entry(pos);
 		if (!trylock_page(page))
 			goto next;
 		/* split_huge_page() removes page from list on success */
-		if (!split_huge_page(page))
-			split++;
+		if (compound_order(page) == HPAGE_PUD_ORDER) {
+			if (!split_huge_pud_page(page))
+				split++;
+		} else if (compound_order(page) == HPAGE_PMD_ORDER) {
+			if (!split_huge_page(page))
+				split++;
+		} else
+			VM_BUG_ON_PAGE(1, page);
 		unlock_page(page);
 next:
 		put_page(page);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b28f620c1c5b..ed75ef95b24a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3281,6 +3281,19 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 		head[i].mem_cgroup = memcg;
 	}
 }
+
+void mem_cgroup_split_huge_pud_fixup(struct page *head)
+{
+	int i;
+
+	if (mem_cgroup_disabled())
+		return;
+
+	for (i = HPAGE_PMD_NR; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR)
+		head[i].mem_cgroup = head->mem_cgroup;
+
+	/*__mod_memcg_state(head->mem_cgroup, MEMCG_RSS_HUGE, -HPAGE_PUD_NR);*/
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #ifdef CONFIG_MEMCG_SWAP
diff --git a/mm/memory.c b/mm/memory.c
index 37e206a7d213..e0e0459c0caf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4133,7 +4133,7 @@ static vm_fault_t create_huge_pud(struct vm_fault *vmf)
 	}
 split:
 	/* COW or write-notify not handled on PUD level: split pud.*/
-	__split_huge_pud(vmf->vma, vmf->pud, vmf->address);
+	__split_huge_pud(vmf->vma, vmf->pud, vmf->address, false, NULL);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 	return VM_FAULT_FALLBACK;
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 29abeff09fcc..6bdb38a8fb48 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -679,6 +679,9 @@ void prep_compound_page(struct page *page, unsigned int order)
 	atomic_set(compound_mapcount_ptr(page), -1);
 	if (hpage_pincount_available(page))
 		atomic_set(compound_pincount_ptr(page), 0);
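+	/* For PUD THPs, also start each PMD-level mapcount at -1 (unmapped). */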
+	if (order == HPAGE_PUD_ORDER)
+		for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR)
+			atomic_set(sub_compound_mapcount_ptr(&page[i], 1), -1);
 }
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
@@ -1132,6 +1135,16 @@ static int free_tail_pages_check(struct page *head_page, struct page *page)
 		 */
 		break;
 	default:
+		/* sub_compound_mapcount_ptr is stored here */
+		if (compound_order(head_page) == HPAGE_PUD_ORDER &&
+			(page - head_page) % HPAGE_PMD_NR == 3) {
+			if (unlikely(atomic_read(&page->compound_mapcount) != -1)) {
+				pr_err("sub_compound_mapcount: %d\n",
+				       atomic_read(&page->compound_mapcount) + 1);
+				bad_page(page, "nonzero sub_compound_mapcount");
+			}
+			break;
+		}
 		if (page->mapping != TAIL_MAPPING) {
 			bad_page(page, "corrupted mapping in tail page");
 			goto out;
@@ -1183,8 +1196,14 @@ static __always_inline bool free_pages_prepare(struct page *page,
 
 		VM_BUG_ON_PAGE(compound && compound_order(page) != order, page);
 
-		if (compound)
+		if (compound) {
 			ClearPageDoubleMap(page);
+			if (order == HPAGE_PUD_ORDER) {
+				ClearPagePUDDoubleMap(page);
+				for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR)
+					ClearPageDoubleMap(&page[i]);
+			}
+		}
 		for (i = 1; i < (1 << order); i++) {
 			if (compound)
 				bad += free_tail_pages_check(page, page + i);
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index a3752c82a7b2..c190140637c9 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -160,7 +160,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 		if (walk->vma) {
 			split_huge_pud(walk->vma, pudp, addr);
 			pud = READ_ONCE(*pudp);
-			if (pud_none(pud))
+			if (pud_trans_unstable(&pud))
 				goto again;
 		}
 
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index a014cf847067..2b83dd4807e5 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -218,6 +218,17 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 }
 #endif
 
+#ifndef __HAVE_ARCH_PUDP_INVALIDATE
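+/* PUD analogue of pmdp_invalidate(): make the entry non-present, then flush. */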
+pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+		     pud_t *pudp)
+{
+	pud_t old = pudp_establish(vma, address, pudp, pud_mknotpresent(*pudp));
+
+	flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
+	return old;
+}
+#endif
+
 #ifndef pmdp_collapse_flush
 pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 			  pmd_t *pmdp)
diff --git a/mm/rmap.c b/mm/rmap.c
index 7fc0bf07b9bc..b4950f7a0978 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1132,10 +1132,21 @@ void do_page_add_anon_rmap(struct page *page,
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
 
 	if (compound) {
-		atomic_t *mapcount;
+		atomic_t *mapcount = NULL;
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-		mapcount = compound_mapcount_ptr(page);
+		if (compound_order(page) == HPAGE_PUD_ORDER) {
+			if (map_order == HPAGE_PUD_ORDER) {
+				mapcount = compound_mapcount_ptr(page);
+			} else if (map_order == HPAGE_PMD_ORDER) {
+				VM_BUG_ON(!PMDPageInPUD(page));
+				mapcount = sub_compound_mapcount_ptr(page, 1);
+			} else
+				VM_BUG_ON(1);
+		} else if (compound_order(page) == HPAGE_PMD_ORDER) {
+			mapcount = compound_mapcount_ptr(page);
+		} else
+			VM_BUG_ON(1);
 		first = atomic_inc_and_test(mapcount);
 	} else {
 		first = atomic_inc_and_test(&page->_mapcount);
@@ -1150,7 +1161,7 @@ void do_page_add_anon_rmap(struct page *page,
 		 * disabled.
 		 */
 		if (compound) {
-			if (nr == HPAGE_PMD_NR)
+			if (map_order == HPAGE_PMD_ORDER)
 				__inc_lruvec_page_state(page, NR_ANON_THPS);
 			else
 				__inc_lruvec_page_state(page, NR_ANON_THPS_PUD);
@@ -1197,10 +1208,15 @@ void page_add_new_anon_rmap(struct page *page,
 		if (hpage_pincount_available(page))
 			atomic_set(compound_pincount_ptr(page), 0);
 
-		if (nr == HPAGE_PMD_NR)
-			__inc_lruvec_page_state(page, NR_ANON_THPS);
-		else
+		if (map_order == HPAGE_PUD_ORDER) {
+			VM_BUG_ON(compound_order(page) != HPAGE_PUD_ORDER);
+			/* Anon THP always mapped first with PMD */
 			__inc_lruvec_page_state(page, NR_ANON_THPS_PUD);
+		} else if (map_order == HPAGE_PMD_ORDER) {
+			VM_BUG_ON(compound_order(page) != HPAGE_PMD_ORDER);
+			__inc_lruvec_page_state(page, NR_ANON_THPS);
+		} else
+			VM_BUG_ON(1);
 	} else {
 		/* Anon THP always mapped first with PMD */
 		VM_BUG_ON_PAGE(PageTransCompound(page), page);
@@ -1294,10 +1310,38 @@ static void page_remove_file_rmap(struct page *page, bool compound)
 
 static void page_remove_anon_compound_rmap(struct page *page, int map_order)
 {
-	int i, nr;
+	int i, nr = 0;
+	struct page *head = compound_head(page);
+
+	if (compound_order(head) == HPAGE_PUD_ORDER) {
+		if (map_order == HPAGE_PMD_ORDER) {
+			VM_BUG_ON(!PMDPageInPUD(page));
+			if (atomic_add_negative(-1, sub_compound_mapcount_ptr(page, 1))) {
+				if (TestClearPageDoubleMap(page)) {
+					/*
+					 * Subpages can be mapped with PTEs too. Check how many of
+					 * them are still mapped.
+					 */
+					for (i = 0; i < thp_nr_pages(head); i++) {
+						if (atomic_add_negative(-1, &head[i]._mapcount))
+							nr++;
+					}
+				}
+				__dec_node_page_state(page, NR_ANON_THPS);
+			}
+			nr += HPAGE_PMD_NR;
+			__mod_node_page_state(page_pgdat(head), NR_ANON_MAPPED, -nr);
+			return;
+		}
 
-	if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
-		return;
+		VM_BUG_ON(map_order != HPAGE_PUD_ORDER);
+		if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
+			return;
+	} else if (compound_order(head) == HPAGE_PMD_ORDER) {
+		if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
+			return;
+	} else
+		VM_BUG_ON_PAGE(1, page);
 
 	/* Hugepages are not counted in NR_ANON_PAGES for now. */
 	if (unlikely(PageHuge(page)))
@@ -1308,10 +1352,31 @@ static void page_remove_anon_compound_rmap(struct page *page, int map_order)
 
 	if (map_order == HPAGE_PMD_ORDER)
 		__dec_lruvec_page_state(page, NR_ANON_THPS);
-	else
+	else if (map_order == HPAGE_PUD_ORDER)
 		__dec_lruvec_page_state(page, NR_ANON_THPS_PUD);
+	else
+		VM_BUG_ON(1);
 
-	if (TestClearPageDoubleMap(page)) {
+	/* PMD-mapped PUD THP is handled above */
+	if (TestClearPagePUDDoubleMap(head)) {
+		VM_BUG_ON(!(compound_order(head) == HPAGE_PUD_ORDER || head == page));
+		/*
+		 * Subpages can be mapped with PMDs too. Check how many of
+		 * them are still mapped.
+		 */
+		for (i = 0, nr = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR) {
+			if (atomic_add_negative(-1, sub_compound_mapcount_ptr(&head[i], 1)))
+				nr += HPAGE_PMD_NR;
+		}
+		/*
+		 * Queue the page for deferred split if at least one pmd page
+		 * of the pud compound page is unmapped, but at least one
+		 * pmd page is still mapped.
+		 */
+		if (nr && nr < thp_nr_pages(head))
+			deferred_split_huge_page(head);
+	} else if (TestClearPageDoubleMap(head)) {
+		VM_BUG_ON(compound_order(head) != HPAGE_PMD_ORDER);
 		/*
 		 * Subpages can be mapped with PTEs too. Check how many of
 		 * them are still mapped.
@@ -1335,8 +1400,10 @@ static void page_remove_anon_compound_rmap(struct page *page, int map_order)
 	if (unlikely(PageMlocked(page)))
 		clear_page_mlock(page);
 
-	if (nr)
-		__mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr);
+	if (nr) {
+		__mod_lruvec_page_state(head, NR_ANON_MAPPED, -nr);
+		deferred_split_huge_page(head);
+	}
 }
 
 /**
diff --git a/mm/swap.c b/mm/swap.c
index 7e79829a2e73..43c18e5b6916 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1005,6 +1005,36 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
 					  page_lru(page_tail));
 	}
 }
+
+/* used by __split_huge_pud_page_tail() */
+void lru_add_pud_page_tail(struct page *page, struct page *page_tail,
+		       struct lruvec *lruvec, struct list_head *list)
+{
+	VM_BUG_ON_PAGE(!PageHead(page), page);
+	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
+	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
+
+	if (!list)
+		SetPageLRU(page_tail);
+
+	if (likely(PageLRU(page)))
+		list_add_tail(&page_tail->lru, &page->lru);
+	else if (list) {
+		/* page reclaim is reclaiming a huge page */
+		get_page(page_tail);
+		list_add_tail(&page_tail->lru, list);
+	} else {
+		/*
+		 * Head page has not yet been counted, as an hpage,
+		 * so we must account for each subpage individually.
+		 *
+		 * Put page_tail on the list at the correct position
+		 * so they all end up in order.
+		 */
+		add_page_to_lru_list_tail(page_tail, lruvec,
+					  page_lru(page_tail));
+	}
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
diff --git a/mm/util.c b/mm/util.c
index bb902f5a6582..e22d04d9e020 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -653,6 +653,12 @@ bool page_mapped(struct page *page)
 	page = compound_head(page);
 	if (atomic_read(compound_mapcount_ptr(page)) >= 0)
 		return true;
+	if (compound_order(page) == HPAGE_PUD_ORDER) {
+		for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR) {
+			if (sub_compound_mapcount(page + i) > 0)
+				return true;
+		}
+	}
 	if (PageHuge(page))
 		return false;
 	for (i = 0; i < compound_nr(page); i++) {
@@ -713,17 +719,27 @@ struct address_space *page_mapping_file(struct page *page)
 int __page_mapcount(struct page *page)
 {
 	int ret;
+	struct page *head = compound_head(page);
 
+	/* base page mapping */
 	ret = atomic_read(&page->_mapcount) + 1;
+
+	/* PMD-level (PMDPageInPUD) mapping */
+	if (compound_order(head) == HPAGE_PUD_ORDER) {
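+		/* round down to the first subpage of this page's PMD-sized group */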
+		struct page *sub_compound_page = head +
+			(((page - head) / HPAGE_PMD_NR) * HPAGE_PMD_NR);
+
+		ret += sub_compound_mapcount(sub_compound_page);
+	}
 	/*
 	 * For file THP page->_mapcount contains total number of mapping
 	 * of the page: no need to look into compound_mapcount.
 	 */
 	if (!PageAnon(page) && !PageHuge(page))
 		return ret;
-	page = compound_head(page);
-	ret += atomic_read(compound_mapcount_ptr(page)) + 1;
-	if (PageDoubleMap(page))
+	/* highest compound mapping */
+	ret += atomic_read(compound_mapcount_ptr(head)) + 1;
+	if (PageDoubleMap(head))
 		ret--;
 	return ret;
 }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index a9e50ef6a40d..2bb702d79f01 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1331,6 +1331,10 @@ const char * const vmstat_text[] = {
 	"thp_fault_fallback_pud",
 	"thp_fault_fallback_pud_charge",
 	"thp_split_pud",
+	"thp_split_pud_page",
+	"thp_split_pud_page_failed",
+	"thp_zero_pud_page_alloc",
+	"thp_zero_pud_page_alloc_failed",
 #endif
 	"thp_zero_page_alloc",
 	"thp_zero_page_alloc_failed",
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH v2 15/30] mm: thp: add PUD THP to deferred split list when PUD mapping is gone.
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
                   ` (13 preceding siblings ...)
  2020-09-28 17:54 ` [RFC PATCH v2 14/30] mm: thp: add PUD THP split_huge_pud_page() function Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 16/30] mm: debug: adapt dump_page to PUD THP Zi Yan
                   ` (15 subsequent siblings)
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

When the PUD mapping is gone, there is no need to keep the PUD THP. Add it
to the deferred split list, so that when memory pressure comes, the THP
will be split.
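
The trigger boils down to the following (a sketch; the actual hunk below
sits in page_remove_anon_compound_rmap() and uses helpers introduced
earlier in this series):

	/* last PUD-level mapping is gone, PMD-level mappings may remain */
	if (!compound_mapcount(head))
		deferred_split_huge_page(head);

Under memory pressure, deferred_split_scan() then picks the page up and
splits it via split_huge_pud_page() (see patch 14).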

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/rmap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/rmap.c b/mm/rmap.c
index b4950f7a0978..424322807966 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1329,6 +1329,9 @@ static void page_remove_anon_compound_rmap(struct page *page, int map_order)
 				}
 				__dec_node_page_state(page, NR_ANON_THPS);
 			}
+			/* deferred split huge pud page if PUD map is gone */
+			if (!compound_mapcount(head))
+				deferred_split_huge_page(head);
 			nr += HPAGE_PMD_NR;
 			__mod_node_page_state(page_pgdat(head), NR_ANON_MAPPED, -nr);
 			return;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH v2 16/30] mm: debug: adapt dump_page to PUD THP.
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
                   ` (14 preceding siblings ...)
  2020-09-28 17:54 ` [RFC PATCH v2 15/30] mm: thp: add PUD THP to deferred split list when PUD mapping is gone Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 17/30] mm: thp: PUD THP COW splits PUD page and falls back to PMD page Zi Yan
                   ` (14 subsequent siblings)
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

Since the order of a PUD THP is greater than MAX_ORDER, do not consider
its tail pages corrupted. Also print sub_compound_mapcount when dumping
a PMDPageInPUD.
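
For reference, on x86_64 with 4KB base pages and the default MAX_ORDER of
11 (assumed here for the arithmetic, not spelled out in the patch):

	HPAGE_PUD_ORDER    = PUD_SHIFT - PAGE_SHIFT = 30 - 12 = 18
	compound_nr(head)  = 1 << 18 = 262144 pages (1GB)
	MAX_ORDER_NR_PAGES = 1 << (MAX_ORDER - 1) = 1024 pages (4MB)

A valid tail page of a PUD THP can therefore sit far beyond
head + MAX_ORDER_NR_PAGES, so the bound check must use compound_nr(head).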

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/debug.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/debug.c b/mm/debug.c
index ccca576b2899..f5b035dc620d 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -68,7 +68,9 @@ void __dump_page(struct page *page, const char *reason)
 		goto hex_only;
 	}
 
-	if (page < head || (page >= head + MAX_ORDER_NR_PAGES)) {
+	if (page < head ||
+	    (page >= head + max_t(unsigned long, compound_nr(head),
+				  (unsigned long)MAX_ORDER_NR_PAGES))) {
 		/*
 		 * Corrupt page, so we cannot call page_mapping. Instead, do a
 		 * safe subset of the steps that page_mapping() does. Caution:
@@ -109,6 +111,8 @@ void __dump_page(struct page *page, const char *reason)
 					head, compound_order(head),
 					head_compound_mapcount(head));
 		}
+		if (compound_order(head) == HPAGE_PUD_ORDER && PMDPageInPUD(page))
+			pr_warn("sub_compound_mapcount:%d\n", sub_compound_mapcount(page));
 	}
 	if (PageKsm(page))
 		type = "ksm ";
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH v2 17/30] mm: thp: PUD THP COW splits PUD page and falls back to PMD page.
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
                   ` (15 preceding siblings ...)
  2020-09-28 17:54 ` [RFC PATCH v2 16/30] mm: debug: adapt dump_page to PUD THP Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 18/30] mm: thp: PUD THP follow_p*d_page() support Zi Yan
                   ` (13 subsequent siblings)
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

COW on PUD THPs has the same behavior as COW on PMD THPs to avoid high
COW overhead. As a result, do_huge_pmd_wp_page will see PMD-mapped PUD
THPs and thus needs to count PUD mappings in the total mapcount when
calling page_trans_huge_map_swapcount in reuse_swap_page, to avoid a
false positive. Change page_trans_huge_map_swapcount to get it right.
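
The resulting write-fault path looks roughly like this (a sketch under the
assumptions of this series, not a verbatim call chain):

	wp_huge_pud()
	  -> do_huge_pud_wp_page()             /* new: split the PUD mapping */
	       -> __split_huge_pud()           /* THP becomes PMD-mapped */
	  -> VM_FAULT_FALLBACK                 /* fault retries at PMD level */
	wp_huge_pmd()
	  -> do_huge_pmd_wp_page()             /* sees a PMD-mapped PUD THP */
	       -> reuse_swap_page()
	            -> page_trans_huge_map_swapcount()  /* counts PUD mappings */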

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/huge_mm.h |  5 +++++
 mm/huge_memory.c        | 13 +++++++++++++
 mm/memory.c             |  3 +--
 mm/swapfile.c           |  7 ++++++-
 4 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e5c68e680907..589e5af5a1c2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -19,6 +19,7 @@ extern int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 extern void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud);
 extern int do_huge_pud_anonymous_page(struct vm_fault *vmf);
+extern vm_fault_t do_huge_pud_wp_page(struct vm_fault *vmf, pud_t orig_pud);
 #else
 static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
 {
@@ -27,6 +28,10 @@ extern int do_huge_pud_anonymous_page(struct vm_fault *vmf)
 {
 	return VM_FAULT_FALLBACK;
 }
+static inline vm_fault_t do_huge_pud_wp_page(struct vm_fault *vmf, pud_t orig_pud)
+{
+	return VM_FAULT_FALLBACK;
+}
 #endif
 
 extern vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4a899e856088..9aa19aa643cd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1335,6 +1335,19 @@ void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
 unlock:
 	spin_unlock(vmf->ptl);
 }
+
+vm_fault_t do_huge_pud_wp_page(struct vm_fault *vmf, pud_t orig_pud)
+{
+	struct vm_area_struct *vma = vmf->vma;
+
+	/*
+	 * split pud directly. a whole pud page is not swappable, so there is
+	 * no need to try reuse_swap_page
+	 */
+	__split_huge_pud(vma, vmf->pud, vmf->address, false, NULL);
+	return VM_FAULT_FALLBACK;
+}
+
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
 void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd)
diff --git a/mm/memory.c b/mm/memory.c
index e0e0459c0caf..ab80d13807aa 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4141,9 +4141,8 @@ static vm_fault_t create_huge_pud(struct vm_fault *vmf)
 static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	/* No support for anonymous transparent PUD pages yet */
 	if (vma_is_anonymous(vmf->vma))
-		return VM_FAULT_FALLBACK;
+		return do_huge_pud_wp_page(vmf, orig_pud);
 	if (vmf->vma->vm_ops->huge_fault)
 		return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 495ecdbd7859..a6989b0c4d44 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1635,7 +1635,12 @@ static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
 	/* hugetlbfs shouldn't call it */
 	VM_BUG_ON_PAGE(PageHuge(page), page);
 
-	if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!PageTransCompound(page))) {
+	if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!PageTransCompound(page)) ||
+	    /*
+	     * PMD-mapped PUD THP need to take PUD mappings into account by
+	     * using page_trans_huge_mapcount
+	     */
+	    unlikely(thp_order(page) == HPAGE_PUD_ORDER)) {
 		mapcount = page_trans_huge_mapcount(page, total_mapcount);
 		if (PageSwapCache(page))
 			swapcount = page_swapcount(page);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH v2 18/30] mm: thp: PUD THP follow_p*d_page() support.
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
                   ` (16 preceding siblings ...)
  2020-09-28 17:54 ` [RFC PATCH v2 17/30] mm: thp: PUD THP COW splits PUD page and falls back to PMD page Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 19/30] mm: stats: make smap stats understand PUD THPs Zi Yan
                   ` (12 subsequent siblings)
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

Add follow_page support for PUD THPs.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/huge_mm.h | 11 +++++++
 mm/gup.c                | 60 ++++++++++++++++++++++++++++++++-
 mm/huge_memory.c        | 73 ++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 142 insertions(+), 2 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 589e5af5a1c2..c7bc40c4a5e2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -20,6 +20,10 @@ extern int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 extern void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud);
 extern int do_huge_pud_anonymous_page(struct vm_fault *vmf);
 extern vm_fault_t do_huge_pud_wp_page(struct vm_fault *vmf, pud_t orig_pud);
+extern struct page *follow_trans_huge_pud(struct vm_area_struct *vma,
+					  unsigned long addr,
+					  pud_t *pud,
+					  unsigned int flags);
 #else
 static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
 {
@@ -32,6 +36,13 @@ extern vm_fault_t do_huge_pud_wp_page(struct vm_fault *vmf, pud_t orig_pud)
 {
 	return VM_FAULT_FALLBACK;
 }
+static inline struct page *follow_trans_huge_pud(struct vm_area_struct *vma,
+					  unsigned long addr,
+					  pud_t *pud,
+					  unsigned int flags)
+{
+	return NULL;
+}
 #endif
 
 extern vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd);
diff --git a/mm/gup.c b/mm/gup.c
index b21cc220f036..972cca69f228 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -696,10 +696,68 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
 		if (page)
 			return page;
 	}
+
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+	if (likely(!pud_trans_huge(*pud))) {
+		if (unlikely(pud_bad(*pud)))
+			return no_page_table(vma, flags);
+		return follow_pmd_mask(vma, address, pud, flags, ctx);
+	}
+
+	ptl = pud_lock(mm, pud);
+
+	if (unlikely(!pud_trans_huge(*pud))) {
+		spin_unlock(ptl);
+		if (unlikely(pud_bad(*pud)))
+			return no_page_table(vma, flags);
+		return follow_pmd_mask(vma, address, pud, flags, ctx);
+	}
+
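+	/*
+	 * FOLL_SPLIT: split the PUD THP down to base pages before returning.
+	 * The huge zero page is split in place; a real page is split via
+	 * split_huge_pud_page() and then split_huge_page().
+	 */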
+	if (flags & FOLL_SPLIT) {
+		int ret;
+		pmd_t *pmd = NULL;
+
+		page = pud_page(*pud);
+		if (is_huge_zero_page(page)) {
+			spin_unlock(ptl);
+			ret = 0;
+			split_huge_pud(vma, pud, address);
+			pmd = pmd_offset(pud, address);
+			split_huge_pmd(vma, pmd, address);
+			if (pmd_trans_unstable(pmd))
+				ret = -EBUSY;
+		} else {
+			get_page(page);
+			spin_unlock(ptl);
+			lock_page(page);
+			ret = split_huge_pud_page(page);
+			if (!ret)
+				ret = split_huge_page(page);
+			else {
+				unlock_page(page);
+				put_page(page);
+				goto out;
+			}
+			unlock_page(page);
+			put_page(page);
+			if (pud_none(*pud))
+				return no_page_table(vma, flags);
+			pmd = pmd_offset(pud, address);
+		}
+out:
+		return ret ? ERR_PTR(ret) :
+			follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
+	}
+	page = follow_trans_huge_pud(vma, address, pud, flags);
+	spin_unlock(ptl);
+	ctx->page_mask = HPAGE_PUD_NR - 1;
+	return page;
+#else
 	if (unlikely(pud_bad(*pud)))
 		return no_page_table(vma, flags);
-
 	return follow_pmd_mask(vma, address, pud, flags, ctx);
+#endif
 }
 
 static struct page *follow_p4d_mask(struct vm_area_struct *vma,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9aa19aa643cd..61ae7a0ded84 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1258,6 +1258,77 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 	return page;
 }
 
+/*
+ * FOLL_FORCE can write to even unwritable puds, but only
+ * after we've gone through a COW cycle and they are dirty.
+ */
+static inline bool can_follow_write_pud(pud_t pud, unsigned int flags)
+{
+	return pud_write(pud) ||
+	       ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pud_dirty(pud));
+}
+
+struct page *follow_trans_huge_pud(struct vm_area_struct *vma,
+				   unsigned long addr,
+				   pud_t *pud,
+				   unsigned int flags)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct page *page = NULL;
+
+	assert_spin_locked(pud_lockptr(mm, pud));
+
+	if (flags & FOLL_WRITE && !can_follow_write_pud(*pud, flags))
+		goto out;
+
+	/* Avoid dumping huge zero page */
+	if ((flags & FOLL_DUMP) && is_huge_zero_pud(*pud))
+		return ERR_PTR(-EFAULT);
+
+	/* Full NUMA hinting faults to serialise migration in fault paths */
+	/* && pud_protnone(*pud) */
+	if ((flags & FOLL_NUMA))
+		goto out;
+
+	page = pud_page(*pud);
+	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+	if (flags & FOLL_TOUCH)
+		touch_pud(vma, addr, pud, flags);
+	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
+		/*
+		 * We don't mlock() pte-mapped THPs. This way we can avoid
+		 * leaking mlocked pages into non-VM_LOCKED VMAs.
+		 *
+		 * For anon THP:
+		 *
+		 * We do the same thing as PMD-level THP.
+		 *
+		 * For file THP:
+		 *
+		 * No support yet.
+		 *
+		 */
+
+		if (PageAnon(page) && compound_mapcount(page) != 1)
+			goto skip_mlock;
+		if (PagePUDDoubleMap(page) || !page->mapping)
+			goto skip_mlock;
+		if (!trylock_page(page))
+			goto skip_mlock;
+		lru_add_drain();
+		if (page->mapping && !PagePUDDoubleMap(page))
+			mlock_vma_page(page);
+		unlock_page(page);
+	}
+skip_mlock:
+	page += (addr & ~HPAGE_PUD_MASK) >> PAGE_SHIFT;
+	VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
+	if (flags & FOLL_GET)
+		get_page(page);
+
+out:
+	return page;
+}
+
 int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
 		  struct vm_area_struct *vma)
@@ -1462,7 +1533,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 		goto out;
 
 	page = pmd_page(*pmd);
-	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page) && !PMDPageInPUD(page), page);
 
 	if (!try_grab_page(page, flags))
 		return ERR_PTR(-ENOMEM);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH v2 19/30] mm: stats: make smap stats understand PUD THPs.
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
                   ` (17 preceding siblings ...)
  2020-09-28 17:54 ` [RFC PATCH v2 18/30] mm: thp: PUD THP follow_p*d_page() support Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 20/30] mm: page_vma_walk: teach it about PMD-mapped PUD THP Zi Yan
                   ` (11 subsequent siblings)
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 fs/proc/task_mmu.c | 68 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 63 insertions(+), 5 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index a21484b1414d..077196182288 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -430,10 +430,9 @@ static void smaps_page_accumulate(struct mem_size_stats *mss,
 }
 
 static void smaps_account(struct mem_size_stats *mss, struct page *page,
-		bool compound, bool young, bool dirty, bool locked)
+		unsigned long size, bool young, bool dirty, bool locked)
 {
-	int i, nr = compound ? compound_nr(page) : 1;
-	unsigned long size = nr * PAGE_SIZE;
+	int i, nr = size / PAGE_SIZE;
 
 	/*
 	 * First accumulate quantities that depend only on |size| and the type
@@ -530,7 +529,7 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 	if (!page)
 		return;
 
-	smaps_account(mss, page, false, pte_young(*pte), pte_dirty(*pte), locked);
+	smaps_account(mss, page, PAGE_SIZE, pte_young(*pte), pte_dirty(*pte), locked);
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -561,8 +560,44 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
 		/* pass */;
 	else
 		mss->file_thp += HPAGE_PMD_SIZE;
-	smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd), locked);
+	smaps_account(mss, page, HPAGE_PMD_SIZE, pmd_young(*pmd),
+		      pmd_dirty(*pmd), locked);
 }
+
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+static void smaps_pud_entry(pud_t *pud, unsigned long addr,
+		struct mm_walk *walk)
+{
+	struct mem_size_stats *mss = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	bool locked = !!(vma->vm_flags & VM_LOCKED);
+	struct page *page = NULL;
+
+	if (pud_present(*pud)) {
+		/* FOLL_DUMP will return -EFAULT on huge zero page */
+		page = follow_trans_huge_pud(vma, addr, pud, FOLL_DUMP);
+	}
+	if (IS_ERR_OR_NULL(page))
+		return;
+	if (PageAnon(page))
+		mss->anonymous_thp += HPAGE_PUD_SIZE;
+	else if (PageSwapBacked(page))
+		mss->shmem_thp += HPAGE_PUD_SIZE;
+	else if (is_zone_device_page(page))
+		/* pass */;
+	else
+		mss->file_thp += HPAGE_PUD_SIZE;
+	smaps_account(mss, page, HPAGE_PUD_SIZE, pud_young(*pud),
+		      pud_dirty(*pud), locked);
+}
+#else
+static void smaps_pud_entry(pud_t *pud, unsigned long addr,
+		struct mm_walk *walk)
+{
+}
+#endif
+
 #else
 static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
 		struct mm_walk *walk)
@@ -570,6 +605,28 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
 }
 #endif
 
+static int smaps_pud_range(pud_t pud, pud_t *pudp, unsigned long addr,
+			unsigned long end, struct mm_walk *walk)
+{
+	struct vm_area_struct *vma = walk->vma;
+	spinlock_t *ptl;
+
+	ptl = pud_trans_huge_lock(pudp, vma);
+	if (ptl) {
+		if (memcmp(pudp, &pud, sizeof(pud)) != 0) {
+			walk->action = ACTION_AGAIN;
+			spin_unlock(ptl);
+			return 0;
+		}
+		smaps_pud_entry(pudp, addr, walk);
+		spin_unlock(ptl);
+		walk->action = ACTION_CONTINUE;
+	}
+
+	cond_resched();
+	return 0;
+}
+
 static int smaps_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
 			unsigned long end, struct mm_walk *walk)
 {
@@ -712,6 +769,7 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
 #endif /* HUGETLB_PAGE */
 
 static const struct mm_walk_ops smaps_walk_ops = {
+	.pud_entry		= smaps_pud_range,
 	.pmd_entry		= smaps_pte_range,
 	.hugetlb_entry		= smaps_hugetlb_range,
 };
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH v2 20/30] mm: page_vma_walk: teach it about PMD-mapped PUD THP.
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
                   ` (18 preceding siblings ...)
  2020-09-28 17:54 ` [RFC PATCH v2 19/30] mm: stats: make smap stats understand PUD THPs Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 21/30] mm: thp: PUD THP support in try_to_unmap() Zi Yan
                   ` (10 subsequent siblings)
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

We now have PMD-mapped and PTE-mapped PUD THPs, so page_vma_mapped_walk()
must handle them properly: it walks the PMD entries covering a PUD THP
and falls through to the PTE level when a PMD has been split. A sketch
of the resulting contract follows.
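
What a caller sees after this change, as a hedged sketch (the level is
inferred from which of pvmw.pte/pvmw.pmd/pvmw.pud is set; names match
the code below):

	while (page_vma_mapped_walk(&pvmw)) {
		if (pvmw.pte) {
			/* PTE-mapped subpage (hugetlb also uses pvmw.pte) */
		} else if (pvmw.pmd) {
			/* PMD leaf: PMD THP or PMD-mapped part of a PUD THP */
		} else if (pvmw.pud) {
			/* PUD leaf: PUD-mapped PUD THP */
		}
	}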

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/page_vma_mapped.c | 152 +++++++++++++++++++++++++++++++++----------
 1 file changed, 118 insertions(+), 34 deletions(-)

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index f88e845ad5e6..5a3c1b561ff5 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -7,6 +7,12 @@
 
 #include "internal.h"
 
+enum check_pmd_result {
+	PVM_NOT_MAPPED = 0,
+	PVM_LEAF_ENTRY,
+	PVM_NONLEAF_ENTRY,
+};
+
 static inline bool not_found(struct page_vma_mapped_walk *pvmw)
 {
 	page_vma_mapped_walk_done(pvmw);
@@ -52,6 +58,22 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
 	return true;
 }
 
+static bool map_pmd(struct page_vma_mapped_walk *pvmw)
+{
+	pmd_t pmde;
+
+	pvmw->pmd = pmd_offset(pvmw->pud, pvmw->address);
+	pmde = READ_ONCE(*pvmw->pmd);
+	if (!pmd_present(pmde) && !is_pmd_migration_entry(pmde))
+		return false;
+
+	pvmw->ptl = pmd_lock(pvmw->vma->vm_mm, pvmw->pmd);
+	return true;
+}
+
 static inline bool pfn_is_match(struct page *page, unsigned long pfn)
 {
 	unsigned long page_pfn = page_to_pfn(page);
@@ -115,6 +137,57 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 	return pfn_is_match(pvmw->page, pfn);
 }
 
+/**
+ * check_pmd - check if @pvmw->page is mapped at the @pvmw->pmd
+ *
+ * page_vma_mapped_walk() found a place where @pvmw->page is *potentially*
+ * mapped. check_pmd() has to validate this.
+ *
+ * @pvmw->pmd may point to an empty PMD, a migration PMD, a PMD mapping a
+ * huge page, or a PMD pointing to a PTE page table page.
+ *
+ * If PVMW_MIGRATION flag is set, returns PVM_LEAF_ENTRY if @pvmw->pmd contains
+ * migration entry that points to @pvmw->page.
+ *
+ * If PVMW_MIGRATION flag is not set, returns PVM_LEAF_ENTRY if @pvmw->pmd
+ * points to @pvmw->page.
+ *
+ * If @pvmw->pmd points to a PTE page table page, returns PVM_NONLEAF_ENTRY.
+ *
+ * Otherwise, return PVM_NOT_MAPPED.
+ *
+ */
+static enum check_pmd_result check_pmd(struct page_vma_mapped_walk *pvmw)
+{
+	unsigned long pfn;
+
+	if (likely(pmd_trans_huge(*pvmw->pmd))) {
+		if (pvmw->flags & PVMW_MIGRATION)
+			return PVM_NOT_MAPPED;
+		pfn = pmd_pfn(*pvmw->pmd);
+		if (!pfn_is_match(pvmw->page, pfn))
+			return PVM_NOT_MAPPED;
+		return PVM_LEAF_ENTRY;
+	} else if (!pmd_present(*pvmw->pmd)) {
+		if (thp_migration_supported()) {
+			if (!(pvmw->flags & PVMW_MIGRATION))
+				return PVM_NOT_MAPPED;
+			if (is_migration_entry(pmd_to_swp_entry(*pvmw->pmd))) {
+				swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd);
+
+				pfn = migration_entry_to_pfn(entry);
+				if (!pfn_is_match(pvmw->page, pfn))
+					return PVM_NOT_MAPPED;
+				return PVM_LEAF_ENTRY;
+			}
+		}
+		return PVM_NOT_MAPPED;
+	}
+	/* THP pmd was split under us: handle on pte level */
+	spin_unlock(pvmw->ptl);
+	pvmw->ptl = NULL;
+	return PVM_NONLEAF_ENTRY;
+}
+
 /**
  * page_vma_mapped_walk - check if @pvmw->page is mapped in @pvmw->vma at
  * @pvmw->address
@@ -146,14 +219,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t pude;
-	pmd_t pmde;
+	enum check_pmd_result pmd_check_res;
 
 	if (!pvmw->pte && !pvmw->pmd && pvmw->pud)
 		return not_found(pvmw);
 
 	/* The only possible pmd mapping has been handled on last iteration */
 	if (pvmw->pmd && !pvmw->pte)
-		return not_found(pvmw);
+		goto next_pmd;
 
 	if (pvmw->pte)
 		goto next_pte;
@@ -202,42 +275,47 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	} else if (!pud_present(pude))
 		return false;
 
-	pvmw->pmd = pmd_offset(pvmw->pud, pvmw->address);
-	/*
-	 * Make sure the pmd value isn't cached in a register by the
-	 * compiler and used as a stale value after we've observed a
-	 * subsequent update.
-	 */
-	pmde = READ_ONCE(*pvmw->pmd);
-	if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
-		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
-		if (likely(pmd_trans_huge(*pvmw->pmd))) {
-			if (pvmw->flags & PVMW_MIGRATION)
-				return not_found(pvmw);
-			if (pmd_page(*pvmw->pmd) != page)
-				return not_found(pvmw);
+	if (!map_pmd(pvmw))
+		goto next_pmd;
+	/* pmd is locked after a successful map_pmd() */
+	while (1) {
+		pmd_check_res = check_pmd(pvmw);
+		if (pmd_check_res == PVM_LEAF_ENTRY)
 			return true;
-		} else if (!pmd_present(*pvmw->pmd)) {
-			if (thp_migration_supported()) {
-				if (!(pvmw->flags & PVMW_MIGRATION))
-					return not_found(pvmw);
-				if (is_migration_entry(pmd_to_swp_entry(*pvmw->pmd))) {
-					swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd);
-
-					if (migration_entry_to_page(entry) != page)
-						return not_found(pvmw);
-					return true;
+		else if (pmd_check_res == PVM_NONLEAF_ENTRY)
+			goto pte_level;
+next_pmd:
+		/* Only PMD-mapped PUD THP has next pmd. */
+		if (!(PageTransHuge(pvmw->page) && compound_order(pvmw->page) == HPAGE_PUD_ORDER))
+			return not_found(pvmw);
+		do {
+			pvmw->address += HPAGE_PMD_SIZE;
+			if (pvmw->address >= pvmw->vma->vm_end ||
+			    pvmw->address >=
+					__vma_address(pvmw->page, pvmw->vma) +
+					thp_nr_pages(pvmw->page) * PAGE_SIZE)
+				return not_found(pvmw);
+			/* Did we cross page table boundary? */
+			if (pvmw->address % PUD_SIZE == 0) {
+				/*
+				 * Reset pmd here, so we will not stay at PMD
+				 * level after restart.
+				 */
+				pvmw->pmd = NULL;
+				if (pvmw->ptl) {
+					spin_unlock(pvmw->ptl);
+					pvmw->ptl = NULL;
 				}
+				goto restart;
+			} else {
+				pvmw->pmd++;
 			}
-			return not_found(pvmw);
-		} else {
-			/* THP pmd was split under us: handle on pte level */
-			spin_unlock(pvmw->ptl);
-			pvmw->ptl = NULL;
-		}
-	} else if (!pmd_present(pmde)) {
-		return false;
+		} while (pmd_none(*pvmw->pmd));
+
+		if (!pvmw->ptl)
+			pvmw->ptl = pmd_lock(mm, pvmw->pmd);
 	}
+pte_level:
 	if (!map_pte(pvmw))
 		goto next_pte;
 	while (1) {
@@ -257,6 +335,12 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			/* Did we cross page table boundary? */
 			if (pvmw->address % PMD_SIZE == 0) {
 				pte_unmap(pvmw->pte);
+				/*
+				 * In the case of PTE-mapped PUD THP, next entry
+				 * can be PMD. Reset pte here, so we will not
+				 * stay at PTE level after restart.
+				 */
+				pvmw->pte = NULL;
 				if (pvmw->ptl) {
 					spin_unlock(pvmw->ptl);
 					pvmw->ptl = NULL;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH v2 21/30] mm: thp: PUD THP support in try_to_unmap().
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
                   ` (19 preceding siblings ...)
  2020-09-28 17:54 ` [RFC PATCH v2 20/30] mm: page_vma_walk: teach it about PMD-mapped PUD THP Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 22/30] mm: thp: split PUD THPs at page reclaim Zi Yan
                   ` (9 subsequent siblings)
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

Unmap subpages of different sized THPs properly in the try_to_unmap()
function. pvmw.pte, pvmw.pmd, and pvmw.pud identify the size of the
mapping being unmapped (see the sketch after this list):

1. pvmw.pte != NULL: PTE mappings, or hugetlb pages.
2. pvmw.pte == NULL and pvmw.pmd != NULL: PMD mappings.
3. pvmw.pte == NULL and pvmw.pmd == NULL and pvmw.pud != NULL: PUD mappings.
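
Condensed from the hunks below (a sketch, not the full code), the
subpage and mapping order are picked as:

	if (pvmw.pte) {
		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
		map_order = PageHuge(page) ? compound_order(page) : 0;
	} else if (pvmw.pmd) {
		subpage = page - page_to_pfn(page) + pmd_pfn(*pvmw.pmd);
		map_order = HPAGE_PMD_ORDER;
	} else if (pvmw.pud) {
		subpage = page - page_to_pfn(page) + pud_pfn(*pvmw.pud);
		map_order = HPAGE_PUD_ORDER;
	}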

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/migrate.c |   2 +-
 mm/rmap.c    | 156 ++++++++++++++++++++++++++++++++++++++-------------
 2 files changed, 117 insertions(+), 41 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index a7320e9d859c..d0e6afe682aa 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -225,7 +225,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
 		/* PMD-mapped THP migration entry */
-		if (!pvmw.pte) {
+		if (!pvmw.pte && pvmw.pmd) {
 			VM_BUG_ON_PAGE(PageHuge(page) || !PageTransCompound(page), page);
 			remove_migration_pmd(&pvmw, new);
 			continue;
diff --git a/mm/rmap.c b/mm/rmap.c
index 424322807966..32f2e0312e16 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1125,6 +1125,7 @@ void do_page_add_anon_rmap(struct page *page,
 {
 	bool compound = flags & RMAP_COMPOUND;
 	bool first;
+	struct page *head = compound_head(page);
 
 	if (unlikely(PageKsm(page)))
 		lock_page_memcg(page);
@@ -1134,7 +1135,7 @@ void do_page_add_anon_rmap(struct page *page,
 	if (compound) {
 		atomic_t *mapcount = NULL;
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+		VM_BUG_ON_PAGE(!PMDPageInPUD(page) && !PageTransHuge(page), page);
 		if (compound_order(page) == HPAGE_PUD_ORDER) {
 			if (map_order == HPAGE_PUD_ORDER) {
 				mapcount = compound_mapcount_ptr(page);
@@ -1143,7 +1144,7 @@ void do_page_add_anon_rmap(struct page *page,
 				mapcount = sub_compound_mapcount_ptr(page, 1);
 			} else
 				VM_BUG_ON(1);
-		} else if (compound_order(page) == HPAGE_PMD_ORDER) {
+		} else if (compound_order(head) == HPAGE_PMD_ORDER) {
 			mapcount = compound_mapcount_ptr(page);
 		} else
 			VM_BUG_ON(1);
@@ -1153,7 +1154,7 @@ void do_page_add_anon_rmap(struct page *page,
 	}
 
 	if (first) {
-		int nr = compound ? thp_nr_pages(page) : 1;
+		int nr = 1 << map_order;
 		/*
 		 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
 		 * these counters are not modified in interrupt context, and
@@ -1474,10 +1475,13 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		.address = address,
 	};
 	pte_t pteval;
-	struct page *subpage;
+	pmd_t pmdval;
+	pud_t pudval;
+	struct page *subpage = NULL;
 	bool ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
+	int map_order = 0;
 
 	/* munlock has nothing to gain from examining un-locked vmas */
 	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
@@ -1487,6 +1491,11 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	    is_zone_device_page(page) && !is_device_private_page(page))
 		return true;
 
+	if (flags & TTU_SPLIT_HUGE_PUD) {
+		split_huge_pud_address(vma, address,
+				flags & TTU_SPLIT_FREEZE, page);
+	}
+
 	if (flags & TTU_SPLIT_HUGE_PMD) {
 		split_huge_pmd_address(vma, address,
 				flags & TTU_SPLIT_FREEZE, page);
@@ -1519,7 +1528,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	while (page_vma_mapped_walk(&pvmw)) {
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
 		/* PMD-mapped THP migration entry */
-		if (!pvmw.pte && (flags & TTU_MIGRATION)) {
+		if (!pvmw.pte && pvmw.pmd && (flags & TTU_MIGRATION)) {
 			VM_BUG_ON_PAGE(PageHuge(page) || !PageTransCompound(page), page);
 
 			set_pmd_migration_entry(&pvmw, page);
@@ -1551,9 +1560,25 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		}
 
 		/* Unexpected PMD-mapped THP? */
-		VM_BUG_ON_PAGE(!pvmw.pte, page);
 
-		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
+		if (pvmw.pte) {
+			subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
+			/*
+			 * PageHuge always uses pvmw.pte to store the
+			 * relevant page table entry.
+			 */
+			if (PageHuge(page))
+				map_order = compound_order(page);
+			else
+				map_order = 0;
+		} else if (!pvmw.pte && pvmw.pmd) {
+			subpage = page - page_to_pfn(page) + pmd_pfn(*pvmw.pmd);
+			map_order = HPAGE_PMD_ORDER;
+		} else if (!pvmw.pte && !pvmw.pmd && pvmw.pud) {
+			subpage = page - page_to_pfn(page) + pud_pfn(*pvmw.pud);
+			map_order = HPAGE_PUD_ORDER;
+		}
+		VM_BUG_ON(!subpage);
 		address = pvmw.address;
 
 		if (PageHuge(page)) {
@@ -1631,8 +1656,12 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		}
 
 		if (!(flags & TTU_IGNORE_ACCESS)) {
-			if (ptep_clear_flush_young_notify(vma, address,
-						pvmw.pte)) {
+			if ((pvmw.pte &&
+				 ptep_clear_flush_young_notify(vma, address, pvmw.pte)) ||
+				((!pvmw.pte && pvmw.pmd) &&
+				 pmdp_clear_flush_young_notify(vma, address, pvmw.pmd)) ||
+				((!pvmw.pte && !pvmw.pmd && pvmw.pud) &&
+				 pudp_clear_flush_young_notify(vma, address, pvmw.pud))) {
 				ret = false;
 				page_vma_mapped_walk_done(&pvmw);
 				break;
@@ -1640,7 +1669,12 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		}
 
 		/* Nuke the page table entry. */
-		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+		if (pvmw.pte)
+			flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+		else if (!pvmw.pte && pvmw.pmd)
+			flush_cache_page(vma, address, pmd_pfn(*pvmw.pmd));
+		else if (!pvmw.pte && !pvmw.pmd && pvmw.pud)
+			flush_cache_page(vma, address, pud_pfn(*pvmw.pud));
 		if (should_defer_flush(mm, flags)) {
 			/*
 			 * We clear the PTE but do not flush so potentially
@@ -1650,16 +1684,34 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			 * transition on a cached TLB entry is written through
 			 * and traps if the PTE is unmapped.
 			 */
-			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
+			if (pvmw.pte) {
+				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
+
+				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
+			} else if (!pvmw.pte && pvmw.pmd) {
+				pmdval = pmdp_huge_get_and_clear(mm, address, pvmw.pmd);
 
-			set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
+				set_tlb_ubc_flush_pending(mm, pmd_dirty(pmdval));
+			} else if (!pvmw.pte && !pvmw.pmd && pvmw.pud) {
+				pudval = pudp_huge_get_and_clear(mm, address, pvmw.pud);
+
+				set_tlb_ubc_flush_pending(mm, pud_dirty(pudval));
+			}
 		} else {
-			pteval = ptep_clear_flush(vma, address, pvmw.pte);
+			if (pvmw.pte)
+				pteval = ptep_clear_flush(vma, address, pvmw.pte);
+			else if (!pvmw.pte && pvmw.pmd)
+				pmdval = pmdp_huge_clear_flush(vma, address, pvmw.pmd);
+			else if (!pvmw.pte && !pvmw.pmd && pvmw.pud)
+				pudval = pudp_huge_clear_flush(vma, address, pvmw.pud);
 		}
 
 		/* Move the dirty bit to the page. Now the pte is gone. */
-		if (pte_dirty(pteval))
-			set_page_dirty(page);
+		if ((pvmw.pte && pte_dirty(pteval)) ||
+		    (!pvmw.pte && pvmw.pmd && pmd_dirty(pmdval)) ||
+		    (!pvmw.pte && !pvmw.pmd && pvmw.pud && pud_dirty(pudval)))
+			set_page_dirty(page);
 
 		/* Update high watermark before we lower rss */
 		update_hiwater_rss(mm);
@@ -1694,35 +1746,59 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		} else if (IS_ENABLED(CONFIG_MIGRATION) &&
 				(flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
 			swp_entry_t entry;
-			pte_t swp_pte;
 
-			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
-				set_pte_at(mm, address, pvmw.pte, pteval);
-				ret = false;
-				page_vma_mapped_walk_done(&pvmw);
-				break;
-			}
+			if (pvmw.pte) {
+				pte_t swp_pte;
 
-			/*
-			 * Store the pfn of the page in a special migration
-			 * pte. do_swap_page() will wait until the migration
-			 * pte is removed and then restart fault handling.
-			 */
-			entry = make_migration_entry(subpage,
-					pte_write(pteval));
-			swp_pte = swp_entry_to_pte(entry);
-			if (pte_soft_dirty(pteval))
-				swp_pte = pte_swp_mksoft_dirty(swp_pte);
-			if (pte_uffd_wp(pteval))
-				swp_pte = pte_swp_mkuffd_wp(swp_pte);
-			set_pte_at(mm, address, pvmw.pte, swp_pte);
-			/*
-			 * No need to invalidate here it will synchronize on
-			 * against the special swap migration pte.
-			 */
+				if (arch_unmap_one(mm, vma, address, pteval) < 0) {
+					set_pte_at(mm, address, pvmw.pte, pteval);
+					ret = false;
+					page_vma_mapped_walk_done(&pvmw);
+					break;
+				}
+
+				/*
+				 * Store the pfn of the page in a special migration
+				 * pte. do_swap_page() will wait until the migration
+				 * pte is removed and then restart fault handling.
+				 */
+				entry = make_migration_entry(subpage,
+						pte_write(pteval));
+				swp_pte = swp_entry_to_pte(entry);
+				if (pte_soft_dirty(pteval))
+					swp_pte = pte_swp_mksoft_dirty(swp_pte);
+				if (pte_uffd_wp(pteval))
+					swp_pte = pte_swp_mkuffd_wp(swp_pte);
+				set_pte_at(mm, address, pvmw.pte, swp_pte);
+				/*
+				 * No need to invalidate here; it will
+				 * synchronize against the special swap
+				 * migration pte.
+				 */
+			} else if (!pvmw.pte && pvmw.pmd) {
+				pmd_t swp_pmd;
+				/*
+				 * Store the pfn of the page in a special migration
+				 * pte. do_swap_page() will wait until the migration
+				 * pte is removed and then restart fault handling.
+				 */
+				entry = make_migration_entry(subpage,
+						pmd_write(pmdval));
+				swp_pmd = swp_entry_to_pmd(entry);
+				if (pmd_soft_dirty(pmdval))
+					swp_pmd = pmd_swp_mksoft_dirty(swp_pmd);
+				set_pmd_at(mm, address, pvmw.pmd, swp_pmd);
+				/*
+				 * No need to invalidate here; it will
+				 * synchronize against the special swap
+				 * migration pte.
+				 */
+			} else if (!pvmw.pte && !pvmw.pmd && pvmw.pud) {
+				VM_BUG_ON(1);
+			}
 		} else if (PageAnon(page)) {
 			swp_entry_t entry = { .val = page_private(subpage) };
 			pte_t swp_pte;
+
+			VM_BUG_ON(!pvmw.pte);
 			/*
 			 * Store the swap location in the pte.
 			 * See handle_pte_fault() ...
@@ -1808,7 +1884,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		 *
 		 * See Documentation/vm/mmu_notifier.rst
 		 */
-		page_remove_rmap(subpage, compound_order(page));
+		page_remove_rmap(subpage, map_order);
 		put_page(page);
 	}
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH v2 22/30] mm: thp: split PUD THPs at page reclaim.
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
                   ` (20 preceding siblings ...)
  2020-09-28 17:54 ` [RFC PATCH v2 21/30] mm: thp: PUD THP support in try_to_unmap() Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 23/30] mm: add PUD THP pagemap support Zi Yan
                   ` (8 subsequent siblings)
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

We cannot swap PUD THPs, so split them before swapping them out. PUD
THPs are split into PMD THPs, so that if THP_SWAP is enabled, the
resulting PMD THPs can be swapped out as a whole. A condensed sketch of
the new reclaim path follows.
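
The reclaim-time cascade, condensed from the vmscan hunks below (a
sketch, not the full code):

	if (compound_order(page) == HPAGE_PUD_ORDER) {
		/* 1GB THP: split to 2MB THPs first; tails stay on page_list */
		if (!can_split_huge_pud_page(page, NULL) ||
		    split_huge_pud_page_to_list(page, page_list))
			goto activate_locked;
		nr_pages = HPAGE_PMD_NR;
	}
	/* the resulting PMD THPs can be swapped whole if THP_SWAP is on */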

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/swap_slots.c |  2 ++
 mm/vmscan.c     | 33 +++++++++++++++++++++++++++------
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 3e6453573a89..65b8742a0446 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -312,6 +312,8 @@ swp_entry_t get_swap_page(struct page *page)
 	entry.val = 0;
 
 	if (PageTransHuge(page)) {
+		if (compound_order(page) == HPAGE_PUD_ORDER)
+			return entry;
 		if (IS_ENABLED(CONFIG_THP_SWAP))
 			get_swap_pages(1, &entry, HPAGE_PMD_NR);
 		goto out;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eae57d092931..12e169af663c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1244,7 +1244,21 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 			if (!PageSwapCache(page)) {
 				if (!(sc->gfp_mask & __GFP_IO))
 					goto keep_locked;
-				if (PageTransHuge(page)) {
+				if (!PageTransHuge(page))
+					goto try_to_swap;
+				if (compound_order(page) == HPAGE_PUD_ORDER) {
+					/* cannot split THP, skip it */
+					if (!can_split_huge_pud_page(page, NULL))
+						goto activate_locked;
+					/* Split PUD THPs before swapping */
+					if (split_huge_pud_page_to_list(page, page_list))
+						goto activate_locked;
+					else {
+						sc->nr_scanned -= (nr_pages - HPAGE_PMD_NR);
+						nr_pages = HPAGE_PMD_NR;
+					}
+				}
+				if (compound_order(page) == HPAGE_PMD_ORDER) {
 					/* cannot split THP, skip it */
 					if (!can_split_huge_page(page, NULL))
 						goto activate_locked;
@@ -1254,14 +1268,17 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 					 * tail pages can be freed without IO.
 					 */
 					if (!compound_mapcount(page) &&
-					    split_huge_page_to_list(page,
-								    page_list))
+						split_huge_page_to_list(page,
+									page_list))
 						goto activate_locked;
 				}
+try_to_swap:
 				if (!add_to_swap(page)) {
 					if (!PageTransHuge(page))
 						goto activate_locked_split;
 					/* Fallback to swap normal pages */
+					VM_BUG_ON_PAGE(compound_order(page) != HPAGE_PMD_ORDER,
+						       page);
 					if (split_huge_page_to_list(page,
 								    page_list))
 						goto activate_locked;
@@ -1278,6 +1295,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 				mapping = page_mapping(page);
 			}
 		} else if (unlikely(PageTransHuge(page))) {
+			VM_BUG_ON_PAGE(compound_order(page) != HPAGE_PMD_ORDER, page);
 			/* Split file THP */
 			if (split_huge_page_to_list(page, page_list))
 				goto keep_locked;
@@ -1303,9 +1321,12 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 			enum ttu_flags flags = ttu_flags | TTU_BATCH_FLUSH;
 			bool was_swapbacked = PageSwapBacked(page);
 
-			if (unlikely(PageTransHuge(page)))
-				flags |= TTU_SPLIT_HUGE_PMD;
-
+			if (unlikely(PageTransHuge(page))) {
+				if (compound_order(page) == HPAGE_PMD_ORDER)
+					flags |= TTU_SPLIT_HUGE_PMD;
+				else if (compound_order(page) == HPAGE_PUD_ORDER)
+					flags |= TTU_SPLIT_HUGE_PUD;
+			}
 			if (!try_to_unmap(page, flags)) {
 				stat->nr_unmap_fail += nr_pages;
 				if (!was_swapbacked && PageSwapBacked(page))
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH v2 23/30] mm: add PUD THP pagemap support.
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
                   ` (21 preceding siblings ...)
  2020-09-28 17:54 ` [RFC PATCH v2 22/30] mm: thp: split PUD THPs at page reclaim Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 24/30] mm: madvise: add page size options to MADV_HUGEPAGE and MADV_NOHUGEPAGE Zi Yan
                   ` (7 subsequent siblings)
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

pagemap_pud_range() is added so that pagemap reports PUD THP entries
properly. A minimal userspace sketch of consuming pagemap entries
follows.
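
A hedged usage sketch, assuming the standard pagemap format (64-bit
entries; bit 63 = present, bits 0-54 = PFN); pagemap_present() is a
hypothetical helper, not part of this patch:

	#include <fcntl.h>
	#include <stdint.h>
	#include <unistd.h>

	static int pagemap_present(uintptr_t vaddr)
	{
		uint64_t ent = 0;
		int fd = open("/proc/self/pagemap", O_RDONLY);

		if (fd < 0)
			return -1;
		if (pread(fd, &ent, sizeof(ent),
			  (vaddr / 4096) * sizeof(ent)) != sizeof(ent))
			ent = 0;
		close(fd);
		return (int)(ent >> 63);	/* PM_PRESENT bit */
	}

Scanning a 1GB PUD THP mapping this way should show 262144 consecutive
present entries with contiguous PFNs.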

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 fs/proc/task_mmu.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 077196182288..04a3158d0d5b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1553,6 +1553,68 @@ static int pagemap_pmd_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
 	return err;
 }
 
+static int pagemap_pud_range(pud_t pud, pud_t *pudp, unsigned long addr,
+			unsigned long end, struct mm_walk *walk)
+{
+	struct vm_area_struct *vma = walk->vma;
+	struct pagemapread *pm = walk->private;
+	spinlock_t *ptl;
+	int err = 0;
+
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+	ptl = pud_trans_huge_lock(pudp, vma);
+	if (ptl) {
+		u64 flags = 0, frame = 0;
+		struct page *page = NULL;
+
+		if (memcmp(pudp, &pud, sizeof(pud)) != 0) {
+			walk->action = ACTION_AGAIN;
+			spin_unlock(ptl);
+			return 0;
+		}
+		if (vma->vm_flags & VM_SOFTDIRTY)
+			flags |= PM_SOFT_DIRTY;
+
+		if (pud_present(pud)) {
+			page = pud_page(pud);
+
+			flags |= PM_PRESENT;
+			if (pud_soft_dirty(pud))
+				flags |= PM_SOFT_DIRTY;
+			if (pm->show_pfn)
+				frame = pud_pfn(pud) +
+					((addr & ~PUD_MASK) >> PAGE_SHIFT);
+		}
+
+		if (page && page_mapcount(page) == 1)
+			flags |= PM_MMAP_EXCLUSIVE;
+
+		for (; addr != end; addr += PAGE_SIZE) {
+			pagemap_entry_t pme = make_pme(frame, flags);
+
+			err = add_to_pagemap(addr, &pme, pm);
+			if (err)
+				break;
+			if (pm->show_pfn) {
+				if (flags & PM_PRESENT)
+					frame++;
+				else if (flags & PM_SWAP)
+					frame += (1 << MAX_SWAPFILES_SHIFT);
+			}
+		}
+		spin_unlock(ptl);
+		walk->action = ACTION_CONTINUE;
+		return err;
+	}
+
+	if (pud_trans_unstable(&pud)) {
+		walk->action = ACTION_AGAIN;
+		return 0;
+	}
+#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
+	return err;
+}
+
 #ifdef CONFIG_HUGETLB_PAGE
 /* This function walks within one hugetlb entry in the single call */
 static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
@@ -1603,6 +1665,7 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
 #endif /* HUGETLB_PAGE */
 
 static const struct mm_walk_ops pagemap_ops = {
+	.pud_entry	= pagemap_pud_range,
 	.pmd_entry	= pagemap_pmd_range,
 	.pte_hole	= pagemap_pte_hole,
 	.hugetlb_entry	= pagemap_hugetlb_range,
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH v2 24/30] mm: madvise: add page size options to MADV_HUGEPAGE and MADV_NOHUGEPAGE.
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
                   ` (22 preceding siblings ...)
  2020-09-28 17:54 ` [RFC PATCH v2 23/30] mm: add PUD THP pagemap support Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 25/30] mm: vma: add VM_HUGEPAGE_PUD to vm_flags at bit 37 Zi Yan
                   ` (6 subsequent siblings)
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

This allows the user to specify, via madvise(), up to what page size the
kernel will generate THPs to back a memory range. Because we now have
both PMD and PUD THPs, which require different amounts of kernel effort
to generate, this prevents users from incurring long page fault latency
when the kernel would otherwise always try to allocate PUD THPs first.
A usage sketch follows.
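
A hedged usage sketch, assuming the uapi flags added by this patch
(error handling omitted):

	#include <sys/mman.h>

	/* ask for THPs up to 1GB on [addr, addr + len) ... */
	madvise(addr, len, MADV_HUGEPAGE | MADV_HUGEPAGE_1GB);
	/* ... and later withdraw the 1GB hint */
	madvise(addr, len, MADV_NOHUGEPAGE | MADV_HUGEPAGE_1GB);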

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/uapi/asm-generic/mman-common.h | 23 +++++++++++++++++++++++
 mm/khugepaged.c                        |  1 +
 mm/madvise.c                           | 17 +++++++++++++++--
 3 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index f94f65d429be..8009acb55fca 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -6,6 +6,7 @@
  Author: Michael S. Tsirkin <mst@mellanox.co.il>, Mellanox Technologies Ltd.
  Based on: asm-xxx/mman.h
 */
+#include <asm-generic/hugetlb_encode.h>
 
 #define PROT_READ	0x1		/* page can be read */
 #define PROT_WRITE	0x2		/* page can be written */
@@ -80,4 +81,26 @@
 #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
 				 PKEY_DISABLE_WRITE)
 
+
+/*
+ * Huge page size encoding when MADV_HUGEPAGE is specified, and a huge page
+ * size other than the default is desired.  See hugetlb_encode.h.
+ */
+#define MADV_HUGEPAGE_SHIFT	HUGETLB_FLAG_ENCODE_SHIFT
+#define MADV_HUGEPAGE_MASK	HUGETLB_FLAG_ENCODE_MASK
+#define MADV_BEHAVIOR_MASK	((1<<MADV_HUGEPAGE_SHIFT) - 1)
+
+#define MADV_HUGEPAGE_64KB	HUGETLB_FLAG_ENCODE_64KB
+#define MADV_HUGEPAGE_512KB	HUGETLB_FLAG_ENCODE_512KB
+#define MADV_HUGEPAGE_1MB	HUGETLB_FLAG_ENCODE_1MB
+#define MADV_HUGEPAGE_2MB	HUGETLB_FLAG_ENCODE_2MB
+#define MADV_HUGEPAGE_8MB	HUGETLB_FLAG_ENCODE_8MB
+#define MADV_HUGEPAGE_16MB	HUGETLB_FLAG_ENCODE_16MB
+#define MADV_HUGEPAGE_32MB	HUGETLB_FLAG_ENCODE_32MB
+#define MADV_HUGEPAGE_256MB	HUGETLB_FLAG_ENCODE_256MB
+#define MADV_HUGEPAGE_512MB	HUGETLB_FLAG_ENCODE_512MB
+#define MADV_HUGEPAGE_1GB	HUGETLB_FLAG_ENCODE_1GB
+#define MADV_HUGEPAGE_2GB	HUGETLB_FLAG_ENCODE_2GB
+#define MADV_HUGEPAGE_16GB	HUGETLB_FLAG_ENCODE_16GB
+
 #endif /* __ASM_GENERIC_MMAN_COMMON_H */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 636a0f32b09e..b34c78085017 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -345,6 +345,7 @@ struct attribute_group khugepaged_attr_group = {
 int hugepage_madvise(struct vm_area_struct *vma,
 		     unsigned long *vm_flags, int advice)
 {
+	advice = advice & MADV_BEHAVIOR_MASK;
 	switch (advice) {
 	case MADV_HUGEPAGE:
 #ifdef CONFIG_S390
diff --git a/mm/madvise.c b/mm/madvise.c
index 16e7b8eadb13..32066cc0b34f 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -40,6 +40,19 @@ struct madvise_walk_private {
 	bool pageout;
 };
 
+static inline int get_behavior(int behavior)
+{
+	int behavior_no_flags = behavior & MADV_BEHAVIOR_MASK;
+	/*
+	 * only MADV_HUGEPAGE and MADV_NOHUGEPAGE have extra huge page size
+	 * flags
+	 */
+	VM_BUG_ON(behavior_no_flags != MADV_HUGEPAGE &&
+		  behavior_no_flags != MADV_NOHUGEPAGE &&
+		  (behavior & (~MADV_BEHAVIOR_MASK)));
+	return behavior_no_flags;
+}
+
 /*
  * Any behaviour which results in changes to the vma->vm_flags needs to
  * take mmap_lock for writing. Others, which simply traverse vmas, need
@@ -74,7 +87,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
 	pgoff_t pgoff;
 	unsigned long new_flags = vma->vm_flags;
 
-	switch (behavior) {
+	switch (get_behavior(behavior)) {
 	case MADV_NORMAL:
 		new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
 		break;
@@ -953,7 +966,7 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 static bool
 madvise_behavior_valid(int behavior)
 {
-	switch (behavior) {
+	switch (get_behavior(behavior)) {
 	case MADV_DOFORK:
 	case MADV_DONTFORK:
 	case MADV_NORMAL:
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH v2 25/30] mm: vma: add VM_HUGEPAGE_PUD to vm_flags at bit 37.
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
                   ` (23 preceding siblings ...)
  2020-09-28 17:54 ` [RFC PATCH v2 24/30] mm: madvise: add page size options to MADV_HUGEPAGE and MADV_NOHUGEPAGE Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 26/30] mm: thp: add a global knob to enable/disable PUD THPs Zi Yan
                   ` (5 subsequent siblings)
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

madvise() can set this bit via MADV_HUGEPAGE | MADV_HUGEPAGE_1GB and
clear it via MADV_NOHUGEPAGE | MADV_HUGEPAGE_1GB. Later, the kernel
checks this bit to decide whether to allocate PUD THPs on a VMA when the
global PUD THP knob is set to madvise.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/mm.h | 6 ++++++
 mm/khugepaged.c    | 9 +++++++++
 2 files changed, 15 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 51b75ffa6a6c..78bee63c64da 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -305,11 +305,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -325,6 +327,10 @@ extern unsigned int kobjsize(const void *objp);
 #endif
 #endif /* CONFIG_ARCH_HAS_PKEYS */
 
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+#define VM_HUGEPAGE_PUD VM_HIGH_ARCH_5
+#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
+
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
 #elif defined(CONFIG_PPC)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b34c78085017..f085c218ea84 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -345,6 +345,9 @@ struct attribute_group khugepaged_attr_group = {
 int hugepage_madvise(struct vm_area_struct *vma,
 		     unsigned long *vm_flags, int advice)
 {
+	/* only support 1GB PUD THP on x86 now */
+	bool use_pud_page = advice & MADV_HUGEPAGE_1GB;
+
 	advice = advice & MADV_BEHAVIOR_MASK;
 	switch (advice) {
 	case MADV_HUGEPAGE:
@@ -359,6 +362,9 @@ int hugepage_madvise(struct vm_area_struct *vma,
 #endif
 		*vm_flags &= ~VM_NOHUGEPAGE;
 		*vm_flags |= VM_HUGEPAGE;
+
+		if (use_pud_page)
+			*vm_flags |= VM_HUGEPAGE_PUD;
 		/*
 		 * If the vma become good for khugepaged to scan,
 		 * register it here without waiting a page fault that
@@ -371,6 +377,9 @@ int hugepage_madvise(struct vm_area_struct *vma,
 	case MADV_NOHUGEPAGE:
 		*vm_flags &= ~VM_HUGEPAGE;
 		*vm_flags |= VM_NOHUGEPAGE;
+
+		if (use_pud_page)
+			*vm_flags &= ~VM_HUGEPAGE_PUD;
 		/*
 		 * Setting VM_NOHUGEPAGE will prevent khugepaged from scanning
 		 * this vma even if we leave the mm registered in khugepaged if
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH v2 26/30] mm: thp: add a global knob to enable/disable PUD THPs.
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
                   ` (24 preceding siblings ...)
  2020-09-28 17:54 ` [RFC PATCH v2 25/30] mm: vma: add VM_HUGEPAGE_PUD to vm_flags at bit 37 Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 27/30] mm: thp: make PUD THP size public Zi Yan
                   ` (4 subsequent siblings)
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

Like the existing global PMD THP knob, this allows users to
enable/disable PUD THPs. PUD THP is disabled by default, since a user
should understand its performance tradeoffs, such as longer first-touch
page faults due to larger page zeroing and longer page allocation
latency when memory is fragmented, before enabling it. Experienced users
can enable it to benefit from fewer page faults and TLB misses. The knob
accepts (see the example after this list):

* always means PUD THPs will be allocated on all VMAs if possible.
* madvise means PUD THPs will be allocated if vm_flags has VM_HUGEPAGE_PUD
  set via the madvise syscall using MADV_HUGEPAGE | MADV_HUGEPAGE_1GB.
* never means PUD THPs will not be allocated on any VMA.
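
For example, an administrator can switch modes with:

	echo always  >/sys/kernel/mm/transparent_hugepage/enabled_pud_thp
	echo madvise >/sys/kernel/mm/transparent_hugepage/enabled_pud_thp
	echo never   >/sys/kernel/mm/transparent_hugepage/enabled_pud_thp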

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/huge_mm.h | 14 ++++++++++++++
 mm/huge_memory.c        | 38 ++++++++++++++++++++++++++++++++++++++
 mm/memory.c             |  2 +-
 3 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c7bc40c4a5e2..0d0f9cf25aeb 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -119,6 +119,8 @@ enum transparent_hugepage_flag {
 #ifdef CONFIG_DEBUG_VM
 	TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
 #endif
+	TRANSPARENT_PUD_HUGEPAGE_FLAG,
+	TRANSPARENT_PUD_HUGEPAGE_REQ_MADV_FLAG,
 };
 
 struct kobject;
@@ -184,6 +186,18 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
 }
 
 bool transparent_hugepage_enabled(struct vm_area_struct *vma);
+static inline bool transparent_pud_hugepage_enabled(struct vm_area_struct *vma)
+{
+	if (transparent_hugepage_enabled(vma)) {
+		if (transparent_hugepage_flags & (1 << TRANSPARENT_PUD_HUGEPAGE_FLAG))
+			return true;
+		if (transparent_hugepage_flags &
+					(1 << TRANSPARENT_PUD_HUGEPAGE_REQ_MADV_FLAG))
+			return !!(vma->vm_flags & VM_HUGEPAGE_PUD);
+	}
+
+	return false;
+}
 
 #define HPAGE_CACHE_INDEX_MASK (HPAGE_PMD_NR - 1)
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 61ae7a0ded84..1965753b31a2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -199,6 +199,43 @@ static ssize_t enabled_store(struct kobject *kobj,
 static struct kobj_attribute enabled_attr =
 	__ATTR(enabled, 0644, enabled_show, enabled_store);
 
+static ssize_t enabled_pud_thp_show(struct kobject *kobj,
+			    struct kobj_attribute *attr, char *buf)
+{
+	if (test_bit(TRANSPARENT_PUD_HUGEPAGE_FLAG, &transparent_hugepage_flags))
+		return sprintf(buf, "[always] madvise never\n");
+	else if (test_bit(TRANSPARENT_PUD_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags))
+		return sprintf(buf, "always [madvise] never\n");
+	else
+		return sprintf(buf, "always madvise [never]\n");
+}
+
+static ssize_t enabled_pud_thp_store(struct kobject *kobj,
+			     struct kobj_attribute *attr,
+			     const char *buf, size_t count)
+{
+	ssize_t ret = count;
+
+	if (!memcmp("always", buf,
+		    min(sizeof("always")-1, count))) {
+		clear_bit(TRANSPARENT_PUD_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+		set_bit(TRANSPARENT_PUD_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+	} else if (!memcmp("madvise", buf,
+			   min(sizeof("madvise")-1, count))) {
+		clear_bit(TRANSPARENT_PUD_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+		set_bit(TRANSPARENT_PUD_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+	} else if (!memcmp("never", buf,
+			   min(sizeof("never")-1, count))) {
+		clear_bit(TRANSPARENT_PUD_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_PUD_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+	} else
+		ret = -EINVAL;
+
+	return ret;
+}
+static struct kobj_attribute enabled_pud_thp_attr =
+	__ATTR(enabled_pud_thp, 0644, enabled_pud_thp_show, enabled_pud_thp_store);
+
 ssize_t single_hugepage_flag_show(struct kobject *kobj,
 				struct kobj_attribute *attr, char *buf,
 				enum transparent_hugepage_flag flag)
@@ -305,6 +342,7 @@ static struct kobj_attribute hpage_pmd_size_attr =
 
 static struct attribute *hugepage_attr[] = {
 	&enabled_attr.attr,
+	&enabled_pud_thp_attr.attr,
 	&defrag_attr.attr,
 	&use_zero_page_attr.attr,
 	&hpage_pmd_size_attr.attr,
diff --git a/mm/memory.c b/mm/memory.c
index ab80d13807aa..9f7b509a3aa7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4282,7 +4282,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 	if (!vmf.pud)
 		return VM_FAULT_OOM;
 retry_pud:
-	if (pud_none(*vmf.pud) && __transparent_hugepage_enabled(vma)) {
+	if (pud_none(*vmf.pud) && transparent_pud_hugepage_enabled(vma)) {
 		ret = create_huge_pud(&vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH v2 27/30] mm: thp: make PUD THP size public.
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
                   ` (25 preceding siblings ...)
  2020-09-28 17:54 ` [RFC PATCH v2 26/30] mm: thp: add a global knob to enable/disable PUD THPs Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 28/30] hugetlb: cma: move cma reserve function to cma.c Zi Yan
                   ` (3 subsequent siblings)
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

Users can read the PUD THP size via
`cat /sys/kernel/mm/transparent_hugepage/hpage_pud_size`. This mirrors
how the PMD THP size is made public.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 Documentation/admin-guide/mm/transhuge.rst |  1 +
 mm/huge_memory.c                           | 13 +++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index b2acd0d395ca..11b173c2650e 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -159,6 +159,7 @@ Some userspace (such as a test program, or an optimized memory allocation
 library) may want to know the size (in bytes) of a transparent hugepage::
 
 	cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
+	cat /sys/kernel/mm/transparent_hugepage/hpage_pud_size
 
 khugepaged will be automatically started when
 transparent_hugepage/enabled is set to "always" or "madvise", and it'll
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1965753b31a2..20ecffc27396 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -340,12 +340,25 @@ static ssize_t hpage_pmd_size_show(struct kobject *kobj,
 static struct kobj_attribute hpage_pmd_size_attr =
 	__ATTR_RO(hpage_pmd_size);
 
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+static ssize_t hpage_pud_size_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%lu\n", HPAGE_PUD_SIZE);
+}
+static struct kobj_attribute hpage_pud_size_attr =
+	__ATTR_RO(hpage_pud_size);
+#endif
+
 static struct attribute *hugepage_attr[] = {
 	&enabled_attr.attr,
 	&enabled_pud_thp_attr.attr,
 	&defrag_attr.attr,
 	&use_zero_page_attr.attr,
 	&hpage_pmd_size_attr.attr,
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+	&hpage_pud_size_attr.attr,
+#endif
 #ifdef CONFIG_SHMEM
 	&shmem_enabled_attr.attr,
 #endif
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH v2 28/30] hugetlb: cma: move cma reserve function to cma.c.
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
                   ` (26 preceding siblings ...)
  2020-09-28 17:54 ` [RFC PATCH v2 27/30] mm: thp: make PUD THP size public Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 29/30] mm: thp: use cma reservation for pud thp allocation Zi Yan
                   ` (2 subsequent siblings)
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

Move the CMA reserve function to cma.c so that it can be used by other
allocations, like the 1GB THP allocation in the upcoming commit. A
sketch of the resulting interface follows.
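
After the move, a caller reserves a per-node CMA pool with one call; a
sketch under this patch's interface (my_cma is a hypothetical array, not
part of the patch):

	static struct cma *my_cma[MAX_NUMNODES];

	/* 1GB-order pool on x86_64, sized by the hugepage_cma= parameter */
	cma_reserve(PUD_SHIFT - PAGE_SHIFT, hugepage_cma_size,
		    "hugepage", my_cma);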

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 .../admin-guide/kernel-parameters.txt         |  2 +-
 arch/arm64/mm/hugetlbpage.c                   |  2 +-
 arch/powerpc/mm/hugetlbpage.c                 |  2 +-
 arch/x86/kernel/setup.c                       |  8 +-
 include/linux/cma.h                           | 15 +++
 include/linux/hugetlb.h                       | 12 ---
 mm/cma.c                                      | 88 ++++++++++++++++++
 mm/hugetlb.c                                  | 92 ++-----------------
 8 files changed, 120 insertions(+), 101 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 7fbfc1a3e1e1..3f8f3199f4fc 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1524,7 +1524,7 @@
 	hpet_mmap=	[X86, HPET_MMAP] Allow userspace to mmap HPET
 			registers.  Default set by CONFIG_HPET_MMAP_DEFAULT.
 
-	hugetlb_cma=	[HW] The size of a cma area used for allocation
+	hugepage_cma=	[HW] The size of a cma area used for allocation
 			of gigantic hugepages.
 			Format: nn[KMGTPE]
 
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 55ecf6de9ff7..8a3ad7eaae49 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -52,7 +52,7 @@ void __init arm64_hugetlb_cma_reserve(void)
 	 * breaking this assumption.
 	 */
 	WARN_ON(order <= MAX_ORDER);
-	hugetlb_cma_reserve(order);
+	hugepage_cma_reserve(order);
 }
 #endif /* CONFIG_CMA */
 
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 36c3800769fb..6c1e61251df2 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -713,6 +713,6 @@ void __init gigantic_hugetlb_cma_reserve(void)
 
 	if (order) {
 		VM_WARN_ON(order < MAX_ORDER);
-		hugetlb_cma_reserve(order);
+		hugepage_cma_reserve(order);
 	}
 }
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index ad8f909b5dc8..a732ead4985a 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -16,7 +16,7 @@
 #include <linux/pci.h>
 #include <linux/root_dev.h>
 #include <linux/sfi.h>
-#include <linux/hugetlb.h>
+#include <linux/cma.h>
 #include <linux/tboot.h>
 #include <linux/usb/xhci-dbgp.h>
 #include <linux/static_call.h>
@@ -641,7 +641,7 @@ static void __init trim_snb_memory(void)
 	 * already been reserved.
 	 */
 	memblock_reserve(0, 1<<20);
-	
+
 	for (i = 0; i < ARRAY_SIZE(bad_pages); i++) {
 		if (memblock_reserve(bad_pages[i], PAGE_SIZE))
 			printk(KERN_WARNING "failed to reserve 0x%08lx\n",
@@ -733,7 +733,7 @@ static void __init trim_low_memory_range(void)
 {
 	memblock_reserve(0, ALIGN(reserve_low, PAGE_SIZE));
 }
-	
+
 /*
  * Dump out kernel offset information on panic.
  */
@@ -1144,7 +1144,7 @@ void __init setup_arch(char **cmdline_p)
 	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
 
 	if (boot_cpu_has(X86_FEATURE_GBPAGES))
-		hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
+		hugepage_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
 
 	/*
 	 * Reserve memory for crash kernel after SRAT is parsed so that it
diff --git a/include/linux/cma.h b/include/linux/cma.h
index 217999c8a762..9989d580c2a7 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -49,4 +49,19 @@ extern struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
 extern bool cma_release(struct cma *cma, const struct page *pages, unsigned int count);
 
 extern int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data);
+
+extern void cma_reserve(int min_order, unsigned long requested_size,
+			const char *name, struct cma *cma_struct[MAX_NUMNODES]);
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+extern void __init hugepage_cma_reserve(int order);
+extern void __init hugepage_cma_check(void);
+#else
+static inline void __init hugepage_cma_check(void)
+{
+}
+static inline void __init hugepage_cma_reserve(int order)
+{
+}
+#endif
+
 #endif
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d5cc5f802dd4..087d13a1dc24 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -935,16 +935,4 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
 	return ptl;
 }
 
-#if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
-extern void __init hugetlb_cma_reserve(int order);
-extern void __init hugetlb_cma_check(void);
-#else
-static inline __init void hugetlb_cma_reserve(int order)
-{
-}
-static inline __init void hugetlb_cma_check(void)
-{
-}
-#endif
-
 #endif /* _LINUX_HUGETLB_H */
diff --git a/mm/cma.c b/mm/cma.c
index 7f415d7cda9f..1a9d997fa5ab 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -38,6 +38,10 @@
 
 struct cma cma_areas[MAX_CMA_AREAS];
 unsigned cma_area_count;
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+struct cma *hugepage_cma[MAX_NUMNODES];
+#endif
+unsigned long hugepage_cma_size __initdata;
 static DEFINE_MUTEX(cma_mutex);
 
 phys_addr_t cma_get_base(const struct cma *cma)
@@ -541,3 +545,87 @@ int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data)
 
 	return 0;
 }
+
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+/*
+ * cma_reserve() - reserve CMA for gigantic pages on nodes with memory
+ *
+ * Must be called after free_area_init(), which updates the node states via
+ * node_set_state(); cma_reserve() scans the online nodes and hence expects
+ * the platform to have initialized its node states.
+ */
+void __init cma_reserve(int min_order, unsigned long requested_size, const char *name,
+		 struct cma *cma_struct[MAX_NUMNODES])
+{
+	unsigned long size, reserved, per_node;
+	int nid;
+
+	if (!requested_size)
+		return;
+
+	if (requested_size < (PAGE_SIZE << min_order)) {
+		pr_warn("%s_cma: cma area should be at least %lu MiB\n",
+			name, (PAGE_SIZE << min_order) / SZ_1M);
+		return;
+	}
+
+	/*
+	 * If 3 GB area is requested on a machine with 4 numa nodes,
+	 * let's allocate 1 GB on first three nodes and ignore the last one.
+	 */
+	per_node = DIV_ROUND_UP(requested_size, nr_online_nodes);
+	pr_info("%s_cma: reserve %lu MiB, up to %lu MiB per node\n",
+		name, requested_size / SZ_1M, per_node / SZ_1M);
+
+	reserved = 0;
+	for_each_node_state(nid, N_ONLINE) {
+		int res;
+		char node_name[CMA_MAX_NAME];
+
+		size = min(per_node, requested_size - reserved);
+		size = round_up(size, PAGE_SIZE << min_order);
+
+		snprintf(node_name, sizeof(node_name), "%s%d", name, nid);
+		res = cma_declare_contiguous_nid(0, size, 0,
+						 PAGE_SIZE << min_order,
+						 0, false, node_name,
+						 &cma_struct[nid], nid);
+		if (res) {
+			pr_warn("%s_cma: reservation failed: err %d, node %d",
+				name, res, nid);
+			continue;
+		}
+
+		reserved += size;
+		pr_info("%s_cma: reserved %lu MiB on node %d\n",
+			name, size / SZ_1M, nid);
+
+		if (reserved >= requested_size)
+			break;
+	}
+}
+
+static bool hugepage_cma_reserve_called __initdata;
+
+static int __init cmdline_parse_hugepage_cma(char *p)
+{
+	hugepage_cma_size = memparse(p, &p);
+	return 0;
+}
+
+early_param("hugepage_cma", cmdline_parse_hugepage_cma);
+
+void __init hugepage_cma_reserve(int order)
+{
+	hugepage_cma_reserve_called = true;
+	cma_reserve(order, hugepage_cma_size, "hugepage", hugepage_cma);
+}
+
+void __init hugepage_cma_check(void)
+{
+	if (!hugepage_cma_size || hugepage_cma_reserve_called)
+		return;
+
+	pr_warn("hugepage_cma: the option isn't supported by current arch\n");
+}
+#endif
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 25674d7b1e5f..871f1c315c48 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -48,9 +48,9 @@ unsigned int default_hstate_idx;
 struct hstate hstates[HUGE_MAX_HSTATE];
 
 #ifdef CONFIG_CMA
-static struct cma *hugetlb_cma[MAX_NUMNODES];
+extern struct cma *hugepage_cma[MAX_NUMNODES];
 #endif
-static unsigned long hugetlb_cma_size __initdata;
+extern unsigned long hugepage_cma_size __initdata;
 
 /*
  * Minimum page order among possible hugepage sizes, set to a proper value
@@ -1227,7 +1227,7 @@ static void free_gigantic_page(struct page *page, unsigned int order)
 	 * cma_release() returns false.
 	 */
 #ifdef CONFIG_CMA
-	if (cma_release(hugetlb_cma[page_to_nid(page)], page, 1 << order))
+	if (cma_release(hugepage_cma[page_to_nid(page)], page, 1 << order))
 		return;
 #endif
 
@@ -1247,8 +1247,8 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 		struct page *page;
 		int node;
 
-		if (hugetlb_cma[nid]) {
-			page = cma_alloc(hugetlb_cma[nid], nr_pages,
+		if (hugepage_cma[nid]) {
+			page = cma_alloc(hugepage_cma[nid], nr_pages,
 					huge_page_order(h), true);
 			if (page)
 				return page;
@@ -1256,10 +1256,10 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 
 		if (!(gfp_mask & __GFP_THISNODE)) {
 			for_each_node_mask(node, *nodemask) {
-				if (node == nid || !hugetlb_cma[node])
+				if (node == nid || !hugepage_cma[node])
 					continue;
 
-				page = cma_alloc(hugetlb_cma[node], nr_pages,
+				page = cma_alloc(hugepage_cma[node], nr_pages,
 						huge_page_order(h), true);
 				if (page)
 					return page;
@@ -2554,8 +2554,8 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 
 	for (i = 0; i < h->max_huge_pages; ++i) {
 		if (hstate_is_gigantic(h)) {
-			if (hugetlb_cma_size) {
-				pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n");
+			if (hugepage_cma_size) {
+				pr_warn_once("HugeTLB: hugepage_cma is enabled, skip boot time allocation\n");
 				break;
 			}
 			if (!alloc_bootmem_huge_page(h))
@@ -3231,7 +3231,7 @@ static int __init hugetlb_init(void)
 		}
 	}
 
-	hugetlb_cma_check();
+	hugepage_cma_check();
 	hugetlb_init_hstates();
 	gather_bootmem_prealloc();
 	report_hugepages();
@@ -5665,75 +5665,3 @@ void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
 		spin_unlock(&hugetlb_lock);
 	}
 }
-
-#ifdef CONFIG_CMA
-static bool cma_reserve_called __initdata;
-
-static int __init cmdline_parse_hugetlb_cma(char *p)
-{
-	hugetlb_cma_size = memparse(p, &p);
-	return 0;
-}
-
-early_param("hugetlb_cma", cmdline_parse_hugetlb_cma);
-
-void __init hugetlb_cma_reserve(int order)
-{
-	unsigned long size, reserved, per_node;
-	int nid;
-
-	cma_reserve_called = true;
-
-	if (!hugetlb_cma_size)
-		return;
-
-	if (hugetlb_cma_size < (PAGE_SIZE << order)) {
-		pr_warn("hugetlb_cma: cma area should be at least %lu MiB\n",
-			(PAGE_SIZE << order) / SZ_1M);
-		return;
-	}
-
-	/*
-	 * If 3 GB area is requested on a machine with 4 numa nodes,
-	 * let's allocate 1 GB on first three nodes and ignore the last one.
-	 */
-	per_node = DIV_ROUND_UP(hugetlb_cma_size, nr_online_nodes);
-	pr_info("hugetlb_cma: reserve %lu MiB, up to %lu MiB per node\n",
-		hugetlb_cma_size / SZ_1M, per_node / SZ_1M);
-
-	reserved = 0;
-	for_each_node_state(nid, N_ONLINE) {
-		int res;
-		char name[CMA_MAX_NAME];
-
-		size = min(per_node, hugetlb_cma_size - reserved);
-		size = round_up(size, PAGE_SIZE << order);
-
-		snprintf(name, sizeof(name), "hugetlb%d", nid);
-		res = cma_declare_contiguous_nid(0, size, 0, PAGE_SIZE << order,
-						 0, false, name,
-						 &hugetlb_cma[nid], nid);
-		if (res) {
-			pr_warn("hugetlb_cma: reservation failed: err %d, node %d",
-				res, nid);
-			continue;
-		}
-
-		reserved += size;
-		pr_info("hugetlb_cma: reserved %lu MiB on node %d\n",
-			size / SZ_1M, nid);
-
-		if (reserved >= hugetlb_cma_size)
-			break;
-	}
-}
-
-void __init hugetlb_cma_check(void)
-{
-	if (!hugetlb_cma_size || cma_reserve_called)
-		return;
-
-	pr_warn("hugetlb_cma: the option isn't supported by current arch\n");
-}
-
-#endif /* CONFIG_CMA */
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH v2 29/30] mm: thp: use cma reservation for pud thp allocation.
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
                   ` (27 preceding siblings ...)
  2020-09-28 17:54 ` [RFC PATCH v2 28/30] hugetlb: cma: move cma reserve function to cma.c Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-28 17:54 ` [RFC PATCH v2 30/30] mm: thp: enable anonymous PUD THP at page fault path Zi Yan
  2020-09-30 11:55 ` [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Michal Hocko
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

Share the hugepage_cma reservation with hugetlb for PUD THP allocation.
The reserved CMA regions can still be used for movable page allocations.

During a 1GB page split, all subpages are cleared from the CMA bitmap,
since they are no longer part of a 1GB page and will be freed via the
normal path instead of cma_release().

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/cma.h     |  3 +++
 include/linux/huge_mm.h | 10 ++++++++++
 mm/cma.c                | 31 +++++++++++++++++++++++++++++++
 mm/huge_memory.c        | 34 ++++++++++++++++++++++++++++++++++
 mm/hugetlb.c            | 21 +--------------------
 mm/mempolicy.c          | 14 +++++++++++++-
 mm/page_alloc.c         | 29 +++++++++++++++++++++++++++++
 7 files changed, 121 insertions(+), 21 deletions(-)

diff --git a/include/linux/cma.h b/include/linux/cma.h
index 9989d580c2a7..c299b62b3a7a 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -48,6 +48,9 @@ extern struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
 			      bool no_warn);
 extern bool cma_release(struct cma *cma, const struct page *pages, unsigned int count);
 
+extern bool cma_clear_bitmap_if_in_range(struct cma *cma, const struct page *page,
+					unsigned int count);
+
 extern int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data);
 
 extern void cma_reserve(int min_order, unsigned long requested_size,
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 0d0f9cf25aeb..163b244d9acd 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -24,6 +24,8 @@ extern struct page *follow_trans_huge_pud(struct vm_area_struct *vma,
 					  unsigned long addr,
 					  pud_t *pud,
 					  unsigned int flags);
+extern struct page *alloc_thp_pud_page(int nid);
+extern bool free_thp_pud_page(struct page *page, int order);
 #else
 static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
 {
@@ -43,6 +45,14 @@ struct page *follow_trans_huge_pud(struct vm_area_struct *vma,
 {
 	return NULL;
 }
+static inline struct page *alloc_thp_pud_page(int nid)
+{
+	return NULL;
+}
+static inline bool free_thp_pud_page(struct page *page, int order)
+{
+	return false;
+}
 #endif
 
 extern vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd);
diff --git a/mm/cma.c b/mm/cma.c
index 1a9d997fa5ab..c595aad61f58 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -532,6 +532,37 @@ bool cma_release(struct cma *cma, const struct page *pages, unsigned int count)
 	return true;
 }
 
+/**
+ * cma_clear_bitmap_if_in_range() - clear the bitmap for the given pages
+ * @cma:   Contiguous memory region for which the allocation was performed.
+ * @pages: Allocated pages.
+ * @count: Number of allocated pages.
+ *
+ * This function clears the bitmap of memory allocated by cma_alloc().
+ * It returns false when the provided pages do not belong to the contiguous
+ * area and true otherwise.
+ */
+bool cma_clear_bitmap_if_in_range(struct cma *cma, const struct page *pages,
+				  unsigned int count)
+{
+	unsigned long pfn;
+
+	if (!cma || !pages)
+		return false;
+
+	pfn = page_to_pfn(pages);
+
+	if (pfn < cma->base_pfn || pfn >= cma->base_pfn + cma->count)
+		return false;
+
+	if (pfn + count > cma->base_pfn + cma->count)
+		return false;
+
+	cma_clear_bitmap(cma, pfn, count);
+
+	return true;
+}
+
 int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data)
 {
 	int i;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 20ecffc27396..910e51f35910 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -33,6 +33,7 @@
 #include <linux/oom.h>
 #include <linux/numa.h>
 #include <linux/page_owner.h>
+#include <linux/cma.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -62,6 +63,10 @@ static struct shrinker deferred_split_shrinker;
 static atomic_t huge_zero_refcount;
 struct page *huge_zero_page __read_mostly;
 
+/* boot-time reserved CMA regions, shared with hugetlb; defined in mm/cma.c */
+extern struct cma *hugepage_cma[MAX_NUMNODES];
+
 bool transparent_hugepage_enabled(struct vm_area_struct *vma)
 {
 	/* The addr is used to check if the vma size fits */
@@ -2498,6 +2503,17 @@ static void __split_huge_pud_page(struct page *page, struct list_head *list,
 	/* no file-back page support yet */
 	VM_BUG_ON(!PageAnon(page));
 
+	/*
+	 * Clear the CMA bitmap when we split a PUD page, so that the
+	 * subpages can be freed as normal pages.
+	 */
+	if (IS_ENABLED(CONFIG_CMA)) {
+		struct cma *cma = hugepage_cma[page_to_nid(head)];
+		/*
+		 * Clear the bitmap outside of VM_BUG_ON(): when
+		 * CONFIG_DEBUG_VM is off, VM_BUG_ON() does not evaluate
+		 * its argument, so the clearing must not live inside it.
+		 */
+		bool cleared = cma_clear_bitmap_if_in_range(cma, head,
+							    thp_nr_pages(head));
+
+		VM_BUG_ON(!cleared);
+	}
+
 	for (i = HPAGE_PUD_NR - HPAGE_PMD_NR; i >= 1; i -= HPAGE_PMD_NR)
 		__split_huge_pud_page_tail(head, i, lruvec, list);
 
@@ -3732,3 +3748,21 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 	update_mmu_cache_pmd(vma, address, pvmw->pmd);
 }
 #endif
+
+struct page *alloc_thp_pud_page(int nid)
+{
+	struct page *page = NULL;
+#ifdef CONFIG_CMA
+	page = cma_alloc(hugepage_cma[nid], HPAGE_PUD_NR, HPAGE_PUD_ORDER, true);
+#endif
+	return page;
+}
+
+bool free_thp_pud_page(struct page *page, int order)
+{
+	bool ret = false;
+#ifdef CONFIG_CMA
+	ret = cma_release(hugepage_cma[page_to_nid(page)], page, 1<<order);
+#endif
+	return ret;
+}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 871f1c315c48..0282110c72b5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1200,26 +1200,7 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
 		nr_nodes--)
 
 #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
-static void destroy_compound_gigantic_page(struct page *page,
-					unsigned int order)
-{
-	int i;
-	int nr_pages = 1 << order;
-	struct page *p = page + 1;
-
-	atomic_set(compound_mapcount_ptr(page), 0);
-	if (hpage_pincount_available(page))
-		atomic_set(compound_pincount_ptr(page), 0);
-
-	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
-		clear_compound_head(p);
-		set_page_refcounted(p);
-	}
-
-	set_compound_order(page, 0);
-	__ClearPageHead(page);
-}
-
+extern void destroy_compound_gigantic_page(struct page *page, unsigned int order);
 static void free_gigantic_page(struct page *page, unsigned int order)
 {
 	/*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 731a7710395f..dc3d6371195f 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2143,7 +2143,12 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
 {
 	struct page *page;
 
-	page = __alloc_pages(gfp, order, nid);
+	if (order == HPAGE_PUD_ORDER) {
+		page = alloc_thp_pud_page(nid);
+		if (page && (gfp & __GFP_COMP))
+			prep_compound_page(page, order);
+	} else {
+		page = __alloc_pages(gfp, order, nid);
+	}
 	/* skip NUMA_INTERLEAVE_HIT counter update if numa stats is disabled */
 	if (!static_branch_likely(&vm_numa_stat_key))
 		return page;
@@ -2217,6 +2222,13 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		nmask = policy_nodemask(gfp, pol);
 		if (!nmask || node_isset(hpage_node, *nmask)) {
 			mpol_cond_put(pol);
+
+			if (order == HPAGE_PUD_ORDER) {
+				page = alloc_thp_pud_page(hpage_node);
+				if (page && (gfp & __GFP_COMP))
+					prep_compound_page(page, order);
+				goto out;
+			}
 			/*
 			 * First, try to allocate THP only on local node, but
 			 * don't reclaim unnecessarily, just compact.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6bdb38a8fb48..5251ecb30465 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1481,6 +1481,25 @@ void __meminit reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
 	}
 }
 
+void destroy_compound_gigantic_page(struct page *page, unsigned int order)
+{
+	int i;
+	int nr_pages = 1 << order;
+	struct page *p = page + 1;
+
+	atomic_set(compound_mapcount_ptr(page), 0);
+	if (hpage_pincount_available(page))
+		atomic_set(compound_pincount_ptr(page), 0);
+
+	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
+		clear_compound_head(p);
+		set_page_refcounted(p);
+	}
+
+	set_compound_order(page, 0);
+	__ClearPageHead(page);
+}
+
 static void __free_pages_ok(struct page *page, unsigned int order)
 {
 	unsigned long flags;
@@ -1490,6 +1509,16 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	if (!free_pages_prepare(page, order, true))
 		return;
 
+	if (order == HPAGE_PUD_ORDER) {
+		bool thp_pud_page_freed = false;
+
+		destroy_compound_gigantic_page(page, order);
+		set_page_refcounted(page);
+		thp_pud_page_freed = free_thp_pud_page(page, order);
+		VM_BUG_ON_PAGE(!thp_pud_page_freed, page);
+		return;
+	}
+
 	migratetype = get_pfnblock_migratetype(page, pfn);
 	local_irq_save(flags);
 	__count_vm_events(PGFREE, 1 << order);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH v2 30/30] mm: thp: enable anonymous PUD THP at page fault path.
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
                   ` (28 preceding siblings ...)
  2020-09-28 17:54 ` [RFC PATCH v2 29/30] mm: thp: use cma reservation for pud thp allocation Zi Yan
@ 2020-09-28 17:54 ` Zi Yan
  2020-09-30 11:55 ` [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Michal Hocko
  30 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

All previous commits have made anonymous PUD THP support ready, so we
can enable the anonymous PUD THP page fault path now.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/memory.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 9f7b509a3aa7..dc285d9872fc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4122,16 +4122,15 @@ static vm_fault_t create_huge_pud(struct vm_fault *vmf)
 {
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) &&			\
 	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
-	/* No support for anonymous transparent PUD pages yet */
 	if (vma_is_anonymous(vmf->vma))
-		goto split;
+		return do_huge_pud_anonymous_page(vmf);
 	if (vmf->vma->vm_ops->huge_fault) {
 		vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD);
 
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
 	}
-split:
+
 	/* COW or write-notify not handled on PUD level: split pud.*/
 	__split_huge_pud(vmf->vma, vmf->pud, vmf->address, false, NULL);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH v2 03/30] mm: thp: use single linked list for THP page table page deposit.
  2020-09-28 17:54 ` [RFC PATCH v2 03/30] mm: thp: use single linked list for THP page table page deposit Zi Yan
@ 2020-09-28 19:34   ` Matthew Wilcox
  2020-09-28 20:34     ` Zi Yan
  0 siblings, 1 reply; 56+ messages in thread
From: Matthew Wilcox @ 2020-09-28 19:34 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Shakeel Butt, Yang Shi, Jason Gunthorpe, Mike Kravetz,
	Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel

On Mon, Sep 28, 2020 at 01:54:01PM -0400, Zi Yan wrote:
>  		struct {	/* Page table pages */
> -			unsigned long _pt_pad_1;	/* compound_head */
> -			pgtable_t pmd_huge_pte; /* protected by page->ptl */
> +			struct llist_head deposit_head; /* pgtable deposit list head */
> +			struct llist_node deposit_node; /* pgtable deposit list node */

If you're going to use two pointers anyway, you might as well use a
list_head.  But I don't think you need to; you could either use a union
of these or you could use the page_address() of the page to store as
much information as you like!
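
For concreteness, the page_address() variant could be as small as this
(untested sketch; the struct and helper names are made up):

	/*
	 * Keep the deposit links in the deposited page's own memory;
	 * the payload of a page table page is not otherwise used while
	 * it sits on a deposit list.
	 */
	struct pgtable_deposit {
		struct llist_head head;	/* lower-level pgtable pages */
		struct llist_node node;	/* membership in the parent list */
	};

	static inline struct pgtable_deposit *deposit_of(struct page *page)
	{
		return (struct pgtable_deposit *)page_address(page);
	}

That way struct page does not grow at all.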


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH v2 03/30] mm: thp: use single linked list for THP page table page deposit.
  2020-09-28 19:34   ` Matthew Wilcox
@ 2020-09-28 20:34     ` Zi Yan
  0 siblings, 0 replies; 56+ messages in thread
From: Zi Yan @ 2020-09-28 20:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Shakeel Butt, Yang Shi, Jason Gunthorpe, Mike Kravetz,
	Michal Hocko, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2100 bytes --]

On 28 Sep 2020, at 15:34, Matthew Wilcox wrote:

> On Mon, Sep 28, 2020 at 01:54:01PM -0400, Zi Yan wrote:
>>  		struct {	/* Page table pages */
>> -			unsigned long _pt_pad_1;	/* compound_head */
>> -			pgtable_t pmd_huge_pte; /* protected by page->ptl */
>> +			struct llist_head deposit_head; /* pgtable deposit list head */
>> +			struct llist_node deposit_node; /* pgtable deposit list node */
>
> If you're going to use two pointers anyway, you might as well use a
> list_head.  But I don't think you need to; you could either use a union
> of these or you could use the page_address() of the page to store as
> much information as you like!

This is intended for depositing pgtable pages hierarchically. PUD THP
pgtable page deposit uses it. For a PUD THP, we need to deposit 1 PMD
pgtable page and 512 PTE pgtable pages, totally 513 pages.

One way is to deposit all of them on a single list, but when we split the PUD
THP, we would need to pull them all out, use one as the PMD pgtable page,
and deposit the remaining 512 PTE pgtable pages to the PMD page’s pmd_huge_pte.
But this mixes PMD pgtable pages and PTE pgtable pages in one list,
which can be error prone and also requires extra pgtable page deposit
operations during page split.

This approach, at a high level, makes a pgtable page’s deposit_head
point to a list of lower-level pgtable pages, which are linked using
deposit_node. For example, we link all 512 PTE pgtable pages using
deposit_node and use the PMD pgtable page’s deposit_head to point to the
PTE page list. In addition, when we deposit the PMD pgtable page,
we just point a struct llist_head to the PMD pgtable page’s deposit_node.
When it comes to PUD THP split, we can simply withdraw and use the PMD
pgtable page without additional operations, since PTE pgtable pages
have already been deposited at the beginning.
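
In code, the deposit and withdraw operations are small. A sketch against
this patch (the helper names here are made up, not in the patchset):

	/* deposit one PTE pgtable page under a PMD pgtable page */
	static void deposit_pte_pgtable(struct page *pmd_pgtable,
					struct page *pte_pgtable)
	{
		llist_add(&pte_pgtable->deposit_node,
			  &pmd_pgtable->deposit_head);
	}

	/*
	 * At PUD THP split time, withdraw the PMD pgtable page from the
	 * PUD-level deposit list; its 512 PTE pgtable pages stay attached
	 * through its own deposit_head.
	 */
	static struct page *withdraw_pmd_pgtable(struct llist_head *pud_deposit)
	{
		struct llist_node *node = llist_del_first(pud_deposit);

		return node ? llist_entry(node, struct page, deposit_node)
			    : NULL;
	}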

Let me know if it makes sense to you. I will add the paragraphs above
to the commit message. Swapping patches 4 and 5 might also make the change
easier to understand, since patch 5 uses this patch.


—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
  2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
                   ` (29 preceding siblings ...)
  2020-09-28 17:54 ` [RFC PATCH v2 30/30] mm: thp: enable anonymous PUD THP at page fault path Zi Yan
@ 2020-09-30 11:55 ` Michal Hocko
  2020-10-01 15:14   ` Zi Yan
  30 siblings, 1 reply; 56+ messages in thread
From: Michal Hocko @ 2020-09-30 11:55 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel

On Mon 28-09-20 13:53:58, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Hi all,
> 
> This patchset adds support for 1GB PUD THP on x86_64. It is on top of
> v5.9-rc5-mmots-2020-09-18-21-23. It is also available at:
> https://github.com/x-y-z/linux-1gb-thp/tree/1gb_thp_v5.9-rc5-mmots-2020-09-18-21-23
> 
> Other than PUD THP, we had some discussion on generating THPs and contiguous
> physical memory via a synchronous system call [0]. I am planning to send out a
> separate patchset on it later, since I feel that it can be done independently of
> PUD THP support.

While the technical challenges for the kernel implementation can be
discussed before the user API is decided, I believe we cannot simply add
something now and then decide about a proper interface. I have raised a
few basic questions we should find answers for before any
interface is added. Let me copy them here for easier reference:
- THP allocation time - #PF and/or madvise context
- lazy/sync instantiation
- huge page sizes controllable by the userspace?
- aggressiveness - how hard to try
- internal fragmentation - allow creating THPs on sparsely populated or
  unpopulated ranges
- do we need some sort of access control or privilege check as some THPs
  would be really scarce (like those that require pre-reservation).
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
  2020-09-30 11:55 ` [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Michal Hocko
@ 2020-10-01 15:14   ` Zi Yan
  2020-10-02  7:32     ` Michal Hocko
  0 siblings, 1 reply; 56+ messages in thread
From: Zi Yan @ 2020-10-01 15:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2998 bytes --]

On 30 Sep 2020, at 7:55, Michal Hocko wrote:

> On Mon 28-09-20 13:53:58, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> Hi all,
>>
>> This patchset adds support for 1GB PUD THP on x86_64. It is on top of
>> v5.9-rc5-mmots-2020-09-18-21-23. It is also available at:
>> https://github.com/x-y-z/linux-1gb-thp/tree/1gb_thp_v5.9-rc5-mmots-2020-09-18-21-23
>>
>> Other than PUD THP, we had some discussion on generating THPs and contiguous
>> physical memory via a synchronous system call [0]. I am planning to send out a
>> separate patchset on it later, since I feel that it can be done independently of
>> PUD THP support.
>
> While the technical challenges for the kernel implementation can be
> discussed before the user API is decided, I believe we cannot simply add
> something now and then decide about a proper interface. I have raised a
> few basic questions we should find answers for before any
> interface is added. Let me copy them here for easier reference:
Sure. Thank you for doing this.

For this new interface, I think it should generate THPs out of populated
memory regions synchronously. It would be a complement to khugepaged, which
generates THPs asynchronously in the background.

> - THP allocation time - #PF and/or madvise context
I am not sure this is relevant, since the new interface is supposed to
operate on populated memory regions. For THP allocation, madvise and
the options from /sys/kernel/mm/transparent_hugepage/defrag should give
enough choices to users.

> - lazy/sync instantiation

I would say the new interface only does sync instantiation. madvise has
provided the lazy instantiation option by adding MADV_HUGEPAGE to populated
memory regions and letting khugepaged generate THPs from them.

> - huge page sizes controllable by the userspace?

It might be good to allow advanced users to choose the page sizes, so they
have better control of their applications. For normal users, we can provide
best-effort service. Different options can be provided for these two cases.
The new interface might want to inform the user how many THPs were generated
after the call, so they can decide what to do with the memory region.

> - aggressiveness - how hard to try

The new interface would try as hard as it can, since I assume users really
want THPs when they use this interface.

> - internal fragmentation - allow creating THPs on sparsely populated or
>   unpopulated ranges

The new interface would only operate on populated memory regions. A
MAP_POPULATE-like option can be added if necessary.


> - do we need some sort of access control or privilege check as some THPs
>   would be really scarce (like those that require pre-reservation).

It seems too much to me. I suppose if we provide page size options to users
when generating THPs, user apps could coordinate among themselves. BTW, do we have
access control for hugetlb pages? If yes, we could borrow their method.
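
Putting the answers above together, a hypothetical userspace flow could
look like the following. hugepage_collapse() and its arguments are just
placeholders for the not-yet-designed system call, not something in this
patchset:

	#include <sys/mman.h>
	#include <string.h>

	#define LEN	(4UL << 30)	/* a 4GB anonymous region */

	/*
	 * Placeholder for the synchronous interface: operate only on the
	 * populated parts of [addr, addr + len), try as hard as possible,
	 * and return the number of THPs generated (or -1 on error).
	 */
	long hugepage_collapse(void *addr, size_t len, int order_hint);

	int main(void)
	{
		void *buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED)
			return 1;
		memset(buf, 0, LEN);	/* populate the region first */

		/* ask for 1GB THPs; the kernel may create fewer or none */
		return hugepage_collapse(buf, LEN, 30 /* log2(1GB) */) < 0;
	}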


—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
  2020-10-01 15:14   ` Zi Yan
@ 2020-10-02  7:32     ` Michal Hocko
  2020-10-02  7:50       ` David Hildenbrand
  0 siblings, 1 reply; 56+ messages in thread
From: Michal Hocko @ 2020-10-02  7:32 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, David Hildenbrand, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel

On Thu 01-10-20 11:14:14, Zi Yan wrote:
> On 30 Sep 2020, at 7:55, Michal Hocko wrote:
> 
> > On Mon 28-09-20 13:53:58, Zi Yan wrote:
> >> From: Zi Yan <ziy@nvidia.com>
> >>
> >> Hi all,
> >>
> >> This patchset adds support for 1GB PUD THP on x86_64. It is on top of
> >> v5.9-rc5-mmots-2020-09-18-21-23. It is also available at:
> >> https://github.com/x-y-z/linux-1gb-thp/tree/1gb_thp_v5.9-rc5-mmots-2020-09-18-21-23
> >>
> >> Other than PUD THP, we had some discussion on generating THPs and contiguous
> >> physical memory via a synchronous system call [0]. I am planning to send out a
> >> separate patchset on it later, since I feel that it can be done independently of
> >> PUD THP support.
> >
> > While the technical challenges for the kernel implementation can be
> > discussed before the user API is decided, I believe we cannot simply add
> > something now and then decide about a proper interface. I have raised a
> > few basic questions we should find answers for before any
> > interface is added. Let me copy them here for easier reference:
> Sure. Thank you for doing this.
> 
> For this new interface, I think it should generate THPs out of populated
> memory regions synchronously. It would be a complement to khugepaged, which
> generates THPs asynchronously in the background.
> 
> > - THP allocation time - #PF and/or madvise context
> I am not sure this is relevant, since the new interface is supposed to
> operate on populated memory regions. For THP allocation, madvise and
> the options from /sys/kernel/mm/transparent_hugepage/defrag should give
> enough choices to users.

OK, so no #PF, this makes things easier.

> > - lazy/sync instantiation
> 
> I would say the new interface only does sync instantiation. madvise has
> provided the lazy instantiation option by adding MADV_HUGEPAGE to populated
> memory regions and letting khugepaged generate THPs from them.

OK

> > - huge page sizes controllable by the userspace?
> 
> It might be good to allow advanced users to choose the page sizes, so they
> have better control of their applications.

Could you elaborate more? Those advanced users can use hugetlb, right?
They get a very good control over page size and pool preallocation etc.
So they can get what they need - assuming there is enough memory.

> For normal users, we can provide
> best-effort service. Different options can be provided for these two cases.

Do we really need two sync mechanisms to compact physical memory? This
adds API complexity because it has to cover all possible huge pages
and that can be a large set of sizes. We already have that choice for
the hugetlb mmap interface, but that is needed to cover all existing setups.
I would argue this doesn't make the API particularly easy to use.

> The new interface might want to inform the user how many THPs were generated
> after the call, so they can decide what to do with the memory region.

Why would that be useful? /proc/<pid>/smaps should give a good picture
already, right?

> > - aggressiveness - how hard to try
> 
> The new interface would try as hard as it can, since I assume users really
> want THPs when they use this interface.
> 
> > - internal fragmentation - allow creating THPs on sparsely populated or
> >   unpopulated ranges
> 
> The new interface would only operate on populated memory regions. A
> MAP_POPULATE-like option can be added if necessary.

OK, so initially you do not want to populate more memory. How do you
envision a future extension to provide such functionality? A different
API, or a modification to the existing one?

> > - do we need some sort of access control or privilege check as some THPs
> >   would be really scarce (like those that require pre-reservation).
> 
> It seems too much to me. I suppose if we provide page size options to users
> when generating THPs, user apps could coordinate among themselves. BTW, do we have
> access control for hugetlb pages? If yes, we could borrow their method.

We do not. Well, there is a hugetlb cgroup controller but I am not sure
this is the right method. The lack of hugetlb access control is a serious
shortcoming which has turned this interface into an "only first class
citizens" feature requiring very close coordination with an admin.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
  2020-10-02  7:32     ` Michal Hocko
@ 2020-10-02  7:50       ` David Hildenbrand
  2020-10-02  8:10         ` Michal Hocko
  2020-10-05 15:34         ` Zi Yan
  0 siblings, 2 replies; 56+ messages in thread
From: David Hildenbrand @ 2020-10-02  7:50 UTC (permalink / raw)
  To: Michal Hocko, Zi Yan
  Cc: linux-mm, Kirill A . Shutemov, Roman Gushchin, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, William Kucharski, Andrea Arcangeli, John Hubbard,
	David Nellans, linux-kernel

>>> - huge page sizes controllable by the userspace?
>>
>> It might be good to allow advanced users to choose the page sizes, so they
>> have better control of their applications.
> 
> Could you elaborate more? Those advanced users can use hugetlb, right?
> They get a very good control over page size and pool preallocation etc.
> So they can get what they need - assuming there is enough memory.
> 

I am still not convinced that 1G THP (TGP :) ) are really what we want
to support. I can understand that there are some use cases that might
benefit from it, especially:

"I want a lot of memory, give me memory in any granularity you have, I
absolutely don't care - but of course, more TGP might be good for
performance." Say, you want a 5GB region, but only have a single 1GB
hugepage lying around. hugetlbfs allocation will fail.


But then, do we really want to optimize for such (very special?) use
cases via " 58 files changed, 2396 insertions(+), 460 deletions(-)" ?

I think gigantic pages are a scarce resource. Only selected applications
*really* depend on them and benefit from them. Let these special
applications handle it explicitly.

Can we have a summary of use cases that would really benefit from this
change?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
  2020-10-02  7:50       ` David Hildenbrand
@ 2020-10-02  8:10         ` Michal Hocko
  2020-10-02  8:30           ` David Hildenbrand
  2020-10-05 15:34         ` Zi Yan
  1 sibling, 1 reply; 56+ messages in thread
From: Michal Hocko @ 2020-10-02  8:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Zi Yan, linux-mm, Kirill A . Shutemov, Roman Gushchin,
	Rik van Riel, Matthew Wilcox, Shakeel Butt, Yang Shi,
	Jason Gunthorpe, Mike Kravetz, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel

On Fri 02-10-20 09:50:02, David Hildenbrand wrote:
> >>> - huge page sizes controllable by the userspace?
> >>
> >> It might be good to allow advanced users to choose the page sizes, so they
> >> have better control of their applications.
> > 
> > Could you elaborate more? Those advanced users can use hugetlb, right?
> > They get a very good control over page size and pool preallocation etc.
> > So they can get what they need - assuming there is enough memory.
> > 
> 
> I am still not convinced that 1G THP (TGP :) ) are really what we want
> to support. I can understand that there are some use cases that might
> benefit from it, especially:

Well, I would say that internal support for larger huge pages (e.g. 1GB)
that can transparently split under memory pressure is a useful
functionality. I cannot really judge how complex that would be
considering that 2MB THPs have turned out to be quite a pain but the
situation has settled over time. Maybe our current code base is prepared
for that much better.

Exposing that interface to the userspace is a different story of course.
I do agree that we likely do not want to be very explicit about that.
E.g. an interface for address space defragmentation without any more
specifics sounds like a useful feature to me. It will be up to the
kernel to decide which huge pages to use.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
  2020-10-02  8:10         ` Michal Hocko
@ 2020-10-02  8:30           ` David Hildenbrand
  2020-10-05 15:03             ` Zi Yan
  0 siblings, 1 reply; 56+ messages in thread
From: David Hildenbrand @ 2020-10-02  8:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Zi Yan, linux-mm, Kirill A . Shutemov, Roman Gushchin,
	Rik van Riel, Matthew Wilcox, Shakeel Butt, Yang Shi,
	Jason Gunthorpe, Mike Kravetz, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel

On 02.10.20 10:10, Michal Hocko wrote:
> On Fri 02-10-20 09:50:02, David Hildenbrand wrote:
>>>>> - huge page sizes controllable by the userspace?
>>>>
>>>> It might be good to allow advanced users to choose the page sizes, so they
>>>> have better control of their applications.
>>>
>>> Could you elaborate more? Those advanced users can use hugetlb, right?
>>> They get a very good control over page size and pool preallocation etc.
>>> So they can get what they need - assuming there is enough memory.
>>>
>>
>> I am still not convinced that 1G THP (TGP :) ) are really what we want
>> to support. I can understand that there are some use cases that might
>> benefit from it, especially:
> 
> Well, I would say that internal support for larger huge pages (e.g. 1GB)
> that can transparently split under memory pressure is a useful
>> functionality. I cannot really judge how complex that would be

Right, but that's then something different than serving (scarce,
unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing
wrong about *real* THP support, meaning, e.g., grouping consecutive
pages and converting them back and forth on demand. (E.g., 1GB ->
multiple 2MB -> multiple single pages), for example, when having to
migrate such a gigantic page. But that's very different from our
existing gigantic page code as far as I can tell.

> considering that 2MB THPs have turned out to be quite a pain but the
> situation has settled over time. Maybe our current code base is prepared
> for that much better.
> 
> Exposing that interface to the userspace is a different story of course.
> I do agree that we likely do not want to be very explicit about that.
> E.g. an interface for address space defragmentation without any more
> specifics sounds like a useful feature to me. It will be up to the
> kernel to decide which huge pages to use.

Yes, I think one important feature would be that we don't end up placing
a gigantic page where only a handful of pages are actually populated
without green light from the application - because that's what some user
space applications care about (not consuming more memory than intended.
IIUC, this is also what this patch set does). I'm fine with placing
gigantic pages if it really just "defragments" the address space layout,
without filling unpopulated holes.

Then, this would be mostly invisible to user space, and we really
wouldn't have to care about any configuration.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
  2020-10-02  8:30           ` David Hildenbrand
@ 2020-10-05 15:03             ` Zi Yan
  2020-10-05 15:55               ` Matthew Wilcox
                                 ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Zi Yan @ 2020-10-05 15:03 UTC (permalink / raw)
  To: David Hildenbrand, Michal Hocko
  Cc: linux-mm, Kirill A . Shutemov, Rik van Riel, Roman Gushchin,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, William Kucharski, Andrea Arcangeli, John Hubbard,
	David Nellans, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3653 bytes --]

On 2 Oct 2020, at 4:30, David Hildenbrand wrote:

> On 02.10.20 10:10, Michal Hocko wrote:
>> On Fri 02-10-20 09:50:02, David Hildenbrand wrote:
>>>>>> - huge page sizes controllable by the userspace?
>>>>>
>>>>> It might be good to allow advanced users to choose the page sizes, so they
>>>>> have better control of their applications.
>>>>
>>>> Could you elaborate more? Those advanced users can use hugetlb, right?
>>>> They get a very good control over page size and pool preallocation etc.
>>>> So they can get what they need - assuming there is enough memory.
>>>>
>>>
>>> I am still not convinced that 1G THP (TGP :) ) are really what we want
>>> to support. I can understand that there are some use cases that might
>>> benefit from it, especially:
>>
>> Well, I would say that internal support for larger huge pages (e.g. 1GB)
>> that can transparently split under memory pressure is a useful
>> functionality. I cannot really judge how complex that would be
>
> Right, but that's then something different than serving (scarce,
> unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing
> wrong about *real* THP support, meaning, e.g., grouping consecutive
> pages and converting them back and forth on demand. (E.g., 1GB ->
> multiple 2MB -> multiple single pages), for example, when having to
> migrate such a gigantic page. But that's very different from our
> existing gigantic page code as far as I can tell.

Serving 1GB PUD THPs from CMA is a compromise, since we do not want to
bump MAX_ORDER to 20 to enable 1GB page allocation in buddy allocator,
which needs a section size increase. In addition, unmovable pages cannot
be allocated in CMA, so allocating 1GB pages has much higher chance from
it than from ZONE_NORMAL.


>> considering that 2MB THPs have turned out to be quite a pain but the
>> situation has settled over time. Maybe our current code base is prepared
>> for that much better.

I am planning to refactor my code further to reduce the amount of
the added code, since PUD THP is very similar to PMD THP. One thing
I want to achieve is to enable split_huge_page to split any order of
pages to a group of any lower order of pages. A lot of code in this
patchset replicates the same behavior of PMD THP at the PUD level.
It might be possible to deduplicate most of the code.
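
Concretely, the target interface might be just (a prototype sketch only,
not in this patchset):

	/*
	 * Split @page into a group of pages of @new_order;
	 * new_order == 0 would match today's split_huge_page()
	 * behavior of splitting all the way to base pages.
	 */
	int split_huge_page_to_order(struct page *page, unsigned int new_order);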

>>
>> Exposing that interface to the userspace is a different story of course.
>> I do agree that we likely do not want to be very explicit about that.
>> E.g. an interface for address space defragmentation without any more
>> specifics sounds like a useful feature to me. It will be up to the
>> kernel to decide which huge pages to use.
>
> Yes, I think one important feature would be that we don't end up placing
> a gigantic page where only a handful of pages are actually populated
> without green light from the application - because that's what some user
> space applications care about (not consuming more memory than intended.
> IIUC, this is also what this patch set does). I'm fine with placing
> gigantic pages if it really just "defragments" the address space layout,
> without filling unpopulated holes.
>
> Then, this would be mostly invisible to user space, and we really
> wouldn't have to care about any configuration.


I agree that the interface should be as simple as no configuration for
most users. But I also wonder why we have hugetlbfs to allow users to
specify different kinds of page sizes, which seems against the discussion
above. Are we assuming advanced users should always use hugetlbfs instead
of THPs?


—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
  2020-10-02  7:50       ` David Hildenbrand
  2020-10-02  8:10         ` Michal Hocko
@ 2020-10-05 15:34         ` Zi Yan
  2020-10-05 17:30           ` David Hildenbrand
  1 sibling, 1 reply; 56+ messages in thread
From: Zi Yan @ 2020-10-05 15:34 UTC (permalink / raw)
  To: David Hildenbrand, Roman Gushchin
  Cc: Michal Hocko, linux-mm, Kirill A . Shutemov, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, William Kucharski, Andrea Arcangeli, John Hubbard,
	David Nellans, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2218 bytes --]

On 2 Oct 2020, at 3:50, David Hildenbrand wrote:

>>>> - huge page sizes controllable by the userspace?
>>>
>>> It might be good to allow advanced users to choose the page sizes, so they
>>> have better control of their applications.
>>
>> Could you elaborate more? Those advanced users can use hugetlb, right?
>> They get a very good control over page size and pool preallocation etc.
>> So they can get what they need - assuming there is enough memory.
>>
>
> I am still not convinced that 1G THP (TGP :) ) are really what we want
> to support. I can understand that there are some use cases that might
> benefit from it, especially:
>
> "I want a lot of memory, give me memory in any granularity you have, I
> absolutely don't care - but of course, more TGP might be good for
> performance." Say, you want a 5GB region, but only have a single 1GB
> hugepage lying around. hugetlbfs allocation will fail.
>
>
> But then, do we really want to optimize for such (very special?) use
> cases via " 58 files changed, 2396 insertions(+), 460 deletions(-)" ?

I am planning to further refactor my code to reduce its size and make
it more general to support any size of THPs. As Matthew’s patchset[1]
is removing the kernel’s THP size assumption, it might be a good time to
make THP support more general.

>
> I think gigantic pages are a scarce resource. Only selected applications
> *really* depend on them and benefit from them. Let these special
> applications handle it explicitly.
>
> Can we have a summary of use cases that would really benefit from this
> change?

For large machine learning applications, 1GB pages give a good performance boost[2].
The NVIDIA DGX A100 box now has 1TB of memory, which means 1GB pages are not
that scarce in GPU-equipped infrastructure[3].

In addition, @Roman Gushchin should be able to provide a more concrete
story from his side.


[1] https://lore.kernel.org/linux-mm/20200908195539.25896-1-willy@infradead.org/
[2] http://learningsys.org/neurips19/assets/papers/18_CameraReadySubmission_MLSys_NeurIPS_2019.pdf
[3] https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-dgx-a100-datasheet.pdf

—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
  2020-10-05 15:03             ` Zi Yan
@ 2020-10-05 15:55               ` Matthew Wilcox
  2020-10-05 17:04                 ` Roman Gushchin
  2020-10-05 19:12                 ` Zi Yan
  2020-10-05 17:16               ` Roman Gushchin
  2020-10-05 17:39               ` David Hildenbrand
  2 siblings, 2 replies; 56+ messages in thread
From: Matthew Wilcox @ 2020-10-05 15:55 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, Michal Hocko, linux-mm, Kirill A . Shutemov,
	Rik van Riel, Roman Gushchin, Shakeel Butt, Yang Shi,
	Jason Gunthorpe, Mike Kravetz, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel

On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote:
> On 2 Oct 2020, at 4:30, David Hildenbrand wrote:
> > Yes, I think one important feature would be that we don't end up placing
> > a gigantic page where only a handful of pages are actually populated
> > without green light from the application - because that's what some user
> > space applications care about (not consuming more memory than intended.
> > IIUC, this is also what this patch set does). I'm fine with placing
> > gigantic pages if it really just "defragments" the address space layout,
> > without filling unpopulated holes.
> >
> > Then, this would be mostly invisible to user space, and we really
> > wouldn't have to care about any configuration.
> 
> I agree that the interface should be as simple as no configuration to
> most users. But I also wonder why we have hugetlbfs to allow users to
> specify different kinds of page sizes, which seems against the discussion
> above. Are we assuming advanced users should always use hugetlbfs instead
> of THPs?

Evolution doesn't always produce the best outcomes ;-)

A perennial mistake we've made is "Oh, this is a strange & new & weird
feature that most applications will never care about, let's put it in
hugetlbfs where nobody will notice and we don't have to think about it
in the core VM"

And then what was initially strange & new & weird gradually becomes
something that most applications just want to have happen automatically,
and telling them all to go use hugetlbfs becomes untenable, so we move
the feature into the core VM.

It is absurd that my phone is attempting to manage a million 4kB pages.
I think even trying to manage a quarter-million 16kB pages is too much
work, and really it would be happier managing 65,000 64kB pages.

Extend that into the future a decade or two, and we'll be expecting
that it manages memory in megabyte sized units and uses PMD and PUD
mappings by default.  PTE mappings will still be used, but very much
on a "Oh you have a tiny file, OK, we'll fragment a megabyte page into
smaller pages to not waste too much memory when mapping it" basis.  So,
yeah, PUD sized mappings have problems today, but we should be writing
software now so a Pixel 15 in a decade can boot a kernel built five
years from now and have PUD mappings Just Work without requiring the
future userspace programmer to "use hugetlbfs".

One of the longer-term todo items is to support variable sized THPs for
anonymous memory, just like I've done for the pagecache.  With that in
place, I think scaling up from PMD sized pages to PUD sized pages starts
to look more natural.  Itanium and PA-RISC (two architectures that will
never be found in phones...) support 1MB, 4MB, 16MB, 64MB and upwards.
The RiscV spec you pointed me at the other day confines itself to adding
support for 16, 64 & 256kB today, but does note that 8MB, 32MB and 128MB
sizes would be possible additions in the future.


But, back to today, what to do with this patchset?  Even on my 16GB
laptop, let alone my 4GB phone, I'm uncertain that allocating a 1GB
page is ever the right decision to make.  But my laptop runs a "mixed"
workload, and if you could convince me that Firefox would run 10% faster
by using a 1GB page as its in-memory cache, well, I'd be sold.

I do like having the kernel figure out what's in the best interests of the
system as a whole.  Apps don't have enough information, and while they
can provide hints, they're often wrong.  So, let's say an app maps 8GB
of anonymous memory.  As the app accesses it, we should probably start
by allocating 4kB pages to back that memory.  As time goes on and that
memory continues to be accessed and more memory is accessed, it makes
sense to keep track of that, replacing the existing 4kB pages with, say,
16-64kB pages and allocating newly accessed memory with larger pages.
Eventually that should grow to 2MB allocations and PMD mappings.
And then continue on, all the way to 1GB pages.

We also need to be able to figure out that it's not being effective
any more.  One of the issues with tracing accessed/dirty at the 1GB level
is that writing an entire 1GB page is going to take 0.25 seconds on a x4
gen3 PCIe link.  I know swapping sucks, but that's extreme.  So to use
1GB pages effectively today, we need to fragment them before choosing to
swap them out (*)  Maybe that's the point where we can start to say "OK,
this sized mapping might not be effective any more".  On the other hand,
that might not work for some situations.  Imagine, eg, a matrix multiply
(everybody's favourite worst-case scenario).  C = A * B where each of A,
B and C is too large to fit in DRAM.  There are going to be points of the
calculation where each element of A is going to be walked sequentially,
and so it'd be nice to use larger PTEs to map it, but then we need to
destroy that almost immediately to allow other things to use the memory.


I think I'm leaning towards not merging this patchset yet.  I'm in
agreement with the goals (allowing systems to use PUD-sized pages
automatically), but I think we need to improve the infrastructure to
make it work well automatically.  Does that make sense?

(*) It would be nice if hardware provided a way to track D/A on a sub-PTE
level when using PMD/PUD sized mappings.  I don't know of any that does
that today.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
  2020-10-05 15:55               ` Matthew Wilcox
@ 2020-10-05 17:04                 ` Roman Gushchin
  2020-10-05 19:12                 ` Zi Yan
  1 sibling, 0 replies; 56+ messages in thread
From: Roman Gushchin @ 2020-10-05 17:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Zi Yan, David Hildenbrand, Michal Hocko, linux-mm,
	Kirill A . Shutemov, Rik van Riel, Shakeel Butt, Yang Shi,
	Jason Gunthorpe, Mike Kravetz, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel

On Mon, Oct 05, 2020 at 04:55:53PM +0100, Matthew Wilcox wrote:
> On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote:
> > On 2 Oct 2020, at 4:30, David Hildenbrand wrote:
> > > Yes, I think one important feature would be that we don't end up placing
> > > a gigantic page where only a handful of pages are actually populated
> > > without green light from the application - because that's what some user
> > > space applications care about (not consuming more memory than intended.
> > > IIUC, this is also what this patch set does). I'm fine with placing
> > > gigantic pages if it really just "defragments" the address space layout,
> > > without filling unpopulated holes.
> > >
> > > Then, this would be mostly invisible to user space, and we really
> > > wouldn't have to care about any configuration.
> > 
> > I agree that the interface should be as simple as no configuration to
> > most users. But I also wonder why we have hugetlbfs to allow users to
> > specify different kinds of page sizes, which seems against the discussion
> > above. Are we assuming advanced users should always use hugetlbfs instead
> > of THPs?
> 
> Evolution doesn't always produce the best outcomes ;-)
> 
> A perennial mistake we've made is "Oh, this is a strange & new & weird
> feature that most applications will never care about, let's put it in
> hugetlbfs where nobody will notice and we don't have to think about it
> in the core VM"
> 
> And then what was initially strange & new & weird gradually becomes
> something that most applications just want to have happen automatically,
> and telling them all to go use hugetlbfs becomes untenable, so we move
> the feature into the core VM.
> 
> It is absurd that my phone is attempting to manage a million 4kB pages.
> I think even trying to manage a quarter-million 16kB pages is too much
> work, and really it would be happier managing 65,000 64kB pages.
> 
> Extend that into the future a decade or two, and we'll be expecting
> that it manages memory in megabyte sized units and uses PMD and PUD
> mappings by default.  PTE mappings will still be used, but very much
> on a "Oh you have a tiny file, OK, we'll fragment a megabyte page into
> smaller pages to not waste too much memory when mapping it" basis.  So,
> yeah, PUD sized mappings have problems today, but we should be writing
> software now so a Pixel 15 in a decade can boot a kernel built five
> years from now and have PUD mappings Just Work without requiring the
> future userspace programmer to "use hugetlbfs".
> 
> One of the longer-term todo items is to support variable sized THPs for
> anonymous memory, just like I've done for the pagecache.  With that in
> place, I think scaling up from PMD sized pages to PUD sized pages starts
> to look more natural.  Itanium and PA-RISC (two architectures that will
> never be found in phones...) support 1MB, 4MB, 16MB, 64MB and upwards.
> The RiscV spec you pointed me at the other day confines itself to adding
> support for 16, 64 & 256kB today, but does note that 8MB, 32MB and 128MB
> sizes would be possible additions in the future.

+1

> But, back to today, what to do with this patchset?  Even on my 16GB
> laptop, let alone my 4GB phone, I'm uncertain that allocating a 1GB
> page is ever the right decision to make.  But my laptop runs a "mixed"
> workload, and if you could convince me that Firefox would run 10% faster
> by using a 1GB page as its in-memory cache, well, I'd be sold.
> 
> I do like having the kernel figure out what's in the best interests of the
> system as a whole.  Apps don't have enough information, and while they
> can provide hints, they're often wrong.

It's definitely true for many cases, but not true for some other cases.

For example, we're running hhvm ( https://hhvm.com/ ) on a large number
of machines. Hhvm is known to have a significant performance benefit
when using hugepages. Exact numbers depend on the exact workload and
configuration, but there is a noticeable difference (in single-digit
percentages) between using 4k pages only, 4k pages and 2MB pages, and
4k, 2MB and some 1GB pages.

As of now, we have to use hugetlbfs, mostly because of the lack of 1GB THP support.
It has some significant downsides: e.g. hugetlb memory is not properly accounted
at the memory cgroup level, it requires additional "management", etc.
If we could allocate 1GB THPs with something like a new madvise,
have all memcg stats working, and destroy them transparently on application
exit, that alone would be valuable.

> So, let's say an app maps 8GB
> of anonymous memory.  As the app accesses it, we should probably start
> by allocating 4kB pages to back that memory.  As time goes on and that
> memory continues to be accessed and more memory is accessed, it makes
> sense to keep track of that, replacing the existing 4kB pages with, say,
> 16-64kB pages and allocating newly accessed memory with larger pages.
> Eventually that should grow to 2MB allocations and PMD mappings.
> And then continue on, all the way to 1GB pages.
> 
> We also need to be able to figure out that it's not being effective
> any more.  One of the issues with tracing accessed/dirty at the 1GB level
> is that writing an entire 1GB page is going to take 0.25 seconds on a x4
> gen3 PCIe link.  I know swapping sucks, but that's extreme.  So to use
> 1GB pages effectively today, we need to fragment them before choosing to
> swap them out (*)  Maybe that's the point where we can start to say "OK,
> this sized mapping might not be effective any more".  On the other hand,
> that might not work for some situations.  Imagine, eg, a matrix multiply
> (everybody's favourite worst-case scenario).  C = A * B where each of A,
> B and C is too large to fit in DRAM.  There are going to be points of the
> calculation where each element of A is going to be walked sequentially,
> and so it'd be nice to use larger PTEs to map it, but then we need to
> destroy that almost immediately to allow other things to use the memory.
> 
> 
> I think I'm leaning towards not merging this patchset yet.

Please correct me if I'm wrong, but in my understanding the effort
required for proper 1GB THP support can be roughly split into two parts:
1) technical support of PUD-sized THPs,
2) heuristics to create and destroy them automatically.

The second part will likely require a lot of experimenting and fine-tuning
and obviously depends on part 1 working. So I don't see why we should
postpone part 1, as long as it doesn't add too much overhead (which is not
the case, right?). If the problem is the introduction of semi-dead code,
we can put it under a config option (I would prefer not to do it though).

> I'm in
> agreement with the goals (allowing systems to use PUD-sized pages
> automatically), but I think we need to improve the infrastructure to
> make it work well automatically.  Does that make sense?

Is there a plan for this? How can we make sure we're making forward
progress here?

Thank you!

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
  2020-10-05 15:03             ` Zi Yan
  2020-10-05 15:55               ` Matthew Wilcox
@ 2020-10-05 17:16               ` Roman Gushchin
  2020-10-05 17:27                 ` David Hildenbrand
  2020-10-05 17:39               ` David Hildenbrand
  2 siblings, 1 reply; 56+ messages in thread
From: Roman Gushchin @ 2020-10-05 17:16 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, Michal Hocko, linux-mm, Kirill A . Shutemov,
	Rik van Riel, Matthew Wilcox, Shakeel Butt, Yang Shi,
	Jason Gunthorpe, Mike Kravetz, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel

On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote:
> On 2 Oct 2020, at 4:30, David Hildenbrand wrote:
> 
> > On 02.10.20 10:10, Michal Hocko wrote:
> >> On Fri 02-10-20 09:50:02, David Hildenbrand wrote:
> >>>>>> - huge page sizes controllable by the userspace?
> >>>>>
> >>>>> It might be good to allow advanced users to choose the page sizes, so they
> >>>>> have better control of their applications.
> >>>>
> >>>> Could you elaborate more? Those advanced users can use hugetlb, right?
> >>>> They get a very good control over page size and pool preallocation etc.
> >>>> So they can get what they need - assuming there is enough memory.
> >>>>
> >>>
> >>> I am still not convinced that 1G THP (TGP :) ) are really what we want
> >>> to support. I can understand that there are some use cases that might
> >>> benefit from it, especially:
> >>
> >> Well, I would say that internal support for larger huge pages (e.g. 1GB)
> >> that can transparently split under memory pressure is a useful
> >> functionality. I cannot really judge how complex that would be
> >
> > Right, but that's then something different than serving (scarce,
> > unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing
> > wrong about *real* THP support, meaning, e.g., grouping consecutive
> > pages and converting them back and forth on demand. (E.g., 1GB ->
> > multiple 2MB -> multiple single pages), for example, when having to
> > migrate such a gigantic page. But that's very different from our
> > existing gigantic page code as far as I can tell.
> 
> Serving 1GB PUD THPs from CMA is a compromise, since we do not want to
> bump MAX_ORDER to 20 to enable 1GB page allocation in buddy allocator,
> which needs section size increase. In addition, unmovable pages cannot
> be allocated in CMA, so allocating 1GB pages has much higher chance from
> it than from ZONE_NORMAL.

s/higher chances/non-zero chances

Currently we have nothing that prevents the fragmentation of memory
with unmovable pages at the 1GB scale. It means that in the common case
it's highly unlikely to find a contiguous GB without any unmovable page.
As of now, CMA seems to be the only working option.

However, it seems there are other use cases for the allocation of contiguous
1GB pages: e.g. secretfd ( https://lwn.net/Articles/831628/ ), where using
1GB pages can reduce the fragmentation of the direct mapping.

So I wonder if we need a new mechanism to avoid fragmentation at the 1GB/PUD scale,
e.g. something like a second level of pageblocks. That would allow grouping
all unmovable memory in a few 1GB blocks and leave more 1GB regions available for
gigantic THPs and other use cases. I'm now looking into how it can be done.
If anybody has any ideas here, I'd appreciate it a lot.
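
To illustrate the direction (all names below are made up, this is not
working code): keep one flag per 1GB physical region next to the existing
per-pageblock data, and consult it when deciding where unmovable
allocations may go:

    /* one bit per 1GB physical region (hypothetical) */
    static unsigned long *gb_region_unmovable;

    static inline unsigned long pfn_to_gb_region(unsigned long pfn)
    {
            return pfn >> (30 - PAGE_SHIFT);   /* 1GB worth of pfns */
    }

    static inline bool gb_region_is_unmovable(unsigned long pfn)
    {
            return test_bit(pfn_to_gb_region(pfn), gb_region_unmovable);
    }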

Thanks!

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
  2020-10-05 17:16               ` Roman Gushchin
@ 2020-10-05 17:27                 ` David Hildenbrand
  2020-10-05 18:25                   ` Roman Gushchin
  0 siblings, 1 reply; 56+ messages in thread
From: David Hildenbrand @ 2020-10-05 17:27 UTC (permalink / raw)
  To: Roman Gushchin, Zi Yan
  Cc: Michal Hocko, linux-mm, Kirill A . Shutemov, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, William Kucharski, Andrea Arcangeli, John Hubbard,
	David Nellans, linux-kernel

On 05.10.20 19:16, Roman Gushchin wrote:
> On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote:
>> On 2 Oct 2020, at 4:30, David Hildenbrand wrote:
>>
>>> On 02.10.20 10:10, Michal Hocko wrote:
>>>> On Fri 02-10-20 09:50:02, David Hildenbrand wrote:
>>>>>>>> - huge page sizes controllable by the userspace?
>>>>>>>
>>>>>>> It might be good to allow advanced users to choose the page sizes, so they
>>>>>>> have better control of their applications.
>>>>>>
>>>>>> Could you elaborate more? Those advanced users can use hugetlb, right?
>>>>>> They get a very good control over page size and pool preallocation etc.
>>>>>> So they can get what they need - assuming there is enough memory.
>>>>>>
>>>>>
>>>>> I am still not convinced that 1G THP (TGP :) ) are really what we want
>>>>> to support. I can understand that there are some use cases that might
>>>>> benefit from it, especially:
>>>>
>>>> Well, I would say that internal support for larger huge pages (e.g. 1GB)
>>>> that can transparently split under memory pressure is a useful
>>>> functionality. I cannot really judge how complex that would be
>>>
>>> Right, but that's then something different than serving (scarce,
>>> unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing
>>> wrong about *real* THP support, meaning, e.g., grouping consecutive
>>> pages and converting them back and forth on demand. (E.g., 1GB ->
>>> multiple 2MB -> multiple single pages), for example, when having to
>>> migrate such a gigantic page. But that's very different from our
>>> existing gigantic page code as far as I can tell.
>>
>> Serving 1GB PUD THPs from CMA is a compromise, since we do not want to
>> bump MAX_ORDER to 20 to enable 1GB page allocation in buddy allocator,
>> which needs section size increase. In addition, unmovable pages cannot
>> be allocated in CMA, so allocating 1GB pages has much higher chance from
>> it than from ZONE_NORMAL.
> 
> s/higher chances/non-zero chances

Well, the longer the system runs (and consumes a significant amount of
available main memory), the less likely it is.

> 
> Currently we have nothing that prevents the fragmentation of memory
> with unmovable pages at the 1GB scale. It means that in the common case
> it's highly unlikely to find a contiguous GB without any unmovable page.
> As of now, CMA seems to be the only working option.
> 

And I completely dislike the use of CMA in this context (for example,
allocating via CMA and freeing via the buddy by patching CMA when
splitting up PUDs ...).

> However, it seems there are other use cases for the allocation of contiguous
> 1GB pages: e.g. secretfd ( https://lwn.net/Articles/831628/ ), where using
> 1GB pages can reduce the fragmentation of the direct mapping.

Yes, see RFC v1 where I already cced Mike.

> 
> So I wonder if we need a new mechanism to avoid fragmentation at the 1GB/PUD scale,
> e.g. something like a second level of pageblocks. That would allow grouping
> all unmovable memory in a few 1GB blocks and leave more 1GB regions available for
> gigantic THPs and other use cases. I'm now looking into how it can be done.

Anything bigger than sections is somewhat problematic: you have to track
that data somewhere. It cannot be stored in the section itself (in contrast
to pageblocks).

> If anybody has any ideas here, I'd appreciate it a lot.

I already brought up the idea of ZONE_PREFER_MOVABLE (see RFC v1). That
somewhat mimics what CMA does (when sized reasonably), works well with
memory hot(un)plug, and is immune to misconfiguration. Within such a
zone, we can try to optimize the placement of larger blocks.
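
Roughly, such a zone would slot into the zone hierarchy along these lines
(a sketch only - ZONE_PREFER_MOVABLE is hypothetical and the usual config
#ifdefs are omitted):

    enum zone_type {
            ZONE_DMA,
            ZONE_DMA32,
            ZONE_NORMAL,
            ZONE_HIGHMEM,
            ZONE_PREFER_MOVABLE,    /* movable preferred; unmovable only
                                     * as a last-resort fallback */
            ZONE_MOVABLE,           /* movable allocations only */
            ZONE_DEVICE,
            __MAX_NR_ZONES
    };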

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
  2020-10-05 15:34         ` Zi Yan
@ 2020-10-05 17:30           ` David Hildenbrand
  0 siblings, 0 replies; 56+ messages in thread
From: David Hildenbrand @ 2020-10-05 17:30 UTC (permalink / raw)
  To: Zi Yan, Roman Gushchin
  Cc: Michal Hocko, linux-mm, Kirill A . Shutemov, Rik van Riel,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, William Kucharski, Andrea Arcangeli, John Hubbard,
	David Nellans, linux-kernel

>> I think gigantic pages are a sparse resource. Only selected applications
>> *really* depend on them and benefit from them. Let these special
>> applications handle it explicitly.
>>
>> Can we have a summary of use cases that would really benefit from this
>> change?
> 
> For large machine learning applications, 1GB pages give a good performance boost[2].
> The NVIDIA DGX A100 box now has 1TB of memory, which means 1GB pages are not
> that sparse in GPU-equipped infrastructure[3].

Well, they *are* sparse and there are absolutely no guarantees until you
reserve them via CMA, which is just plain ugly IMHO.

In the same setup, you can most probably use hugetlbfs and achieve a
similar result. Not saying it is very user-friendly.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
  2020-10-05 15:03             ` Zi Yan
  2020-10-05 15:55               ` Matthew Wilcox
  2020-10-05 17:16               ` Roman Gushchin
@ 2020-10-05 17:39               ` David Hildenbrand
  2020-10-05 18:05                 ` Zi Yan
  2 siblings, 1 reply; 56+ messages in thread
From: David Hildenbrand @ 2020-10-05 17:39 UTC (permalink / raw)
  To: Zi Yan, Michal Hocko
  Cc: linux-mm, Kirill A . Shutemov, Rik van Riel, Roman Gushchin,
	Matthew Wilcox, Shakeel Butt, Yang Shi, Jason Gunthorpe,
	Mike Kravetz, William Kucharski, Andrea Arcangeli, John Hubbard,
	David Nellans, linux-kernel

>>> considering that 2MB THPs have turned out to be quite a pain, but
>>> the situation has settled over time. Maybe our current code base is prepared
>>> for that much better.
> 
> I am planning to refactor my code further to reduce the amount of
> the added code, since PUD THP is very similar to PMD THP. One thing
> I want to achieve is to enable split_huge_page to split any order of
> pages to a group of any lower order of pages. A lot of code in this
> patchset is replicating the same behavior of PMD THP at PUD level.
> It might be possible to deduplicate most of the code.
> 
>>>
>>> Exposing that interface to the userspace is a different story of course.
>>> I do agree that we likely do not want to be very explicit about that.
>>> E.g. an interface for address space defragmentation without any more
>>> specifics sounds like a useful feature to me. It will be up to the
>>> kernel to decide which huge pages to use.
>>
>> Yes, I think one important feature would be that we don't end up placing
>> a gigantic page where only a handful of pages are actually populated
>> without green light from the application - because that's what some user
>> space applications care about (not consuming more memory than intended.
>> IIUC, this is also what this patch set does). I'm fine with placing
>> gigantic pages if it really just "defragments" the address space layout,
>> without filling unpopulated holes.
>>
>> Then, this would be mostly invisible to user space, and we really
>> wouldn't have to care about any configuration.
> 
> 
> I agree that the interface should be as simple as no configuration to
> most users. But I also wonder why we have hugetlbfs to allow users to
> specify different kinds of page sizes, which seems against the discussion
> above. Are we assuming advanced users should always use hugetlbfs instead
> of THPs?

Well, with hugetlbfs you get a real control over which pagesizes to use.
No mixture, guarantees.

In some environments you might want to control which application gets
which pagesize. I know of database applications and hypervisors that
sometimes really want 2MB huge pages instead of 1GB huge pages. And
sometimes you really want/need 1GB huge pages (e.g., low-latency
applications, real-time KVM, ...).

Simple example: KVM with postcopy live migration

While 2MB huge pages work reasonably fine, migrating 1GB gigantic pages
on demand (via userfaultfd) is painfully slow / impractical.
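
Back-of-the-envelope (assuming a ~10 Gbit/s, i.e. roughly 1.2 GB/s,
migration link): a 2MB page arrives in ~1.7 ms, while a single 1GB page
stalls the faulting vCPU for close to a second per miss.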

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
  2020-10-05 17:39               ` David Hildenbrand
@ 2020-10-05 18:05                 ` Zi Yan
  2020-10-05 18:48                   ` David Hildenbrand
  2020-10-06 11:59                   ` Michal Hocko
  0 siblings, 2 replies; 56+ messages in thread
From: Zi Yan @ 2020-10-05 18:05 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, linux-mm, Kirill A . Shutemov, Rik van Riel,
	Roman Gushchin, Matthew Wilcox, Shakeel Butt, Yang Shi,
	Jason Gunthorpe, Mike Kravetz, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel


On 5 Oct 2020, at 13:39, David Hildenbrand wrote:

>>>> considering that 2MB THPs have turned out to be quite a pain, but
>>>> the situation has settled over time. Maybe our current code base is prepared
>>>> for that much better.
>>
>> I am planning to refactor my code further to reduce the amount of
>> the added code, since PUD THP is very similar to PMD THP. One thing
>> I want to achieve is to enable split_huge_page to split any order of
>> pages to a group of any lower order of pages. A lot of code in this
>> patchset is replicating the same behavior of PMD THP at PUD level.
>> It might be possible to deduplicate most of the code.
>>
>>>>
>>>> Exposing that interface to the userspace is a different story of course.
>>>> I do agree that we likely do not want to be very explicit about that.
>>>> E.g. an interface for address space defragmentation without any more
>>>> specifics sounds like a useful feature to me. It will be up to the
>>>> kernel to decide which huge pages to use.
>>>
>>> Yes, I think one important feature would be that we don't end up placing
>>> a gigantic page where only a handful of pages are actually populated
>>> without green light from the application - because that's what some user
>>> space applications care about (not consuming more memory than intended.
>>> IIUC, this is also what this patch set does). I'm fine with placing
>>> gigantic pages if it really just "defragments" the address space layout,
>>> without filling unpopulated holes.
>>>
>>> Then, this would be mostly invisible to user space, and we really
>>> wouldn't have to care about any configuration.
>>
>>
>> I agree that the interface should be as simple as no configuration to
>> most users. But I also wonder why we have hugetlbfs to allow users to
>> specify different kinds of page sizes, which seems against the discussion
>> above. Are we assuming advanced users should always use hugetlbfs instead
>> of THPs?
>
> Well, with hugetlbfs you get a real control over which pagesizes to use.
> No mixture, guarantees.
>
> In some environments you might want to control which application gets
> which pagesize. I know of database applications and hypervisors that
> sometimes really want 2MB huge pages instead of 1GB huge pages. And
> sometimes you really want/need 1GB huge pages (e.g., low-latency
> applications, real-time KVM, ...).
>
> Simple example: KVM with postcopy live migration
>
> While 2MB huge pages work reasonably fine, migrating 1GB gigantic pages
> on demand (via userfaultfd) is painfully slow / impractical.


The real control of hugetlbfs comes from the interfaces provided by
the kernel. If kernel provides similar interfaces to control page sizes
of THPs, it should work the same as hugetlbfs. Mixing page sizes usually
comes from system memory fragmentation and hugetlbfs does not have this
mixture because of its special allocation pools not because of the code
itself. If THPs are allocated from the same pools, they would act
the same as hugetlbfs. What am I missing here?

I just do not get why hugetlbfs is so special that it can have fine-grained
pagesize control when normal pages cannot get it. The “it should be invisible
to userspace” argument suddenly does not hold for hugetlbfs.


—
Best Regards,
Yan Zi


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
  2020-10-05 17:27                 ` David Hildenbrand
@ 2020-10-05 18:25                   ` Roman Gushchin
  2020-10-05 18:33                     ` David Hildenbrand
  0 siblings, 1 reply; 56+ messages in thread
From: Roman Gushchin @ 2020-10-05 18:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Zi Yan, Michal Hocko, linux-mm, Kirill A . Shutemov,
	Rik van Riel, Matthew Wilcox, Shakeel Butt, Yang Shi,
	Jason Gunthorpe, Mike Kravetz, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel

On Mon, Oct 05, 2020 at 07:27:47PM +0200, David Hildenbrand wrote:
> On 05.10.20 19:16, Roman Gushchin wrote:
> > On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote:
> >> On 2 Oct 2020, at 4:30, David Hildenbrand wrote:
> >>
> >>> On 02.10.20 10:10, Michal Hocko wrote:
> >>>> On Fri 02-10-20 09:50:02, David Hildenbrand wrote:
> >>>>>>>> - huge page sizes controllable by the userspace?
> >>>>>>>
> >>>>>>> It might be good to allow advanced users to choose the page sizes, so they
> >>>>>>> have better control of their applications.
> >>>>>>
> >>>>>> Could you elaborate more? Those advanced users can use hugetlb, right?
> >>>>>> They get a very good control over page size and pool preallocation etc.
> >>>>>> So they can get what they need - assuming there is enough memory.
> >>>>>>
> >>>>>
> >>>>> I am still not convinced that 1G THP (TGP :) ) are really what we want
> >>>>> to support. I can understand that there are some use cases that might
> >>>>> benefit from it, especially:
> >>>>
> >>>> Well, I would say that internal support for larger huge pages (e.g. 1GB)
> >>>> that can transparently split under memory pressure is a useful
> >>>> functionality. I cannot really judge how complex that would be
> >>>
> >>> Right, but that's then something different than serving (scarce,
> >>> unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing
> >>> wrong about *real* THP support, meaning, e.g., grouping consecutive
> >>> pages and converting them back and forth on demand. (E.g., 1GB ->
> >>> multiple 2MB -> multiple single pages), for example, when having to
> >>> migrate such a gigantic page. But that's very different from our
> >>> existing gigantic page code as far as I can tell.
> >>
> >> Serving 1GB PUD THPs from CMA is a compromise, since we do not want to
> >> bump MAX_ORDER to 20 to enable 1GB page allocation in buddy allocator,
> >> which needs section size increase. In addition, unmovable pages cannot
> >> be allocated in CMA, so allocating 1GB pages has much higher chance from
> >> it than from ZONE_NORMAL.
> > 
> > s/higher chances/non-zero chances
> 
> Well, the longer the system runs (and consumes a significant amount of
> available main memory), the less likely it is.
> 
> > 
> > Currently we have nothing that prevents the fragmentation of memory
> > with unmovable pages at the 1GB scale. It means that in the common case
> > it's highly unlikely to find a contiguous GB without any unmovable page.
> > As of now, CMA seems to be the only working option.
> > 
> 
> And I completely dislike the use of CMA in this context (for example,
> allocating via CMA and freeing via the buddy by patching CMA when
> splitting up PUDs ...).
> 
> > However, it seems there are other use cases for the allocation of contiguous
> > 1GB pages: e.g. secretfd ( https://lwn.net/Articles/831628/ ), where using
> > 1GB pages can reduce the fragmentation of the direct mapping.
> 
> Yes, see RFC v1 where I already cced Mike.
> 
> > 
> > So I wonder if we need a new mechanism to avoid fragmentation at the 1GB/PUD scale,
> > e.g. something like a second level of pageblocks. That would allow grouping
> > all unmovable memory in a few 1GB blocks and leave more 1GB regions available for
> > gigantic THPs and other use cases. I'm now looking into how it can be done.
> 
> Anything bigger than sections is somewhat problematic: you have to track
> that data somewhere. It cannot be the section (in contrast to pageblocks)

Well, it's not a large amount of data: the number of 1GB regions is not that
high even on very large machines.

> 
> > If anybody has any ideas here, I'd appreciate it a lot.
> 
> I already brought up the idea of ZONE_PREFER_MOVABLE (see RFC v1). That
> somewhat mimics what CMA does (when sized reasonably), works well with
> memory hot(un)plug, and is immune to misconfiguration. Within such a
> zone, we can try to optimize the placement of larger blocks.

Thank you for pointing at it!

The main problem with it is the same as with ZONE_MOVABLE: it does require
a boot-time educated guess on a good size. I admit that the CMA does too.

But I really hope that a long-term solution will not require pre-configuration.
I do not see why we fundamentally can't group unmovable allocations in (a few)
1GB regions. Basically all we need to do is choose a nearby 2MB block when we
don't have enough free pages on the unmovable free list and are about to steal
a new 2MB block. I know it doesn't work exactly this way, but just as an
illustration. In reality, when stealing a block, under some conditions we might
want to steal the whole 1GB region. In that case the following unmovable
allocations will not lead to stealing new blocks from (potentially) different
1GB regions. I have no working code yet, just thinking in this direction.
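
As pure pseudocode (every helper below is made up), the direction would
be something like:

    static struct page *steal_block_for_unmovable(struct zone *zone)
    {
            struct page *block = find_free_2mb_block(zone); /* hypothetical */

            /*
             * If we had to break into a fresh 1GB region anyway, claim
             * the whole region for unmovable allocations, so that later
             * unmovable requests are served from here instead of
             * polluting other, still-movable 1GB regions.
             */
            if (gb_region_fully_movable(block))             /* hypothetical */
                    mark_gb_region_unmovable(block);        /* hypothetical */

            return block;
    }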

Thanks!

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
  2020-10-05 18:25                   ` Roman Gushchin
@ 2020-10-05 18:33                     ` David Hildenbrand
  2020-10-05 19:11                       ` Roman Gushchin
  0 siblings, 1 reply; 56+ messages in thread
From: David Hildenbrand @ 2020-10-05 18:33 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Zi Yan, Michal Hocko, linux-mm, Kirill A . Shutemov,
	Rik van Riel, Matthew Wilcox, Shakeel Butt, Yang Shi,
	Jason Gunthorpe, Mike Kravetz, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel

On 05.10.20 20:25, Roman Gushchin wrote:
> On Mon, Oct 05, 2020 at 07:27:47PM +0200, David Hildenbrand wrote:
>> On 05.10.20 19:16, Roman Gushchin wrote:
>>> On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote:
>>>> On 2 Oct 2020, at 4:30, David Hildenbrand wrote:
>>>>
>>>>> On 02.10.20 10:10, Michal Hocko wrote:
>>>>>> On Fri 02-10-20 09:50:02, David Hildenbrand wrote:
>>>>>>>>>> - huge page sizes controllable by the userspace?
>>>>>>>>>
>>>>>>>>> It might be good to allow advanced users to choose the page sizes, so they
>>>>>>>>> have better control of their applications.
>>>>>>>>
>>>>>>>> Could you elaborate more? Those advanced users can use hugetlb, right?
>>>>>>>> They get a very good control over page size and pool preallocation etc.
>>>>>>>> So they can get what they need - assuming there is enough memory.
>>>>>>>>
>>>>>>>
>>>>>>> I am still not convinced that 1G THP (TGP :) ) are really what we want
>>>>>>> to support. I can understand that there are some use cases that might
>>>>>>> benefit from it, especially:
>>>>>>
>>>>>> Well, I would say that internal support for larger huge pages (e.g. 1GB)
>>>>>> that can transparently split under memory pressure is a useful
> >>>>>> functionality. I cannot really judge how complex that would be
>>>>>
>>>>> Right, but that's then something different than serving (scarce,
>>>>> unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing
>>>>> wrong about *real* THP support, meaning, e.g., grouping consecutive
>>>>> pages and converting them back and forth on demand. (E.g., 1GB ->
>>>>> multiple 2MB -> multiple single pages), for example, when having to
>>>>> migrate such a gigantic page. But that's very different from our
>>>>> existing gigantic page code as far as I can tell.
>>>>
>>>> Serving 1GB PUD THPs from CMA is a compromise, since we do not want to
>>>> bump MAX_ORDER to 20 to enable 1GB page allocation in buddy allocator,
> >>>> which needs section size increase. In addition, unmovable pages cannot
>>>> be allocated in CMA, so allocating 1GB pages has much higher chance from
>>>> it than from ZONE_NORMAL.
>>>
>>> s/higher chances/non-zero chances
>>
>> Well, the longer the system runs (and consumes a significant amount of
>> available main memory), the less likely it is.
>>
>>>
>>> Currently we have nothing that prevents the fragmentation of memory
>>> with unmovable pages at the 1GB scale. It means that in the common case
>>> it's highly unlikely to find a contiguous GB without any unmovable page.
>>> As of now, CMA seems to be the only working option.
>>>
>>
>> And I completely dislike the use of CMA in this context (for example,
>> allocating via CMA and freeing via the buddy by patching CMA when
>> splitting up PUDs ...).
>>
>>> However, it seems there are other use cases for the allocation of contiguous
>>> 1GB pages: e.g. secretfd ( https://lwn.net/Articles/831628/ ), where using
>>> 1GB pages can reduce the fragmentation of the direct mapping.
>>
>> Yes, see RFC v1 where I already cced Mike.
>>
>>>
>>> So I wonder if we need a new mechanism to avoid fragmentation at the 1GB/PUD scale,
>>> e.g. something like a second level of pageblocks. That would allow grouping
>>> all unmovable memory in a few 1GB blocks and leave more 1GB regions available for
>>> gigantic THPs and other use cases. I'm now looking into how it can be done.
>>
>> Anything bigger than sections is somewhat problematic: you have to track
>> that data somewhere. It cannot be the section (in contrast to pageblocks)
> 
> Well, it's not a large amount of data: the number of 1GB regions is not that
> high even on very large machines.

Yes, but then you can have very sparse systems. And some use cases would
actually want to avoid fragmentation on smaller levels (e.g., 128MB) -
optimizing memory efficiency by turning off banks and such ...

> 
>>
>>> If anybody has any ideas here, I'd appreciate it a lot.
>>
>> I already brought up the idea of ZONE_PREFER_MOVABLE (see RFC v1). That
>> somewhat mimics what CMA does (when sized reasonably), works well with
>> memory hot(un)plug, and is immune to misconfiguration. Within such a
>> zone, we can try to optimize the placement of larger blocks.
> 
> Thank you for pointing at it!
> 
> The main problem with it is the same as with ZONE_MOVABLE: it does require
> a boot-time educated guess on a good size. I admit that the CMA does too.

"Educated guess" of ratios like 1:1. 1:2, and even 1:4 (known from
highmem times) ares usually perfectly fine. And if you mess up - in
comparison to CMA - you won't shoot yourself in the foot, you get less
gigantic pages - which is usually better than before. I consider that a
clear win. Perfect? No. Can we be perfect? unlikely.

In comparison to CMA / ZONE_MOVABLE, a bad guess won't cause instabilities.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
  2020-10-05 18:05                 ` Zi Yan
@ 2020-10-05 18:48                   ` David Hildenbrand
  2020-10-06 11:59                   ` Michal Hocko
  1 sibling, 0 replies; 56+ messages in thread
From: David Hildenbrand @ 2020-10-05 18:48 UTC (permalink / raw)
  To: Zi Yan
  Cc: Michal Hocko, linux-mm, Kirill A . Shutemov, Rik van Riel,
	Roman Gushchin, Matthew Wilcox, Shakeel Butt, Yang Shi,
	Jason Gunthorpe, Mike Kravetz, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel

> The real control of hugetlbfs comes from the interfaces provided by
> the kernel. If kernel provides similar interfaces to control page sizes
> of THPs, it should work the same as hugetlbfs. Mixing page sizes usually
> comes from system memory fragmentation and hugetlbfs does not have this
> mixture because of its special allocation pools not because of the code

With hugetlbfs, you have a guarantee that all pages within your VMA have
the same page size. This is an important property. With THP you have the
guarantee that any page can be operated on, as if it were base-page
granularity.

Example: KVM on s390x

a) It cannot deal with THP. If you supply THP, the kernel will simply
split up all THP and prohibit new ones from getting formed. All works
well (well, no speedup because no THP).
b) It can deal with 1MB huge pages (in some configurations).
c) It cannot deal with 2G huge pages.

So user space really has to control which pagesize to use in case of
hugetlbfs.

> itself. If THPs are allocated from the same pools, they would act
> the same as hugetlbfs. What am I missing here?

Did I mention that I dislike taking THP from the CMA pool? ;)

> 
> I just do not get why hugetlbfs is so special that it can have fine-grained
> pagesize control when normal pages cannot get it. The “it should be invisible
> to userspace” argument suddenly does not hold for hugetlbfs.

It's not about "cannot get", it's about "do we need it". We do have a
trigger "THP yes/no". I wonder in which cases that wouldn't be sufficient.


The name "Transparent" implies that they *should* be transparent to user
space. This, unfortunately, is not completely true:

1. Performance aspects: Breaking up THP is bad for performance. This can
be observed fairly easily when using 4k-based memory ballooning in
virtualized environments. If we stick to the current THP size (e.g.,
2MB), we are mostly fine. Breaking up 1G THP into 2MB THP when required
is completely acceptable.

2. Wasting memory: Touch a 4K page, get 2M populated - somewhat
acceptable / controllable. Touch 4K, get 1G populated - that is not desirable.
And I think we mostly agree that we should operate only on
fully-populated ranges when replacing them with a 1G THP.
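
As a sketch of what "fully populated" could mean in code (the helper is
made up, this is not an existing interface):

    /* only consider collapse if every base page in the range is present */
    static bool range_fully_populated(struct mm_struct *mm,
                                      unsigned long start, unsigned long len)
    {
            unsigned long addr;

            for (addr = start; addr < start + len; addr += PAGE_SIZE)
                    if (!page_mapped_at(mm, addr))      /* hypothetical */
                            return false;
            return true;
    }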


But then, there is no observable difference between 1G THP and 2M THP
from a user space point of view, except performance.

So we are debating about "Should the kernel tell us that we can use 1G
THP for a VMA".  What if we were suddenly to support 2G THP (look at
arm64 how they support all kinds of huge pages for hugetlbfs)? Do we
really need *another* trigger?

What Michal proposed (IIUC) is rather user space telling the kernel
"this large memory range here is *really* important for performance,
please try to optimize the memory layout, give me the best you've got".

MADV_HUGEPAGE_1GB is just ugly.


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
  2020-10-05 18:33                     ` David Hildenbrand
@ 2020-10-05 19:11                       ` Roman Gushchin
  2020-10-06  8:25                         ` David Hildenbrand
  0 siblings, 1 reply; 56+ messages in thread
From: Roman Gushchin @ 2020-10-05 19:11 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Zi Yan, Michal Hocko, linux-mm, Kirill A . Shutemov,
	Rik van Riel, Matthew Wilcox, Shakeel Butt, Yang Shi,
	Jason Gunthorpe, Mike Kravetz, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel

On Mon, Oct 05, 2020 at 08:33:44PM +0200, David Hildenbrand wrote:
> On 05.10.20 20:25, Roman Gushchin wrote:
> > On Mon, Oct 05, 2020 at 07:27:47PM +0200, David Hildenbrand wrote:
> >> On 05.10.20 19:16, Roman Gushchin wrote:
> >>> On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote:
> >>>> On 2 Oct 2020, at 4:30, David Hildenbrand wrote:
> >>>>
> >>>>> On 02.10.20 10:10, Michal Hocko wrote:
> >>>>>> On Fri 02-10-20 09:50:02, David Hildenbrand wrote:
> >>>>>>>>>> - huge page sizes controllable by the userspace?
> >>>>>>>>>
> >>>>>>>>> It might be good to allow advanced users to choose the page sizes, so they
> >>>>>>>>> have better control of their applications.
> >>>>>>>>
> >>>>>>>> Could you elaborate more? Those advanced users can use hugetlb, right?
> >>>>>>>> They get a very good control over page size and pool preallocation etc.
> >>>>>>>> So they can get what they need - assuming there is enough memory.
> >>>>>>>>
> >>>>>>>
> >>>>>>> I am still not convinced that 1G THP (TGP :) ) are really what we want
> >>>>>>> to support. I can understand that there are some use cases that might
> >>>>>>> benefit from it, especially:
> >>>>>>
> >>>>>> Well, I would say that internal support for larger huge pages (e.g. 1GB)
> >>>>>> that can transparently split under memory pressure is a useful
> >>>>>> functionality. I cannot really judge how complex that would be
> >>>>>
> >>>>> Right, but that's then something different than serving (scarce,
> >>>>> unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing
> >>>>> wrong about *real* THP support, meaning, e.g., grouping consecutive
> >>>>> pages and converting them back and forth on demand. (E.g., 1GB ->
> >>>>> multiple 2MB -> multiple single pages), for example, when having to
> >>>>> migrate such a gigantic page. But that's very different from our
> >>>>> existing gigantic page code as far as I can tell.
> >>>>
> >>>> Serving 1GB PUD THPs from CMA is a compromise, since we do not want to
> >>>> bump MAX_ORDER to 20 to enable 1GB page allocation in buddy allocator,
> >>>> which needs section size increase. In addition, unmovable pages cannot
> >>>> be allocated in CMA, so allocating 1GB pages has much higher chance from
> >>>> it than from ZONE_NORMAL.
> >>>
> >>> s/higher chances/non-zero chances
> >>
> >> Well, the longer the system runs (and consumes a significant amount of
> >> available main memory), the less likely it is.
> >>
> >>>
> >>> Currently we have nothing that prevents the fragmentation of memory
> >>> with unmovable pages at the 1GB scale. It means that in the common case
> >>> it's highly unlikely to find a contiguous GB without any unmovable page.
> >>> As of now, CMA seems to be the only working option.
> >>>
> >>
> >> And I completely dislike the use of CMA in this context (for example,
> >> allocating via CMA and freeing via the buddy by patching CMA when
> >> splitting up PUDs ...).
> >>
> >>> However, it seems there are other use cases for the allocation of contiguous
> >>> 1GB pages: e.g. secretfd ( https://lwn.net/Articles/831628/ ), where using
> >>> 1GB pages can reduce the fragmentation of the direct mapping.
> >>
> >> Yes, see RFC v1 where I already cced Mike.
> >>
> >>>
> >>> So I wonder if we need a new mechanism to avoid fragmentation at the 1GB/PUD scale,
> >>> e.g. something like a second level of pageblocks. That would allow grouping
> >>> all unmovable memory in a few 1GB blocks and leave more 1GB regions available for
> >>> gigantic THPs and other use cases. I'm now looking into how it can be done.
> >>
> >> Anything bigger than sections is somewhat problematic: you have to track
> >> that data somewhere. It cannot be the section (in contrast to pageblocks)
> > 
> > Well, it's not a large amount of data: the number of 1GB regions is not that
> > high even on very large machines.
> 
> Yes, but then you can have very sparse systems. And some use cases would
> actually want to avoid fragmentation on smaller levels (e.g., 128MB) -
> optimizing memory efficiency by turning off banks and such ...

It's definitely a good question.

> > 
> >>
> >>> If anybody has any ideas here, I'd appreciate it a lot.
> >>
> >> I already brought up the idea of ZONE_PREFER_MOVABLE (see RFC v1). That
> >> somewhat mimics what CMA does (when sized reasonably), works well with
> >> memory hot(un)plug, and is immune to misconfiguration. Within such a
> >> zone, we can try to optimize the placement of larger blocks.
> > 
> > Thank you for pointing at it!
> > 
> > The main problem with it is the same as with ZONE_MOVABLE: it does require
> > a boot-time educated guess on a good size. I admit that the CMA does too.
> 
> "Educated guess" of ratios like 1:1. 1:2, and even 1:4 (known from
> highmem times) ares usually perfectly fine. And if you mess up - in
> comparison to CMA - you won't shoot yourself in the foot, you get less
> gigantic pages - which is usually better than before. I consider that a
> clear win. Perfect? No. Can we be perfect? unlikely.

I'm not necessarily opposing your idea, I just think it will be tricky
not to introduce additional overhead if the ratio is not perfectly
chosen. And there is simply a cost to adding a zone.

But fundamentally we're speaking about the same thing: grouping pages
by their movability on a smaller scale. With a new zone we'd split
pages into two parts at a fixed border; with a new pageblock layer,
in 1GB blocks.

I think the agreement is that we need such functionality.

Thanks!

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
  2020-10-05 15:55               ` Matthew Wilcox
  2020-10-05 17:04                 ` Roman Gushchin
@ 2020-10-05 19:12                 ` Zi Yan
  2020-10-05 19:37                   ` Matthew Wilcox
  1 sibling, 1 reply; 56+ messages in thread
From: Zi Yan @ 2020-10-05 19:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Hildenbrand, Michal Hocko, linux-mm, Kirill A . Shutemov,
	Rik van Riel, Roman Gushchin, Shakeel Butt, Yang Shi,
	Jason Gunthorpe, Mike Kravetz, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel


On 5 Oct 2020, at 11:55, Matthew Wilcox wrote:

> On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote:
>> On 2 Oct 2020, at 4:30, David Hildenbrand wrote:
>>> Yes, I think one important feature would be that we don't end up placing
>>> a gigantic page where only a handful of pages are actually populated
>>> without green light from the application - because that's what some user
>>> space applications care about (not consuming more memory than intended.
>>> IIUC, this is also what this patch set does). I'm fine with placing
>>> gigantic pages if it really just "defragments" the address space layout,
>>> without filling unpopulated holes.
>>>
>>> Then, this would be mostly invisible to user space, and we really
>>> wouldn't have to care about any configuration.
>>
>> I agree that the interface should be as simple as no configuration to
>> most users. But I also wonder why we have hugetlbfs to allow users to
>> specify different kinds of page sizes, which seems against the discussion
>> above. Are we assuming advanced users should always use hugetlbfs instead
>> of THPs?
>
> Evolution doesn't always produce the best outcomes ;-)
>
> A perennial mistake we've made is "Oh, this is a strange & new & weird
> feature that most applications will never care about, let's put it in
> hugetlbfs where nobody will notice and we don't have to think about it
> in the core VM"
>
> And then what was initially strange & new & weird gradually becomes
> something that most applications just want to have happen automatically,
> and telling them all to go use hugetlbfs becomes untenable, so we move
> the feature into the core VM.
>
> It is absurd that my phone is attempting to manage a million 4kB pages.
> I think even trying to manage a quarter-million 16kB pages is too much
> work, and really it would be happier managing 65,000 64kB pages.
>
> Extend that into the future a decade or two, and we'll be expecting
> that it manages memory in megabyte sized units and uses PMD and PUD
> mappings by default.  PTE mappings will still be used, but very much
> on a "Oh you have a tiny file, OK, we'll fragment a megabyte page into
> smaller pages to not waste too much memory when mapping it" basis.  So,
> yeah, PUD sized mappings have problems today, but we should be writing
> software now so a Pixel 15 in a decade can boot a kernel built five
> years from now and have PUD mappings Just Work without requiring the
> future userspace programmer to "use hugetlbfs".

I agree.

>
> One of the longer-term todo items is to support variable sized THPs for
> anonymous memory, just like I've done for the pagecache.  With that in
> place, I think scaling up from PMD sized pages to PUD sized pages starts
> to look more natural.  Itanium and PA-RISC (two architectures that will
> never be found in phones...) support 1MB, 4MB, 16MB, 64MB and upwards.
> The RiscV spec you pointed me at the other day confines itself to adding
> support for 16, 64 & 256kB today, but does note that 8MB, 32MB and 128MB
> sizes would be possible additions in the future.

Just to understand the todo items clearly. With your pagecache patchset,
the kernel should be able to understand variable sized THPs no matter whether
they are anonymous or not, right? For anonymous memory, we need kernel policies
to decide what THP sizes to use at allocation, what to do under
memory pressure, and so on. In terms of implementation, the THP split function
needs to support splitting from any order to any lower order. Anything I am missing here?
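
For the split part, the interface I have in mind would look roughly like
this (the signature is made up; nothing like it exists in the kernel today):

    /* split @page (any THP order) into pages of @new_order */
    int split_huge_page_to_order(struct page *page, unsigned int new_order);

    /* e.g. split a 1GB PUD THP into 2MB PMD THPs on x86_64: */
    ret = split_huge_page_to_order(page, HPAGE_PMD_ORDER);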

>
> But, back to today, what to do with this patchset?  Even on my 16GB
> laptop, let alone my 4GB phone, I'm uncertain that allocating a 1GB
> page is ever the right decision to make.  But my laptop runs a "mixed"
> workload, and if you could convince me that Firefox would run 10% faster
> by using a 1GB page as its in-memory cache, well, I'd be sold.
>
> I do like having the kernel figure out what's in the best interests of the
> system as a whole.  Apps don't have enough information, and while they
> can provide hints, they're often wrong.  So, let's say an app maps 8GB
> of anonymous memory.  As the app accesses it, we should probably start
> by allocating 4kB pages to back that memory.  As time goes on and that
> memory continues to be accessed and more memory is accessed, it makes
> sense to keep track of that, replacing the existing 4kB pages with, say,
> 16-64kB pages and allocating newly accessed memory with larger pages.
> Eventually that should grow to 2MB allocations and PMD mappings.
> And then continue on, all the way to 1GB pages.
>
> We also need to be able to figure out that it's not being effective
> any more.  One of the issues with tracing accessed/dirty at the 1GB level
> is that writing an entire 1GB page is going to take 0.25 seconds on a x4
> gen3 PCIe link.  I know swapping sucks, but that's extreme.  So to use
> 1GB pages effectively today, we need to fragment them before choosing to
> swap them out (*).  Maybe that's the point where we can start to say "OK,
> this sized mapping might not be effective any more".  On the other hand,
> that might not work for some situations.  Imagine, eg, a matrix multiply
> (everybody's favourite worst-case scenario).  C = A * B where each of A,
> B and C is too large to fit in DRAM.  There are going to be points of the
> calculation where each element of A is going to be walked sequentially,
> and so it'd be nice to use larger PTEs to map it, but then we need to
> destroy that almost immediately to allow other things to use the memory.
>
>
> I think I'm leaning towards not merging this patchset yet.  I'm in
> agreement with the goals (allowing systems to use PUD-sized pages
> automatically), but I think we need to improve the infrastructure to
> make it work well automatically.  Does that make sense?

I agree that this patchset should not be merged in its current form.
I think PUD THP support is part of variable sized THP support, but
the current form of the patchset does not have the “variable sized THP”
spirit yet and is more like special-case PUD support. I guess some
changes to the existing THP code to make PUD THP less of a special case
would make the whole patchset more acceptable?

Can you elaborate more on the infrastructure part? Thanks.

>
> (*) It would be nice if hardware provided a way to track D/A on a sub-PTE
> level when using PMD/PUD sized mappings.  I don't know of any that does
> that today.

I agree it would be a nice hardware feature, but it also has a high cost.
Each TLB entry would need 1024 bits to support this (two A/D bits for each
of the 512 4kB subpages of a 2MB entry), which is about the size of 16 TLB
entries, assuming each entry takes 8B. At that point the question becomes
why not just have a bigger TLB. ;)



—
Best Regards,
Yan Zi


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
  2020-10-05 19:12                 ` Zi Yan
@ 2020-10-05 19:37                   ` Matthew Wilcox
  0 siblings, 0 replies; 56+ messages in thread
From: Matthew Wilcox @ 2020-10-05 19:37 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, Michal Hocko, linux-mm, Kirill A . Shutemov,
	Rik van Riel, Roman Gushchin, Shakeel Butt, Yang Shi,
	Jason Gunthorpe, Mike Kravetz, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel

On Mon, Oct 05, 2020 at 03:12:55PM -0400, Zi Yan wrote:
> On 5 Oct 2020, at 11:55, Matthew Wilcox wrote:
> > One of the longer-term todo items is to support variable sized THPs for
> > anonymous memory, just like I've done for the pagecache.  With that in
> > place, I think scaling up from PMD sized pages to PUD sized pages starts
> > to look more natural.  Itanium and PA-RISC (two architectures that will
> > never be found in phones...) support 1MB, 4MB, 16MB, 64MB and upwards.
> > The RiscV spec you pointed me at the other day confines itself to adding
> > support for 16, 64 & 256kB today, but does note that 8MB, 32MB and 128MB
> > sizes would be possible additions in the future.
> 
> Just to understand the todo items clearly. With your pagecache patchset,
> the kernel should be able to understand variable sized THPs no matter whether
> they are anonymous or not, right?

... yes ... modulo bugs and places I didn't fix because only anonymous
pages can get there ;-)  There are still quite a few references to
HPAGE_PMD_MASK / SIZE / NR and I couldn't swear that they're all related
to things which are actually PMD sized.  I did fix a couple of places
where the anonymous path assumed that pages were PMD sized because I
thought we'd probably want to do that sooner rather than later.

> For anonymous memory, we need kernel policies
> to decide what THP sizes to use at allocation, what to do under
> memory pressure, and so on. In terms of implementation, the THP split function
> needs to support splitting from any order to any lower order. Anything I am missing here?

I think that's the bulk of the work.  The swap code also needs work so we
don't have to split pages to swap them out.

> > I think I'm leaning towards not merging this patchset yet.  I'm in
> > agreement with the goals (allowing systems to use PUD-sized pages
> > automatically), but I think we need to improve the infrastructure to
> > make it work well automatically.  Does that make sense?
> 
> I agree that this patchset should not be merged in its current form.
> I think PUD THP support is part of variable sized THP support, but
> the current form of the patchset does not have the “variable sized THP”
> spirit yet and is more like special-case PUD support. I guess some
> changes to the existing THP code to make PUD THP less of a special case
> would make the whole patchset more acceptable?
> 
> Can you elaborate more on the infrastructure part? Thanks.

Oh, this paragraph was just summarising the above.  We need to
be consistently using thp_size() instead of HPAGE_PMD_SIZE, etc.
I haven't put much effort yet into supporting pages which are larger than
PMD-size -- that is, if a page is mapped with a PMD entry, we assume
it's PMD-sized.  Once we can allocate a larger-than-PMD sized page,
that's off.  I assume a lot of that is dealt with in your patchset,
although I haven't audited it to check for that.
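
Roughly, the conversion to hunt for looks like this (thp_size() and
friends already exist; the "before" lines are the pattern to eliminate):

    /* before: hard-codes the PMD assumption */
    unsigned long bytes = HPAGE_PMD_SIZE;
    int nr = HPAGE_PMD_NR;

    /* after: derive everything from the page itself */
    unsigned long bytes = thp_size(page);   /* PAGE_SIZE << thp_order(page) */
    int nr = thp_nr_pages(page);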

> > (*) It would be nice if hardware provided a way to track D/A on a sub-PTE
> > level when using PMD/PUD sized mappings.  I don't know of any that does
> > that today.
> 
> I agree it would be a nice hardware feature, but it also has a high cost.
> Each TLB entry would need 1024 bits to support this (two A/D bits for each
> of the 512 4kB subpages of a 2MB entry), which is about the size of 16 TLB
> entries, assuming each entry takes 8B. At that point the question becomes
> why not just have a bigger TLB. ;)

Oh, we don't have to track at the individual-page level for this to be
useful.  Let's take the RISC-V Sv39 page table entry format as an example:

63-54 attributes
53-28 PPN2
27-19 PPN1
18-10 PPN0
9-8 RSW
7-0 DAGUXWRV

For a 2MB page, we currently insist that 18-10 are zero.  If we repurpose
eight of those nine bits as A/D bits, we can track at 512kB granularity.
For 1GB pages, we can use 16 of the 18 bits to track A/D at 128MB
granularity.  It's not great, but it is quite cheap!
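
A sketch of the 2MB case (bit positions chosen arbitrarily for
illustration; none of this is in any spec):

    /* 4 dirty bits in PTE bits 10-13, 4 accessed bits in bits 14-17 */
    static inline int subregion(unsigned long vaddr)
    {
            return (vaddr >> 19) & 0x3;     /* which 512kB quarter of 2MB */
    }

    static inline bool sub_dirty(unsigned long pte, unsigned long vaddr)
    {
            return (pte >> (10 + subregion(vaddr))) & 1;
    }

    static inline bool sub_accessed(unsigned long pte, unsigned long vaddr)
    {
            return (pte >> (14 + subregion(vaddr))) & 1;
    }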

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
  2020-10-05 19:11                       ` Roman Gushchin
@ 2020-10-06  8:25                         ` David Hildenbrand
  0 siblings, 0 replies; 56+ messages in thread
From: David Hildenbrand @ 2020-10-06  8:25 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Zi Yan, Michal Hocko, linux-mm, Kirill A . Shutemov,
	Rik van Riel, Matthew Wilcox, Shakeel Butt, Yang Shi,
	Jason Gunthorpe, Mike Kravetz, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel

On 05.10.20 21:11, Roman Gushchin wrote:
> On Mon, Oct 05, 2020 at 08:33:44PM +0200, David Hildenbrand wrote:
>> On 05.10.20 20:25, Roman Gushchin wrote:
>>> On Mon, Oct 05, 2020 at 07:27:47PM +0200, David Hildenbrand wrote:
>>>> On 05.10.20 19:16, Roman Gushchin wrote:
>>>>> On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote:
>>>>>> On 2 Oct 2020, at 4:30, David Hildenbrand wrote:
>>>>>>
>>>>>>> On 02.10.20 10:10, Michal Hocko wrote:
>>>>>>>> On Fri 02-10-20 09:50:02, David Hildenbrand wrote:
>>>>>>>>>>>> - huge page sizes controllable by the userspace?
>>>>>>>>>>>
>>>>>>>>>>> It might be good to allow advanced users to choose the page sizes, so they
>>>>>>>>>>> have better control of their applications.
>>>>>>>>>>
>>>>>>>>>> Could you elaborate more? Those advanced users can use hugetlb, right?
>>>>>>>>>> They get a very good control over page size and pool preallocation etc.
>>>>>>>>>> So they can get what they need - assuming there is enough memory.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I am still not convinced that 1G THP (TGP :) ) are really what we want
>>>>>>>>> to support. I can understand that there are some use cases that might
>>>>>>>>> benefit from it, especially:
>>>>>>>>
>>>>>>>> Well, I would say that internal support for larger huge pages (e.g. 1GB)
>>>>>>>> that can transparently split under memory pressure is a useful
>>>>>>>> functionality. I cannot really judge how complex that would be
>>>>>>>
>>>>>>> Right, but that's then something different than serving (scarce,
>>>>>>> unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing
>>>>>>> wrong about *real* THP support, meaning, e.g., grouping consecutive
>>>>>>> pages and converting them back and forth on demand. (E.g., 1GB ->
>>>>>>> multiple 2MB -> multiple single pages), for example, when having to
>>>>>>> migrate such a gigantic page. But that's very different from our
>>>>>>> existing gigantic page code as far as I can tell.
>>>>>>
>>>>>> Serving 1GB PUD THPs from CMA is a compromise, since we do not want to
>>>>>> bump MAX_ORDER to 20 to enable 1GB page allocation in buddy allocator,
>>>>>> which needs section size increase. In addition, unmovable pages cannot
>>>>>> be allocated in CMA, so allocating 1GB pages has much higher chance from
>>>>>> it than from ZONE_NORMAL.
>>>>>
>>>>> s/higher chances/non-zero chances
>>>>
>>>> Well, the longer the system runs (and consumes a significant amount of
>>>> available main memory), the less likely it is.
>>>>
>>>>>
>>>>> Currently we have nothing that prevents the fragmentation of memory
>>>>> with unmovable pages at the 1GB scale. It means that in the common case
>>>>> it's highly unlikely to find a contiguous GB without any unmovable page.
>>>>> As of now, CMA seems to be the only working option.
>>>>>
>>>>
>>>> And I completely dislike the use of CMA in this context (for example,
>>>> allocating via CMA and freeing via the buddy by patching CMA when
>>>> splitting up PUDs ...).
>>>>
>>>>> However, it seems there are other use cases for the allocation of contiguous
>>>>> 1GB pages: e.g. secretfd ( https://lwn.net/Articles/831628/ ), where using
>>>>> 1GB pages can reduce the fragmentation of the direct mapping.
>>>>
>>>> Yes, see RFC v1 where I already cced Mike.
>>>>
>>>>>
>>>>> So I wonder if we need a new mechanism to avoid fragmentation at the 1GB/PUD scale,
>>>>> e.g. something like a second level of pageblocks. That would allow grouping
>>>>> all unmovable memory in a few 1GB blocks and leave more 1GB regions available for
>>>>> gigantic THPs and other use cases. I'm now looking into how it can be done.
>>>>
>>>> Anything bigger than sections is somewhat problematic: you have to track
>>>> that data somewhere. It cannot be the section (in contrast to pageblocks)
>>>
>>> Well, it's not a large amount of data: the number of 1GB regions is not that
>>> high even on very large machines.
>>
>> Yes, but then you can have very sparse systems. And some use cases would
>> actually want to avoid fragmentation on smaller levels (e.g., 128MB) -
>> optimizing memory efficiency by turning off banks and such ...
> 
> It's definitely a good question.

Oh, and I forgot that there might be users that want bigger granularity
:) (primarily, memory hotunplug that wants to avoid ZONE_MOVABLE but
still wants higher chances to eventually unplug some memory)

> 
>>>
>>>>
>>>>> If anybody has any ideas here, I'd appreciate it a lot.
>>>>
>>>> I already brought up the idea of ZONE_PREFER_MOVABLE (see RFC v1). That
>>>> somewhat mimics what CMA does (when sized reasonably), works well with
>>>> memory hot(un)plug, and is immune to misconfiguration. Within such a
>>>> zone, we can try to optimize the placement of larger blocks.
>>>
>>> Thank you for pointing at it!
>>>
>>> The main problem with it is the same as with ZONE_MOVABLE: it does require
>>> a boot-time educated guess on a good size. I admit that the CMA does too.
>>
>> "Educated guess" of ratios like 1:1. 1:2, and even 1:4 (known from
>> highmem times) ares usually perfectly fine. And if you mess up - in
>> comparison to CMA - you won't shoot yourself in the foot, you get less
>> gigantic pages - which is usually better than before. I consider that a
>> clear win. Perfect? No. Can we be perfect? unlikely.
> 
> I'm not necessarily opposing your idea, I just think it will be tricky
> not to introduce additional overhead if the ratio is not perfectly
> chosen. And there is simply a cost to adding a zone.

Not sure this will be really visible - and if your kernel requires more
than 20%..50% unmovable data, then something is usually really
fishy/special. The nice thing is that Linux will try to "auto-optimize"
within each zone already.

My gut feeling is that it's way easier to teach Linux (add zone, add
mmop_type, build zonelists, split memory similar to movablecore) -
however, that doesn't imply that it's better. We'll have to see.

> 
> But fundamentally we're speaking about the same thing: grouping pages
> by their movability on a smaller scale. With a new zone we'd split
> pages into two parts at a fixed border; with a new pageblock layer,
> in 1GB blocks.

I also discussed moving the border on demand, which is way more tricky
and would definitely be stuff for the future.

There are some papers about similar fragmentation-avoidance techniques,
mostly in the context of energy efficiency IIRC. Especially:
- PALLOC: https://ieeexplore.ieee.org/document/6925999
- Adaptive-buddy:
https://ieeexplore.ieee.org/document/7397629

IIRC, the problem with such approaches is that they are quite invasive
and degrade some workloads due to overhead.

> 
> I think the agreement is that we need such functionality.

Yeah, it's on my long todo list. I'll be prototyping ZONE_PREFER_MOVABLE
soon, to see how it looks/feels/performs.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
  2020-10-05 18:05                 ` Zi Yan
  2020-10-05 18:48                   ` David Hildenbrand
@ 2020-10-06 11:59                   ` Michal Hocko
  1 sibling, 0 replies; 56+ messages in thread
From: Michal Hocko @ 2020-10-06 11:59 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, linux-mm, Kirill A . Shutemov, Rik van Riel,
	Roman Gushchin, Matthew Wilcox, Shakeel Butt, Yang Shi,
	Jason Gunthorpe, Mike Kravetz, William Kucharski,
	Andrea Arcangeli, John Hubbard, David Nellans, linux-kernel

On Mon 05-10-20 14:05:17, Zi Yan wrote:
> On 5 Oct 2020, at 13:39, David Hildenbrand wrote:
> 
> >>>> considering that 2MB THPs have turned out to be quite a pain, but
> >>>> the situation has settled over time. Maybe our current code base is prepared
> >>>> for that much better.
> >>
> >> I am planning to refactor my code further to reduce the amount of
> >> the added code, since PUD THP is very similar to PMD THP. One thing
> >> I want to achieve is to enable split_huge_page to split any order of
> >> pages to a group of any lower order of pages. A lot of code in this
> >> patchset is replicating the same behavior of PMD THP at PUD level.
> >> It might be possible to deduplicate most of the code.
> >>
> >>>>
> >>>> Exposing that interface to the userspace is a different story of course.
> >>>> I do agree that we likely do not want to be very explicit about that.
> >>>> E.g. an interface for address space defragmentation without any more
> >>>> specifics sounds like a useful feature to me. It will be up to the
> >>>> kernel to decide which huge pages to use.
> >>>
> >>> Yes, I think one important feature would be that we don't end up placing
> >>> a gigantic page where only a handful of pages are actually populated
> >>> without green light from the application - because that's what some user
> >>> space applications care about (not consuming more memory than intended;
> >>> IIUC, this is also what this patch set does). I'm fine with placing
> >>> gigantic pages if it really just "defragments" the address space layout,
> >>> without filling unpopulated holes.
> >>>
> >>> Then, this would be mostly invisible to user space, and we really
> >>> wouldn't have to care about any configuration.
> >>
> >>
> >> I agree that the interface should be as simple as possible - no
> >> configuration - for most users. But I also wonder why we have
> >> hugetlbfs, which allows users to specify different page sizes; that
> >> seems to go against the discussion above. Are we assuming advanced
> >> users should always use hugetlbfs instead of THPs?
> >
> > Well, with hugetlbfs you get real control over which page sizes to use.
> > No mixture, guarantees.
> >
> > In some environments you might want to control which application gets
> > which pagesize. I know of database applications and hypervisors that
> > sometimes really want 2MB huge pages instead of 1GB huge pages. And
> > sometimes you really want/need 1GB huge pages (e.g., low-latency
> > applications, real-time KVM, ...).
> >
> > Simple example: KVM with postcopy live migration
> >
> > While 2MB huge pages work reasonably fine, migrating 1GB gigantic pages
> > on demand (via userfaultfd) is painfully slow / impractical.
> 
> 
> The real control of hugetlbfs comes from the interfaces provided by
> the kernel. If the kernel provided similar interfaces to control the
> page sizes of THPs, they should work the same as hugetlbfs. Mixing
> page sizes usually comes from system memory fragmentation; hugetlbfs
> avoids this mixture because of its special allocation pools, not
> because of the code itself.

Not really. Hugetlb is defined to provide consistent, single page size
access to the memory, to the degree that you fail early if you cannot
guarantee that. This is not an implementation detail. This is the
semantic of the feature. Control goes along with the interface.

> If THPs are allocated from the same pools, they would act
> the same as hugetlbfs. What am I missing here?

THPs are a completely different beast. They are aiming to be transparent
so that the user doesn't really have to control them explicitly[1]. They
should be dynamically created and demoted as the system manages
resources behind the user's back. In short, they optimize rather than
guarantee. This is also the reason why complete control sounds quite
alien to me. Say you explicitly ask for THP_SIZEFOO but the kernel
decides on a completely different size later on. What is the actual
contract you as a user are getting?

In an ideal world the kernel would pick the best large page
automagically. I am a bit skeptical we will reach such an enlightenment
soon (if ever), so a certain level of hinting is likely needed to
prevent a repeat of the 2MB THP fiasco [1]. But the control should
correspond to the functionality users are getting.

> I just do not get why hugetlbfs is so special that it can have
> fine-grained page size control when normal pages cannot. The “it
> should be invisible to userspace” argument suddenly does not hold for
> hugetlbfs.

In short it provides a guarantee.
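
To make the contrast concrete, a minimal userspace sketch (assuming a
1GB hugetlb pool was reserved at boot, e.g. hugepagesz=1G hugepages=2
on the kernel command line; MAP_HUGE_1GB is defined by hand in case the
libc headers lack it):

#include <sys/mman.h>
#include <stdio.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30U << MAP_HUGE_SHIFT)
#endif

int main(void)
{
	size_t sz = 1UL << 30;

	/* hugetlbfs: either a real 1GB page or MAP_FAILED - fail early. */
	void *h = mmap(NULL, sz, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
		       MAP_HUGE_1GB, -1, 0);
	if (h == MAP_FAILED)
		perror("no 1GB hugetlb page available");

	/*
	 * THP: a plain mapping plus a hint. The kernel may back it with
	 * huge pages of some size, or not, and may split them later -
	 * an optimization request, not a contract.
	 */
	void *t = mmap(NULL, sz, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (t != MAP_FAILED)
		madvise(t, sz, MADV_HUGEPAGE);

	return 0;
}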

Does the above clarify it a bit?


[1] this is not entirely true though, because there is a non-trivial
admin interface around THP. Mostly because they turned out to be too
transparent, and many people do care about internal fragmentation,
allocation latency, locality (a small page on a local node or a large
one on a slightly further node?) or simply follow a cargo cult - just
have a look at how many admin guides recommend disabling THPs. We got
seriously burned by 2MB THP because of the way they were enforced on
users.
-- 
Michal Hocko
SUSE Labs


end of thread, other threads:[~2020-10-06 11:59 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
2020-09-28 17:53 ` [RFC PATCH v2 01/30] mm/pagewalk: use READ_ONCE when reading the PUD entry unlocked Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 02/30] mm: pagewalk: use READ_ONCE when reading the PMD " Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 03/30] mm: thp: use single linked list for THP page table page deposit Zi Yan
2020-09-28 19:34   ` Matthew Wilcox
2020-09-28 20:34     ` Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 04/30] mm: add new helper functions to allocate one PMD page with 512 PTE pages Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 05/30] mm: thp: add page table deposit/withdraw functions for PUD THP Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 06/30] mm: change thp_order and thp_nr as we will have not just PMD THPs Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 07/30] mm: thp: add anonymous PUD THP page fault support without enabling it Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 08/30] mm: thp: add PUD THP support for copy_huge_pud Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 09/30] mm: thp: add PUD THP support to zap_huge_pud Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 10/30] fs: proc: add PUD THP kpageflag Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 11/30] mm: thp: handling PUD THP reference bit Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 12/30] mm: rmap: add mappped/unmapped page order to anonymous page rmap functions Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 13/30] mm: rmap: add map_order to page_remove_anon_compound_rmap Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 14/30] mm: thp: add PUD THP split_huge_pud_page() function Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 15/30] mm: thp: add PUD THP to deferred split list when PUD mapping is gone Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 16/30] mm: debug: adapt dump_page to PUD THP Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 17/30] mm: thp: PUD THP COW splits PUD page and falls back to PMD page Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 18/30] mm: thp: PUD THP follow_p*d_page() support Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 19/30] mm: stats: make smap stats understand PUD THPs Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 20/30] mm: page_vma_walk: teach it about PMD-mapped PUD THP Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 21/30] mm: thp: PUD THP support in try_to_unmap() Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 22/30] mm: thp: split PUD THPs at page reclaim Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 23/30] mm: support PUD THP pagemap support Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 24/30] mm: madvise: add page size options to MADV_HUGEPAGE and MADV_NOHUGEPAGE Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 25/30] mm: vma: add VM_HUGEPAGE_PUD to vm_flags at bit 37 Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 26/30] mm: thp: add a global knob to enable/disable PUD THPs Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 27/30] mm: thp: make PUD THP size public Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 28/30] hugetlb: cma: move cma reserve function to cma.c Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 29/30] mm: thp: use cma reservation for pud thp allocation Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 30/30] mm: thp: enable anonymous PUD THP at page fault path Zi Yan
2020-09-30 11:55 ` [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Michal Hocko
2020-10-01 15:14   ` Zi Yan
2020-10-02  7:32     ` Michal Hocko
2020-10-02  7:50       ` David Hildenbrand
2020-10-02  8:10         ` Michal Hocko
2020-10-02  8:30           ` David Hildenbrand
2020-10-05 15:03             ` Zi Yan
2020-10-05 15:55               ` Matthew Wilcox
2020-10-05 17:04                 ` Roman Gushchin
2020-10-05 19:12                 ` Zi Yan
2020-10-05 19:37                   ` Matthew Wilcox
2020-10-05 17:16               ` Roman Gushchin
2020-10-05 17:27                 ` David Hildenbrand
2020-10-05 18:25                   ` Roman Gushchin
2020-10-05 18:33                     ` David Hildenbrand
2020-10-05 19:11                       ` Roman Gushchin
2020-10-06  8:25                         ` David Hildenbrand
2020-10-05 17:39               ` David Hildenbrand
2020-10-05 18:05                 ` Zi Yan
2020-10-05 18:48                   ` David Hildenbrand
2020-10-06 11:59                   ` Michal Hocko
2020-10-05 15:34         ` Zi Yan
2020-10-05 17:30           ` David Hildenbrand
