linux-mm.kvack.org archive mirror
* [RFC PATCH 00/16] 1GB THP support on x86_64
@ 2020-09-02 18:06 Zi Yan
  2020-09-02 18:06 ` [RFC PATCH 01/16] mm: add pagechain container for storing multiple pages Zi Yan
                   ` (18 more replies)
  0 siblings, 19 replies; 82+ messages in thread
From: Zi Yan @ 2020-09-02 18:06 UTC (permalink / raw)
  To: linux-mm, Roman Gushchin
  Cc: Rik van Riel, Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Hi all,

This patchset adds support for 1GB THP on x86_64. It is on top of
v5.9-rc2-mmots-2020-08-25-21-13.

Compared to hugetlb, 1GB THP is a more flexible way of reducing translation
overhead and increasing the performance of applications with large memory
footprints, without requiring application changes.

Design
=======

The 1GB THP implementation looks similar to the existing THP code, except for
some new designs needed for the additional page table level.

1. Page table deposit and withdraw using a new pagechain data structure:
   instead of one PTE page table page, a 1GB THP requires 513 page table pages
   (one PMD page table page and 512 PTE page table pages) to be deposited
   at page allocation time, so that we can split the page later. Currently,
   the page table deposit uses ->lru, so only one page can be deposited.
   A new pagechain data structure is added to enable multi-page deposit
   (the arithmetic is sketched after this list).

2. Triple-mapped 1GB THP: a 1GB THP can be mapped by a combination of PUD, PMD,
   and PTE entries. Mixing PUD and PTE mappings can be achieved with the
   existing PageDoubleMap mechanism. To add PMD mappings, PMDPageInPUD and
   sub_compound_mapcount are introduced. PMDPageInPUD is the 512-aligned base
   page in a 1GB THP, and sub_compound_mapcount counts the PMD mappings by using
   page[N*512 + 3].compound_mapcount.

3. Using CMA allocation for 1GB THP: instead of bumping MAX_ORDER, it is more
   sane to use something less intrusive. So all 1GB THPs are allocated from
   reserved CMA areas shared with hugetlb. At page splitting time, the bitmap
   for the 1GB THP is cleared so that the resulting pages can be freed via the
   normal page free path. We can fall back to alloc_contig_pages for 1GB THP
   if necessary.
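
As a rough illustration of design 1, here is a minimal sketch (not part of the
patches; it assumes the x86_64 geometry used in this series, i.e. 4KB base
pages with 2MB PMD and 1GB PUD huge pages, and the macro names are made up
for the example):

	/* A PUD-sized THP covers 512 PMD ranges, each needing one PTE table. */
	#define PTE_TABLES_PER_PUD_THP	(1 << (HPAGE_PUD_ORDER - HPAGE_PMD_ORDER))	/* 512 */

	/*
	 * Deposited at 1GB THP allocation time so that a later split cannot
	 * fail on page table allocation:
	 * 1 PMD page table page + 512 PTE page table pages = 513 pages.
	 */
	#define PGTABLES_PER_PUD_THP	(1 + PTE_TABLES_PER_PUD_THP)

Design 3 presumably keeps the existing hugetlb_cma= boot-parameter syntax under
its new hugepage_cma= name (see patch 15), e.g. hugepage_cma=20G to reserve a
20GB CMA area shared by 1GB THP and hugetlb.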


Patch Organization
==================

Patch 01 adds the new pagechain data structure.

Patches 02 to 13 add 1GB THP support in various places.

Patch 14 tries to use alloc_contig_pages for 1GB THP allocation.

Patch 15 moves the hugetlb_cma reservation to cma.c and renames it to hugepage_cma.

Patch 16 uses the hugepage_cma reservation for 1GB THP allocation.


Any suggestions and comments are welcome.


Zi Yan (16):
  mm: add pagechain container for storing multiple pages.
  mm: thp: 1GB anonymous page implementation.
  mm: proc: add 1GB THP kpageflag.
  mm: thp: 1GB THP copy on write implementation.
  mm: thp: handling 1GB THP reference bit.
  mm: thp: add 1GB THP split_huge_pud_page() function.
  mm: stats: make smap stats understand PUD THPs.
  mm: page_vma_walk: teach it about PMD-mapped PUD THP.
  mm: thp: 1GB THP support in try_to_unmap().
  mm: thp: split 1GB THPs at page reclaim.
  mm: thp: 1GB THP follow_p*d_page() support.
  mm: support 1GB THP pagemap support.
  mm: thp: add a knob to enable/disable 1GB THPs.
  mm: page_alloc: >=MAX_ORDER pages allocation and deallocation.
  hugetlb: cma: move cma reserve function to cma.c.
  mm: thp: use cma reservation for pud thp allocation.

 .../admin-guide/kernel-parameters.txt         |   2 +-
 arch/arm64/mm/hugetlbpage.c                   |   2 +-
 arch/powerpc/mm/hugetlbpage.c                 |   2 +-
 arch/x86/include/asm/pgalloc.h                |  68 ++
 arch/x86/include/asm/pgtable.h                |  26 +
 arch/x86/kernel/setup.c                       |   8 +-
 arch/x86/mm/pgtable.c                         |  38 +
 drivers/base/node.c                           |   3 +
 fs/proc/meminfo.c                             |   2 +
 fs/proc/page.c                                |   2 +
 fs/proc/task_mmu.c                            | 122 ++-
 include/linux/cma.h                           |  18 +
 include/linux/huge_mm.h                       |  84 +-
 include/linux/hugetlb.h                       |  12 -
 include/linux/memcontrol.h                    |   5 +
 include/linux/mm.h                            |  29 +-
 include/linux/mm_types.h                      |   1 +
 include/linux/mmu_notifier.h                  |  13 +
 include/linux/mmzone.h                        |   1 +
 include/linux/page-flags.h                    |  47 +
 include/linux/pagechain.h                     |  73 ++
 include/linux/pgtable.h                       |  34 +
 include/linux/rmap.h                          |  10 +-
 include/linux/swap.h                          |   2 +
 include/linux/vm_event_item.h                 |   7 +
 include/uapi/linux/kernel-page-flags.h        |   2 +
 kernel/events/uprobes.c                       |   4 +-
 kernel/fork.c                                 |   5 +
 mm/cma.c                                      | 119 +++
 mm/gup.c                                      |  60 +-
 mm/huge_memory.c                              | 939 +++++++++++++++++-
 mm/hugetlb.c                                  | 114 +--
 mm/internal.h                                 |   2 +
 mm/khugepaged.c                               |   6 +-
 mm/ksm.c                                      |   4 +-
 mm/memcontrol.c                               |  13 +
 mm/memory.c                                   |  51 +-
 mm/mempolicy.c                                |  21 +-
 mm/migrate.c                                  |  12 +-
 mm/page_alloc.c                               |  57 +-
 mm/page_vma_mapped.c                          | 129 ++-
 mm/pgtable-generic.c                          |  56 ++
 mm/rmap.c                                     | 289 ++++--
 mm/swap.c                                     |  31 +
 mm/swap_slots.c                               |   2 +
 mm/swapfile.c                                 |   8 +-
 mm/userfaultfd.c                              |   2 +-
 mm/util.c                                     |  16 +-
 mm/vmscan.c                                   |  58 +-
 mm/vmstat.c                                   |   8 +
 50 files changed, 2270 insertions(+), 349 deletions(-)
 create mode 100644 include/linux/pagechain.h

--
2.28.0




* [RFC PATCH 01/16] mm: add pagechain container for storing multiple pages.
  2020-09-02 18:06 [RFC PATCH 00/16] 1GB THP support on x86_64 Zi Yan
@ 2020-09-02 18:06 ` Zi Yan
  2020-09-02 20:29   ` Randy Dunlap
                     ` (2 more replies)
  2020-09-02 18:06 ` [RFC PATCH 02/16] mm: thp: 1GB anonymous page implementation Zi Yan
                   ` (17 subsequent siblings)
  18 siblings, 3 replies; 82+ messages in thread
From: Zi Yan @ 2020-09-02 18:06 UTC (permalink / raw)
  To: linux-mm, Roman Gushchin
  Cc: Rik van Riel, Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel, Zi Yan

From: Zi Yan <ziy@nvidia.com>

When depositing page table pages for 1GB THPs, we need 512 PTE pages plus
1 PMD page. Instead of counting and depositing all 513 pages individually,
we can use the PMD page as a leader page and chain the remaining 512 PTE
pages through ->lru. This, however, prevents us from depositing PMD pages
via ->lru, which is how PTE pages are currently deposited for 2MB THPs. So
add a new pagechain container for PMD pages.
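
For illustration only (this is not part of the patch; mm->pud_huge_pte is the
per-mm list added later in the series, and error handling is elided), depositing
into and withdrawing from a list of pagechains is expected to look roughly like:

	struct pagechain *chain;

	/* deposit: reuse the newest chain if it has room, else start a new one */
	chain = list_first_entry_or_null(&mm->pud_huge_pte,
					 struct pagechain, list);
	if (!chain || !pagechain_space(chain)) {
		chain = pagechain_alloc();
		list_add(&chain->list, &mm->pud_huge_pte);
	}
	pagechain_deposit(chain, pte_page_table);

	/* withdraw: pops the most recently deposited page, NULL when empty */
	pte_page_table = pagechain_withdraw(chain);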

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/pagechain.h | 73 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)
 create mode 100644 include/linux/pagechain.h

diff --git a/include/linux/pagechain.h b/include/linux/pagechain.h
new file mode 100644
index 000000000000..be536142b413
--- /dev/null
+++ b/include/linux/pagechain.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * include/linux/pagechain.h
+ *
+ * In many places it is efficient to batch an operation up against multiple
+ * pages. A pagechain is a multipage container which is used for that.
+ */
+
+#ifndef _LINUX_PAGECHAIN_H
+#define _LINUX_PAGECHAIN_H
+
+#include <linux/slab.h>
+
+/* 13 page pointers plus the list_head and count keep struct pagechain at a power-of-two size */
+#define PAGECHAIN_SIZE	13
+
+struct page;
+
+struct pagechain {
+	struct list_head list;
+	unsigned int nr;
+	struct page *pages[PAGECHAIN_SIZE];
+};
+
+static inline void pagechain_init(struct pagechain *pchain)
+{
+	pchain->nr = 0;
+	INIT_LIST_HEAD(&pchain->list);
+}
+
+static inline void pagechain_reinit(struct pagechain *pchain)
+{
+	pchain->nr = 0;
+}
+
+static inline unsigned int pagechain_count(struct pagechain *pchain)
+{
+	return pchain->nr;
+}
+
+static inline unsigned int pagechain_space(struct pagechain *pchain)
+{
+	return PAGECHAIN_SIZE - pchain->nr;
+}
+
+static inline bool pagechain_empty(struct pagechain *pchain)
+{
+	return pchain->nr == 0;
+}
+
+/*
+ * Add a page to a pagechain.  Returns the number of slots still available.
+ */
+static inline unsigned int pagechain_deposit(struct pagechain *pchain, struct page *page)
+{
+	VM_BUG_ON(!pagechain_space(pchain));
+	pchain->pages[pchain->nr++] = page;
+	return pagechain_space(pchain);
+}
+
+static inline struct page *pagechain_withdraw(struct pagechain *pchain)
+{
+	if (!pagechain_count(pchain))
+		return NULL;
+	return pchain->pages[--pchain->nr];
+}
+
+void __init pagechain_cache_init(void);
+struct pagechain *pagechain_alloc(void);
+void pagechain_free(struct pagechain *pchain);
+
+#endif /* _LINUX_PAGECHAIN_H */
+
-- 
2.28.0




* [RFC PATCH 02/16] mm: thp: 1GB anonymous page implementation.
  2020-09-02 18:06 [RFC PATCH 00/16] 1GB THP support on x86_64 Zi Yan
  2020-09-02 18:06 ` [RFC PATCH 01/16] mm: add pagechain container for storing multiple pages Zi Yan
@ 2020-09-02 18:06 ` Zi Yan
  2020-09-02 18:06 ` [RFC PATCH 03/16] mm: proc: add 1GB THP kpageflag Zi Yan
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 82+ messages in thread
From: Zi Yan @ 2020-09-02 18:06 UTC (permalink / raw)
  To: linux-mm, Roman Gushchin
  Cc: Rik van Riel, Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel, Zi Yan

From: Zi Yan <ziy@nvidia.com>

This adds 1GB THP support for anonymous pages. Applications can get 1GB
pages during page faults when their VMAs are larger than 1GB. For
read-only faults, a single shared 1GB zero THP is used for all readers.
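
As a usage sketch (not part of the patch), a 1GB anonymous THP can be exercised
from userspace roughly like this, assuming THP is enabled and enough contiguous
memory is available; the result shows up in the AnonHugePUDPages counters
added below:

	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		/* Map 2GB so that at least one 1GB-aligned 1GB range fits. */
		size_t sz = 2UL << 30;
		char *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;
		/* The first write fault in a fully covered, 1GB-aligned range
		 * can be served by do_huge_pud_anonymous_page(). */
		memset(p, 1, sz);
		/* Inspect AnonHugePUDPages in /proc/meminfo while we sleep. */
		pause();
		return 0;
	}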

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 arch/x86/include/asm/pgalloc.h |  59 +++++++++++
 arch/x86/include/asm/pgtable.h |   2 +
 arch/x86/mm/pgtable.c          |  25 +++++
 drivers/base/node.c            |   3 +
 fs/proc/meminfo.c              |   2 +
 include/linux/huge_mm.h        |  13 ++-
 include/linux/mm.h             |   4 +
 include/linux/mm_types.h       |   1 +
 include/linux/mmzone.h         |   1 +
 include/linux/pgtable.h        |   3 +
 include/linux/vm_event_item.h  |   3 +
 kernel/fork.c                  |   5 +
 mm/huge_memory.c               | 188 +++++++++++++++++++++++++++++++--
 mm/memory.c                    |  29 ++++-
 mm/page_alloc.c                |   3 +-
 mm/pgtable-generic.c           |  45 ++++++++
 mm/rmap.c                      |  30 ++++--
 mm/vmstat.c                    |   4 +
 18 files changed, 396 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index 62ad61d6fefc..fae13467d3e1 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -52,6 +52,18 @@ extern pgd_t *pgd_alloc(struct mm_struct *);
 extern void pgd_free(struct mm_struct *mm, pgd_t *pgd);
 
 extern pgtable_t pte_alloc_one(struct mm_struct *);
+extern pgtable_t pte_alloc_order(struct mm_struct *, unsigned long, int);
+
+static inline void pte_free_order(struct mm_struct *mm, struct page *pte,
+		int order)
+{
+	int i;
+
+	for (i = 0; i < (1<<order); i++) {
+		pgtable_pte_page_dtor(&pte[i]);
+		__free_page(&pte[i]);
+	}
+}
 
 extern void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte);
 
@@ -87,6 +99,53 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
 #define pmd_pgtable(pmd) pmd_page(pmd)
 
 #if CONFIG_PGTABLE_LEVELS > 2
+static inline pmd_t *pmd_alloc_one_page_with_ptes(struct mm_struct *mm, unsigned long addr)
+{
+	pgtable_t pte_pgtables;
+	pmd_t *pmd;
+	spinlock_t *pmd_ptl;
+	int i;
+
+	pte_pgtables = pte_alloc_order(mm, addr,
+		HPAGE_PUD_ORDER - HPAGE_PMD_ORDER);
+	if (!pte_pgtables)
+		return NULL;
+
+	pmd = pmd_alloc_one(mm, addr);
+	if (unlikely(!pmd)) {
+		pte_free_order(mm, pte_pgtables,
+			HPAGE_PUD_ORDER - HPAGE_PMD_ORDER);
+		return NULL;
+	}
+	pmd_ptl = pmd_lock(mm, pmd);
+
+	for (i = 0; i < (1<<(HPAGE_PUD_ORDER - HPAGE_PMD_ORDER)); i++)
+		pgtable_trans_huge_deposit(mm, pmd, pte_pgtables + i);
+
+	spin_unlock(pmd_ptl);
+
+	return pmd;
+}
+
+static inline void pmd_free_page_with_ptes(struct mm_struct *mm, pmd_t *pmd)
+{
+	spinlock_t *pmd_ptl;
+	int i;
+
+	BUG_ON((unsigned long)pmd & (PAGE_SIZE-1));
+	pmd_ptl = pmd_lock(mm, pmd);
+
+	for (i = 0; i < (1<<(HPAGE_PUD_ORDER - HPAGE_PMD_ORDER)); i++) {
+		pgtable_t pte_pgtable;
+
+		pte_pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+		pte_free(mm, pte_pgtable);
+	}
+
+	spin_unlock(pmd_ptl);
+	pmd_free(mm, pmd);
+}
+
 extern void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd);
 
 static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd,
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5e0dcc20614d..26255cac78c0 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1141,6 +1141,8 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm, unsigned long
 	return native_pmdp_get_and_clear(pmdp);
 }
 
+#define mk_pud(page, pgprot)   pfn_pud(page_to_pfn(page), (pgprot))
+
 #define __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR
 static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
 					unsigned long addr, pud_t *pudp)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index dfd82f51ba66..7be73aee6183 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -33,6 +33,31 @@ pgtable_t pte_alloc_one(struct mm_struct *mm)
 	return __pte_alloc_one(mm, __userpte_alloc_gfp);
 }
 
+pgtable_t pte_alloc_order(struct mm_struct *mm, unsigned long address, int order)
+{
+	struct page *pte;
+	int i;
+
+	pte = alloc_pages(__userpte_alloc_gfp, order);
+	if (!pte)
+		return NULL;
+	split_page(pte, order);
+	for (i = 1; i < (1 << order); i++)
+		set_page_private(pte + i, 0);
+
+	for (i = 0; i < (1<<order); i++) {
+		if (!pgtable_pte_page_ctor(&pte[i])) {
+			__free_page(&pte[i]);
+			while (--i >= 0) {
+				pgtable_pte_page_dtor(&pte[i]);
+				__free_page(&pte[i]);
+			}
+			return NULL;
+		}
+	}
+	return pte;
+}
+
 static int __init setup_userpte(char *arg)
 {
 	if (!arg)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 508b80f6329b..f11b4d88911c 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -428,6 +428,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       "Node %d SUnreclaim:     %8lu kB\n"
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 		       "Node %d AnonHugePages:  %8lu kB\n"
+		       "Node %d AnonHugePUDPages: %8lu kB\n"
 		       "Node %d ShmemHugePages: %8lu kB\n"
 		       "Node %d ShmemPmdMapped: %8lu kB\n"
 		       "Node %d FileHugePages: %8lu kB\n"
@@ -457,6 +458,8 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       ,
 		       nid, K(node_page_state(pgdat, NR_ANON_THPS) *
 				       HPAGE_PMD_NR),
+			   nid, K(node_page_state(pgdat, NR_ANON_THPS_PUD) *
+				       HPAGE_PUD_NR),
 		       nid, K(node_page_state(pgdat, NR_SHMEM_THPS) *
 				       HPAGE_PMD_NR),
 		       nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED) *
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 887a5532e449..b60e0c241015 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -130,6 +130,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	show_val_kb(m, "AnonHugePages:  ",
 		    global_node_page_state(NR_ANON_THPS) * HPAGE_PMD_NR);
+	show_val_kb(m, "AnonHugePUDPages:  ",
+			global_node_page_state(NR_ANON_THPS_PUD) * HPAGE_PUD_NR);
 	show_val_kb(m, "ShmemHugePages: ",
 		    global_node_page_state(NR_SHMEM_THPS) * HPAGE_PMD_NR);
 	show_val_kb(m, "ShmemPmdMapped: ",
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 8a8bc46a2432..7528652400e4 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -18,10 +18,15 @@ extern int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 extern void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud);
+extern int do_huge_pud_anonymous_page(struct vm_fault *vmf);
 #else
 static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
 {
 }
+static inline int do_huge_pud_anonymous_page(struct vm_fault *vmf)
+{
+	return VM_FAULT_FALLBACK;
+}
 #endif
 
 extern vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd);
@@ -115,6 +120,9 @@ extern struct kobj_attribute shmem_enabled_attr;
 #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
 #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
 
+#define HPAGE_PUD_ORDER (HPAGE_PUD_SHIFT-PAGE_SHIFT)
+#define HPAGE_PUD_NR (1<<HPAGE_PUD_ORDER)
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define HPAGE_PMD_SHIFT PMD_SHIFT
 #define HPAGE_PMD_SIZE	((1UL) << HPAGE_PMD_SHIFT)
@@ -276,7 +284,7 @@ static inline unsigned int thp_order(struct page *page)
 {
 	VM_BUG_ON_PGFLAGS(PageTail(page), page);
 	if (PageHead(page))
-		return HPAGE_PMD_ORDER;
+		return page[1].compound_order;
 	return 0;
 }
 
@@ -288,7 +296,7 @@ static inline int thp_nr_pages(struct page *page)
 {
 	VM_BUG_ON_PGFLAGS(PageTail(page), page);
 	if (PageHead(page))
-		return HPAGE_PMD_NR;
+		return (1<<page[1].compound_order);
 	return 1;
 }
 
@@ -320,6 +328,7 @@ struct page *mm_get_huge_zero_page(struct mm_struct *mm);
 void mm_put_huge_zero_page(struct mm_struct *mm);
 
 #define mk_huge_pmd(page, prot) pmd_mkhuge(mk_pmd(page, prot))
+#define mk_huge_pud(page, prot) pud_mkhuge(mk_pud(page, prot))
 
 static inline bool thp_migration_supported(void)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3a4f099fb1b..cb1ccf804404 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -31,6 +31,7 @@
 #include <linux/sizes.h>
 #include <linux/sched.h>
 #include <linux/pgtable.h>
+#include <linux/pagechain.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -2184,6 +2185,7 @@ static inline void pgtable_init(void)
 {
 	ptlock_cache_init();
 	pgtable_cache_init();
+	pagechain_cache_init();
 }
 
 static inline bool pgtable_pte_page_ctor(struct page *page)
@@ -2316,6 +2318,8 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
 	return ptl;
 }
 
+#define pud_huge_pte(mm, pud) ((mm)->pud_huge_pte)
+
 extern void __init pagecache_init(void);
 extern void __init free_area_init_memoryless_node(int nid);
 extern void free_initmem(void);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 496c3ff97cce..4c1839366af4 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -513,6 +513,7 @@ struct mm_struct {
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 		pgtable_t pmd_huge_pte; /* protected by page_table_lock */
 #endif
+		struct list_head pud_huge_pte; /* protected by page_table_lock */
 #ifdef CONFIG_NUMA_BALANCING
 		/*
 		 * numa_next_scan is the next time that the PTEs will be marked
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0a404552ecc1..3a8f54a2c5a7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -196,6 +196,7 @@ enum node_stat_item {
 	NR_FILE_THPS,
 	NR_FILE_PMDMAPPED,
 	NR_ANON_THPS,
+	NR_ANON_THPS_PUD,
 	NR_VMSCAN_WRITE,
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
 	NR_DIRTIED,		/* page dirtyings since bootup */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e8cbc2e795d5..255275d5b73e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -462,10 +462,13 @@ static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
 #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
 extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 				       pgtable_t pgtable);
+extern void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
+				       pgtable_t pgtable);
 #endif
 
 #ifndef __HAVE_ARCH_PGTABLE_WITHDRAW
 extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
+extern pgtable_t pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp);
 #endif
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 2e6ca53b9bbd..a3f1093a55bb 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -92,6 +92,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_DEFERRED_SPLIT_PAGE,
 		THP_SPLIT_PMD,
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+		THP_FAULT_ALLOC_PUD,
+		THP_FAULT_FALLBACK_PUD,
+		THP_FAULT_FALLBACK_PUD_CHARGE,
 		THP_SPLIT_PUD,
 #endif
 		THP_ZERO_PAGE_ALLOC,
diff --git a/kernel/fork.c b/kernel/fork.c
index 3f281814a3d3..842fdc4ae5fc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -663,6 +663,10 @@ static void check_mm(struct mm_struct *mm)
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	VM_BUG_ON_MM(mm->pmd_huge_pte, mm);
 #endif
+	VM_BUG_ON_MM(!list_empty(&mm->pud_huge_pte) &&
+				 !pagechain_empty(list_first_entry(&mm->pud_huge_pte,
+					struct pagechain, list)),
+				mm);
 }
 
 #define allocate_mm()	(kmem_cache_alloc(mm_cachep, GFP_KERNEL))
@@ -1023,6 +1027,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	mm->pmd_huge_pte = NULL;
 #endif
+	INIT_LIST_HEAD(&mm->pud_huge_pte);
 	mm_init_uprobes_state(mm);
 
 	if (current->mm) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 90733cefa528..ec3847392208 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -933,6 +933,112 @@ vm_fault_t vmf_insert_pfn_pud_prot(struct vm_fault *vmf, pfn_t pfn,
 	return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(vmf_insert_pfn_pud_prot);
+
+static int __do_huge_pud_anonymous_page(struct vm_fault *vmf, struct page *page,
+		gfp_t gfp)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	pmd_t *pmd_pgtable;
+	unsigned long haddr = vmf->address & HPAGE_PUD_MASK;
+	int ret = 0;
+
+	VM_BUG_ON_PAGE(!PageCompound(page), page);
+
+	if (mem_cgroup_charge(page, vma->vm_mm, gfp)) {
+		put_page(page);
+		count_vm_event(THP_FAULT_FALLBACK_PUD);
+		count_vm_event(THP_FAULT_FALLBACK_CHARGE);
+		return VM_FAULT_FALLBACK;
+	}
+	cgroup_throttle_swaprate(page, gfp);
+
+	pmd_pgtable = pmd_alloc_one_page_with_ptes(vma->vm_mm, haddr);
+	if (unlikely(!pmd_pgtable)) {
+		ret = VM_FAULT_OOM;
+		goto release;
+	}
+
+	clear_huge_page(page, vmf->address, HPAGE_PUD_NR);
+	/*
+	 * The memory barrier inside __SetPageUptodate makes sure that
+	 * clear_huge_page writes become visible before the set_pmd_at()
+	 * write.
+	 */
+	__SetPageUptodate(page);
+
+	vmf->ptl = pud_lock(vma->vm_mm, vmf->pud);
+	if (unlikely(!pud_none(*vmf->pud))) {
+		goto unlock_release;
+	} else {
+		pud_t entry;
+		int i;
+
+		ret = check_stable_address_space(vma->vm_mm);
+		if (ret)
+			goto unlock_release;
+
+		/* Deliver the page fault to userland */
+		if (userfaultfd_missing(vma)) {
+			vm_fault_t ret2;
+
+			spin_unlock(vmf->ptl);
+			put_page(page);
+			pmd_free_page_with_ptes(vma->vm_mm, pmd_pgtable);
+			ret2 = handle_userfault(vmf, VM_UFFD_MISSING);
+			VM_BUG_ON(ret2 & VM_FAULT_FALLBACK);
+			return ret2;
+		}
+
+		entry = mk_huge_pud(page, vma->vm_page_prot);
+		entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
+		page_add_new_anon_rmap(page, vma, haddr, true);
+		lru_cache_add_inactive_or_unevictable(page, vma);
+		pgtable_trans_huge_pud_deposit(vma->vm_mm, vmf->pud,
+				virt_to_page(pmd_pgtable));
+		set_pud_at(vma->vm_mm, haddr, vmf->pud, entry);
+		add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PUD_NR);
+		mm_inc_nr_pmds(vma->vm_mm);
+		for (i = 0; i < (1<<(HPAGE_PUD_ORDER - HPAGE_PMD_ORDER)); i++)
+			mm_inc_nr_ptes(vma->vm_mm);
+		spin_unlock(vmf->ptl);
+		count_vm_event(THP_FAULT_ALLOC_PUD);
+	}
+
+	return 0;
+unlock_release:
+	spin_unlock(vmf->ptl);
+release:
+	if (pmd_pgtable)
+		pmd_free_page_with_ptes(vma->vm_mm, pmd_pgtable);
+	put_page(page);
+	return ret;
+
+}
+
+int do_huge_pud_anonymous_page(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	gfp_t gfp;
+	struct page *page;
+	unsigned long haddr = vmf->address & HPAGE_PUD_MASK;
+
+	if (haddr < vma->vm_start || haddr + HPAGE_PUD_SIZE > vma->vm_end)
+		return VM_FAULT_FALLBACK;
+	if (unlikely(anon_vma_prepare(vma)))
+		return VM_FAULT_OOM;
+	if (unlikely(khugepaged_enter(vma, vma->vm_flags)))
+		return VM_FAULT_OOM;
+
+	gfp = alloc_hugepage_direct_gfpmask(vma);
+	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PUD_ORDER);
+	if (unlikely(!page)) {
+		count_vm_event(THP_FAULT_FALLBACK_PUD);
+		return VM_FAULT_FALLBACK;
+	}
+	prep_transhuge_page(page);
+	return __do_huge_pud_anonymous_page(vmf, page, gfp);
+}
+
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
 static void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
@@ -1159,7 +1265,12 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 {
 	spinlock_t *dst_ptl, *src_ptl;
 	pud_t pud;
-	int ret;
+	pmd_t *pmd_pgtable = NULL;
+	int ret = -ENOMEM;
+
+	pmd_pgtable = pmd_alloc_one_page_with_ptes(vma->vm_mm, addr);
+	if (unlikely(!pmd_pgtable))
+		goto out;
 
 	dst_ptl = pud_lock(dst_mm, dst_pud);
 	src_ptl = pud_lockptr(src_mm, src_pud);
@@ -1167,16 +1278,28 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 
 	ret = -EAGAIN;
 	pud = *src_pud;
+
+	/* only transparent huge pud page needs extra page table pages for
+	 * possible huge page split */
+	if (!pud_trans_huge(pud))
+		pmd_free_page_with_ptes(dst_mm, pmd_pgtable);
+
 	if (unlikely(!pud_trans_huge(pud) && !pud_devmap(pud)))
 		goto out_unlock;
 
-	/*
-	 * When page table lock is held, the huge zero pud should not be
-	 * under splitting since we don't split the page itself, only pud to
-	 * a page table.
-	 */
-	if (is_huge_zero_pud(pud)) {
-		/* No huge zero pud yet */
+	if (pud_trans_huge(pud)) {
+		struct page *src_page;
+		int i;
+
+		src_page = pud_page(pud);
+		VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
+		get_page(src_page);
+		page_dup_rmap(src_page, true);
+		add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PUD_NR);
+		mm_inc_nr_pmds(dst_mm);
+		for (i = 0; i < (1<<(HPAGE_PUD_ORDER - HPAGE_PMD_ORDER)); i++)
+			mm_inc_nr_ptes(dst_mm);
+		pgtable_trans_huge_pud_deposit(dst_mm, dst_pud, virt_to_page(pmd_pgtable));
 	}
 
 	pudp_set_wrprotect(src_mm, addr, src_pud);
@@ -1187,6 +1310,7 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 out_unlock:
 	spin_unlock(src_ptl);
 	spin_unlock(dst_ptl);
+out:
 	return ret;
 }
 
@@ -1887,11 +2011,27 @@ spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma)
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+static inline void zap_pud_deposited_table(struct mm_struct *mm, pud_t *pud)
+{
+	pgtable_t pgtable;
+	int i;
+
+	pgtable = pgtable_trans_huge_pud_withdraw(mm, pud);
+	pmd_free_page_with_ptes(mm, (pmd_t *)page_address(pgtable));
+
+	mm_dec_nr_pmds(mm);
+	for (i = 0; i < (1<<(HPAGE_PUD_ORDER - HPAGE_PMD_ORDER)); i++)
+		mm_dec_nr_ptes(mm);
+}
+
 int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pud_t *pud, unsigned long addr)
 {
+	pud_t orig_pud;
 	spinlock_t *ptl;
 
+	tlb_change_page_size(tlb, HPAGE_PUD_SIZE);
+
 	ptl = __pud_trans_huge_lock(pud, vma);
 	if (!ptl)
 		return 0;
@@ -1901,14 +2041,40 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	 * pgtable_trans_huge_withdraw after finishing pudp related
 	 * operations.
 	 */
-	pudp_huge_get_and_clear_full(tlb->mm, addr, pud, tlb->fullmm);
+	orig_pud = pudp_huge_get_and_clear_full(tlb->mm, addr, pud,
+			tlb->fullmm);
 	tlb_remove_pud_tlb_entry(tlb, pud, addr);
 	if (vma_is_special_huge(vma)) {
 		spin_unlock(ptl);
 		/* No zero page support yet */
+	} else if (is_huge_zero_pud(orig_pud)) {
+		zap_pud_deposited_table(tlb->mm, pud);
+		spin_unlock(ptl);
+		tlb_remove_page_size(tlb, pud_page(orig_pud), HPAGE_PUD_SIZE);
 	} else {
-		/* No support for anonymous PUD pages yet */
-		BUG();
+		struct page *page = NULL;
+		int flush_needed = 1;
+
+		if (pud_present(orig_pud)) {
+			page = pud_page(orig_pud);
+			page_remove_rmap(page, true);
+			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
+			VM_BUG_ON_PAGE(!PageHead(page), page);
+		} else
+			WARN_ONCE(1, "Non present huge pud without pud migration enabled!");
+
+		if (PageAnon(page)) {
+			zap_pud_deposited_table(tlb->mm, pud);
+			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PUD_NR);
+		} else {
+			if (arch_needs_pgtable_deposit())
+				zap_pud_deposited_table(tlb->mm, pud);
+			add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PUD_NR);
+		}
+
+		spin_unlock(ptl);
+		if (flush_needed)
+			tlb_remove_page_size(tlb, page, HPAGE_PUD_SIZE);
 	}
 	return 1;
 }
diff --git a/mm/memory.c b/mm/memory.c
index fb5463153351..6f86294438fd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4147,14 +4147,13 @@ static vm_fault_t create_huge_pud(struct vm_fault *vmf)
 	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
 	/* No support for anonymous transparent PUD pages yet */
 	if (vma_is_anonymous(vmf->vma))
-		goto split;
+		return do_huge_pud_anonymous_page(vmf);
 	if (vmf->vma->vm_ops->huge_fault) {
 		vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD);
 
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
 	}
-split:
 	/* COW or write-notify not handled on PUD level: split pud.*/
 	__split_huge_pud(vmf->vma, vmf->pud, vmf->address);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
@@ -5098,3 +5097,29 @@ void ptlock_free(struct page *page)
 	kmem_cache_free(page_ptl_cachep, page->ptl);
 }
 #endif
+
+static struct kmem_cache *pagechain_cachep;
+
+void __init pagechain_cache_init(void)
+{
+	pagechain_cachep = kmem_cache_create("pagechain",
+		sizeof(struct pagechain), 0, SLAB_PANIC, NULL);
+}
+
+struct pagechain *pagechain_alloc(void)
+{
+	struct pagechain *chain;
+
+	chain = kmem_cache_alloc(pagechain_cachep, GFP_ATOMIC);
+
+	if (!chain)
+		return NULL;
+
+	pagechain_init(chain);
+	return chain;
+}
+
+void pagechain_free(struct pagechain *pchain)
+{
+	kmem_cache_free(pagechain_cachep, pchain);
+}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0d9f9bd0e06c..763acbed66f1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5443,7 +5443,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			K(node_page_state(pgdat, NR_SHMEM_THPS) * HPAGE_PMD_NR),
 			K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)
 					* HPAGE_PMD_NR),
-			K(node_page_state(pgdat, NR_ANON_THPS) * HPAGE_PMD_NR),
+			K(node_page_state(pgdat, NR_ANON_THPS) * HPAGE_PMD_NR +
+			  node_page_state(pgdat, NR_ANON_THPS_PUD) * HPAGE_PUD_NR),
 #endif
 			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
 			node_page_state(pgdat, NR_KERNEL_STACK_KB),
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 9578db83e312..ef218b0f5d74 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -10,6 +10,7 @@
 #include <linux/pagemap.h>
 #include <linux/hugetlb.h>
 #include <linux/pgtable.h>
+#include <linux/pagechain.h>
 #include <asm/tlb.h>
 
 /*
@@ -170,6 +171,23 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 		list_add(&pgtable->lru, &pmd_huge_pte(mm, pmdp)->lru);
 	pmd_huge_pte(mm, pmdp) = pgtable;
 }
+
+void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
+				pgtable_t pgtable)
+{
+	struct pagechain *chain = NULL;
+
+	assert_spin_locked(pud_lockptr(mm, pudp));
+	/* FIFO */
+	chain = list_first_entry_or_null(&pud_huge_pte(mm, pudp),
+			struct pagechain, list);
+
+	if (!chain || !pagechain_space(chain)) {
+		chain = pagechain_alloc();
+		list_add(&chain->list, &pud_huge_pte(mm, pudp));
+	}
+	pagechain_deposit(chain, pgtable);
+}
 #endif
 
 #ifndef __HAVE_ARCH_PGTABLE_WITHDRAW
@@ -188,6 +206,33 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 		list_del(&pgtable->lru);
 	return pgtable;
 }
+
+pgtable_t pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp)
+{
+	pgtable_t pgtable;
+	struct pagechain *chain = NULL;
+
+	assert_spin_locked(pud_lockptr(mm, pudp));
+
+	/* FIFO */
+retry:
+	chain = list_first_entry_or_null(&pud_huge_pte(mm, pudp),
+			struct pagechain, list);
+
+	if (!chain)
+		return NULL;
+
+	if (pagechain_empty(chain)) {
+		if (list_is_singular(&chain->list))
+			return NULL;
+		list_del(&chain->list);
+		pagechain_free(chain);
+		goto retry;
+	}
+
+	pgtable = pagechain_withdraw(chain);
+	return pgtable;
+}
 #endif
 
 #ifndef __HAVE_ARCH_PMDP_INVALIDATE
diff --git a/mm/rmap.c b/mm/rmap.c
index 9425260774a1..10195a2421cf 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -726,6 +726,7 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
+	pud_t pude;
 	pmd_t *pmd = NULL;
 	pmd_t pmde;
 
@@ -738,7 +739,10 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
 		goto out;
 
 	pud = pud_offset(p4d, address);
-	if (!pud_present(*pud))
+
+	pude = *pud;
+	barrier();
+	if (!pud_present(pude) || pud_trans_huge(pude))
 		goto out;
 
 	pmd = pmd_offset(pud, address);
@@ -1033,7 +1037,7 @@ void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma)
  * __page_set_anon_rmap - set up new anonymous rmap
  * @page:	Page or Hugepage to add to rmap
  * @vma:	VM area to add page to.
- * @address:	User virtual address of the mapping	
+ * @address:	User virtual address of the mapping
  * @exclusive:	the page is exclusively owned by the current process
  */
 static void __page_set_anon_rmap(struct page *page,
@@ -1137,8 +1141,12 @@ void do_page_add_anon_rmap(struct page *page,
 		 * pte lock(a spinlock) is held, which implies preemption
 		 * disabled.
 		 */
-		if (compound)
-			__inc_lruvec_page_state(page, NR_ANON_THPS);
+		if (compound) {
+			if (nr == HPAGE_PMD_NR)
+				__inc_lruvec_page_state(page, NR_ANON_THPS);
+			else
+				__inc_lruvec_page_state(page, NR_ANON_THPS_PUD);
+		}
 		__mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
 	}
 
@@ -1180,7 +1188,10 @@ void page_add_new_anon_rmap(struct page *page,
 		if (hpage_pincount_available(page))
 			atomic_set(compound_pincount_ptr(page), 0);
 
-		__inc_lruvec_page_state(page, NR_ANON_THPS);
+		if (nr == HPAGE_PMD_NR)
+			__inc_lruvec_page_state(page, NR_ANON_THPS);
+		else
+			__inc_lruvec_page_state(page, NR_ANON_THPS_PUD);
 	} else {
 		/* Anon THP always mapped first with PMD */
 		VM_BUG_ON_PAGE(PageTransCompound(page), page);
@@ -1286,14 +1297,17 @@ static void page_remove_anon_compound_rmap(struct page *page)
 	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
 		return;
 
-	__dec_lruvec_page_state(page, NR_ANON_THPS);
+	if (thp_nr_pages(page) == HPAGE_PMD_NR)
+		__dec_lruvec_page_state(page, NR_ANON_THPS);
+	else
+		__dec_lruvec_page_state(page, NR_ANON_THPS_PUD);
 
 	if (TestClearPageDoubleMap(page)) {
 		/*
 		 * Subpages can be mapped with PTEs too. Check how many of
 		 * them are still mapped.
 		 */
-		for (i = 0, nr = 0; i < HPAGE_PMD_NR; i++) {
+		for (i = 0, nr = 0; i < thp_nr_pages(page); i++) {
 			if (atomic_add_negative(-1, &page[i]._mapcount))
 				nr++;
 		}
@@ -1306,7 +1320,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
 		if (nr && nr < HPAGE_PMD_NR)
 			deferred_split_huge_page(page);
 	} else {
-		nr = HPAGE_PMD_NR;
+		nr = thp_nr_pages(page);
 	}
 
 	if (unlikely(PageMlocked(page)))
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 06fd13ebc2b8..3a01212b652c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1209,6 +1209,7 @@ const char * const vmstat_text[] = {
 	"nr_file_hugepages",
 	"nr_file_pmdmapped",
 	"nr_anon_transparent_hugepages",
+	"nr_anon_transparent_pud_hugepages",
 	"nr_vmscan_write",
 	"nr_vmscan_immediate_reclaim",
 	"nr_dirtied",
@@ -1325,6 +1326,9 @@ const char * const vmstat_text[] = {
 	"thp_deferred_split_page",
 	"thp_split_pmd",
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+	"thp_fault_alloc_pud",
+	"thp_fault_fallback_pud",
+	"thp_fault_fallback_pud_charge",
 	"thp_split_pud",
 #endif
 	"thp_zero_page_alloc",
-- 
2.28.0




* [RFC PATCH 03/16] mm: proc: add 1GB THP kpageflag.
  2020-09-02 18:06 [RFC PATCH 00/16] 1GB THP support on x86_64 Zi Yan
  2020-09-02 18:06 ` [RFC PATCH 01/16] mm: add pagechain container for storing multiple pages Zi Yan
  2020-09-02 18:06 ` [RFC PATCH 02/16] mm: thp: 1GB anonymous page implementation Zi Yan
@ 2020-09-02 18:06 ` Zi Yan
  2020-09-09 13:46   ` Kirill A. Shutemov
  2020-09-02 18:06 ` [RFC PATCH 04/16] mm: thp: 1GB THP copy on write implementation Zi Yan
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 82+ messages in thread
From: Zi Yan @ 2020-09-02 18:06 UTC (permalink / raw)
  To: linux-mm, Roman Gushchin
  Cc: Rik van Riel, Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Bit 27 of /proc/kpageflags is used to identify 1GB THPs.
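
For reference, a minimal userspace sketch (not part of the patch) that checks
the new bit for a given PFN via /proc/kpageflags:

	#include <fcntl.h>
	#include <stdint.h>
	#include <unistd.h>

	/* Returns 1 if @pfn is part of a 1GB THP, 0 if not, -1 on error. */
	static int pfn_is_pud_thp(uint64_t pfn)
	{
		uint64_t flags;
		int ret = -1;
		int fd = open("/proc/kpageflags", O_RDONLY);

		if (fd < 0)
			return -1;
		/* one 64-bit flags word per PFN */
		if (pread(fd, &flags, sizeof(flags), pfn * sizeof(flags)) == sizeof(flags))
			ret = (int)((flags >> 27) & 1);	/* bit 27 == KPF_PUD_THP */
		close(fd);
		return ret;
	}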

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 fs/proc/page.c                         | 2 ++
 include/uapi/linux/kernel-page-flags.h | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/fs/proc/page.c b/fs/proc/page.c
index f3b39a7d2bf3..e4e2ad3612c9 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -161,6 +161,8 @@ u64 stable_page_flags(struct page *page)
 			u |= BIT_ULL(KPF_ZERO_PAGE);
 			u |= BIT_ULL(KPF_THP);
 		}
+		if (compound_order(head) == HPAGE_PUD_ORDER)
+			u |= BIT_ULL(KPF_PUD_THP);
 	} else if (is_zero_pfn(page_to_pfn(page)))
 		u |= BIT_ULL(KPF_ZERO_PAGE);
 
diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
index 6f2f2720f3ac..cdeb33ab655c 100644
--- a/include/uapi/linux/kernel-page-flags.h
+++ b/include/uapi/linux/kernel-page-flags.h
@@ -36,5 +36,7 @@
 #define KPF_ZERO_PAGE		24
 #define KPF_IDLE		25
 #define KPF_PGTABLE		26
+#define KPF_PUD_THP		27
+
 
 #endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */
-- 
2.28.0




* [RFC PATCH 04/16] mm: thp: 1GB THP copy on write implementation.
  2020-09-02 18:06 [RFC PATCH 00/16] 1GB THP support on x86_64 Zi Yan
                   ` (2 preceding siblings ...)
  2020-09-02 18:06 ` [RFC PATCH 03/16] mm: proc: add 1GB THP kpageflag Zi Yan
@ 2020-09-02 18:06 ` Zi Yan
  2020-09-02 18:06 ` [RFC PATCH 05/16] mm: thp: handling 1GB THP reference bit Zi Yan
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 82+ messages in thread
From: Zi Yan @ 2020-09-02 18:06 UTC (permalink / raw)
  To: linux-mm, Roman Gushchin
  Cc: Rik van Riel, Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel, Zi Yan

From: Zi Yan <ziy@nvidia.com>

COW on 1GB THPs will fall back to 2MB THPs if a 1GB THP is not available.
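
For illustration (not part of the patch), the new write-protection fault path
can be triggered from userspace roughly like this: parent and child share the
1GB THP mapping after fork(), so the child's write takes the wp fault handled
by do_huge_pud_wp_page():

	#include <string.h>
	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <unistd.h>

	int main(void)
	{
		size_t sz = 2UL << 30;
		char *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;
		memset(p, 1, sz);		/* populate, ideally as 1GB THPs */
		if (fork() == 0) {
			p[sz / 2] = 2;		/* COW write on the shared PUD mapping */
			_exit(0);
		}
		wait(NULL);
		return 0;
	}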

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 arch/x86/include/asm/pgalloc.h |  9 ++++++
 include/linux/huge_mm.h        |  5 ++++
 mm/huge_memory.c               | 54 ++++++++++++++++++++++++++++++++++
 mm/memory.c                    |  2 +-
 mm/swapfile.c                  |  4 ++-
 5 files changed, 72 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index fae13467d3e1..31221269c387 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -98,6 +98,15 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
 
 #define pmd_pgtable(pmd) pmd_page(pmd)
 
+static inline void pud_populate_with_pgtable(struct mm_struct *mm, pud_t *pud,
+				struct page *pte)
+{
+	unsigned long pfn = page_to_pfn(pte);
+
+	paravirt_alloc_pmd(mm, pfn);
+	set_pud(pud, __pud(((pteval_t)pfn << PAGE_SHIFT) | _PAGE_TABLE));
+}
+
 #if CONFIG_PGTABLE_LEVELS > 2
 static inline pmd_t *pmd_alloc_one_page_with_ptes(struct mm_struct *mm, unsigned long addr)
 {
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7528652400e4..0c20a8ea6911 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -19,6 +19,7 @@ extern int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 extern void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud);
 extern int do_huge_pud_anonymous_page(struct vm_fault *vmf);
+extern vm_fault_t do_huge_pud_wp_page(struct vm_fault *vmf, pud_t orig_pud);
 #else
 static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
 {
@@ -27,6 +28,10 @@ extern int do_huge_pud_anonymous_page(struct vm_fault *vmf)
 {
 	return VM_FAULT_FALLBACK;
 }
+static inline vm_fault_t do_huge_pud_wp_page(struct vm_fault *vmf, pud_t orig_pud)
+{
+	return VM_FAULT_FALLBACK;
+}
 #endif
 
 extern vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ec3847392208..6da9b02501b7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1334,6 +1334,60 @@ void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
 unlock:
 	spin_unlock(vmf->ptl);
 }
+
+vm_fault_t do_huge_pud_wp_page(struct vm_fault *vmf, pud_t orig_pud)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct page *page = NULL;
+	unsigned long haddr = vmf->address & HPAGE_PUD_MASK;
+
+	vmf->ptl = pud_lockptr(vma->vm_mm, vmf->pud);
+	VM_BUG_ON_VMA(!vma->anon_vma, vma);
+
+	if (is_huge_zero_pud(orig_pud))
+		goto fallback;
+
+	spin_lock(vmf->ptl);
+
+	if (unlikely(!pud_same(*vmf->pud, orig_pud))) {
+		spin_unlock(vmf->ptl);
+		return 0;
+	}
+
+	page = pud_page(orig_pud);
+	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
+
+	/* Lock page for reuse_swap_page() */
+	if (!trylock_page(page)) {
+		get_page(page);
+		spin_unlock(vmf->ptl);
+		lock_page(page);
+		spin_lock(vmf->ptl);
+		if (unlikely(!pud_same(*vmf->pud, orig_pud))) {
+			unlock_page(page);
+			put_page(page);
+			return 0;
+		}
+		put_page(page);
+	}
+	if (reuse_swap_page(page, NULL)) {
+		pud_t entry;
+
+		entry = pud_mkyoung(orig_pud);
+		entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
+		if (pudp_set_access_flags(vma, haddr, vmf->pud, entry,  1))
+			update_mmu_cache_pud(vma, vmf->address, vmf->pud);
+		unlock_page(page);
+		spin_unlock(vmf->ptl);
+		return VM_FAULT_WRITE;
+	}
+	unlock_page(page);
+	spin_unlock(vmf->ptl);
+fallback:
+	__split_huge_pud(vma, vmf->pud, vmf->address);
+	return VM_FAULT_FALLBACK;
+}
+
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
 void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd)
diff --git a/mm/memory.c b/mm/memory.c
index 6f86294438fd..b88587256bc1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4165,7 +4165,7 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	/* No support for anonymous transparent PUD pages yet */
 	if (vma_is_anonymous(vmf->vma))
-		return VM_FAULT_FALLBACK;
+		return do_huge_pud_wp_page(vmf, orig_pud);
 	if (vmf->vma->vm_ops->huge_fault)
 		return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 20012c0c0252..e3f771c2ad83 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1635,7 +1635,9 @@ static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
 	/* hugetlbfs shouldn't call it */
 	VM_BUG_ON_PAGE(PageHuge(page), page);
 
-	if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!PageTransCompound(page))) {
+	if (!IS_ENABLED(CONFIG_THP_SWAP) ||
+	    unlikely(compound_order(compound_head(page)) == HPAGE_PUD_ORDER) ||
+	    likely(!PageTransCompound(page))) {
 		mapcount = page_trans_huge_mapcount(page, total_mapcount);
 		if (PageSwapCache(page))
 			swapcount = page_swapcount(page);
-- 
2.28.0




* [RFC PATCH 05/16] mm: thp: handling 1GB THP reference bit.
  2020-09-02 18:06 [RFC PATCH 00/16] 1GB THP support on x86_64 Zi Yan
                   ` (3 preceding siblings ...)
  2020-09-02 18:06 ` [RFC PATCH 04/16] mm: thp: 1GB THP copy on write implementation Zi Yan
@ 2020-09-02 18:06 ` Zi Yan
  2020-09-09 14:09   ` Kirill A. Shutemov
  2020-09-02 18:06 ` [RFC PATCH 06/16] mm: thp: add 1GB THP split_huge_pud_page() function Zi Yan
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 82+ messages in thread
From: Zi Yan @ 2020-09-02 18:06 UTC (permalink / raw)
  To: linux-mm, Roman Gushchin
  Cc: Rik van Riel, Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Add PUD-level TLB flush ops and teach page_vma_mapped_walk() about 1GB
THPs.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 arch/x86/include/asm/pgtable.h |  3 +++
 arch/x86/mm/pgtable.c          | 13 +++++++++++++
 include/linux/mmu_notifier.h   | 13 +++++++++++++
 include/linux/pgtable.h        | 14 ++++++++++++++
 include/linux/rmap.h           |  1 +
 mm/page_vma_mapped.c           | 33 +++++++++++++++++++++++++++++----
 mm/rmap.c                      | 12 +++++++++---
 7 files changed, 82 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 26255cac78c0..15334f5ba172 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1127,6 +1127,9 @@ extern int pudp_test_and_clear_young(struct vm_area_struct *vma,
 extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
 				  unsigned long address, pmd_t *pmdp);
 
+#define __HAVE_ARCH_PUDP_CLEAR_YOUNG_FLUSH
+extern int pudp_clear_flush_young(struct vm_area_struct *vma,
+				  unsigned long address, pud_t *pudp);
 
 #define pmd_write pmd_write
 static inline int pmd_write(pmd_t pmd)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 7be73aee6183..e4a2dffcc418 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -633,6 +633,19 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
 
 	return young;
 }
+int pudp_clear_flush_young(struct vm_area_struct *vma,
+			   unsigned long address, pud_t *pudp)
+{
+	int young;
+
+	VM_BUG_ON(address & ~HPAGE_PUD_MASK);
+
+	young = pudp_test_and_clear_young(vma, address, pudp);
+	if (young)
+		flush_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
+
+	return young;
+}
 #endif
 
 /**
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index b8200782dede..4ffa179e654f 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -557,6 +557,19 @@ static inline void mmu_notifier_range_init_migrate(
 	__young;							\
 })
 
+#define pudp_clear_flush_young_notify(__vma, __address, __pudp)		\
+({									\
+	int __young;							\
+	struct vm_area_struct *___vma = __vma;				\
+	unsigned long ___address = __address;				\
+	__young = pudp_clear_flush_young(___vma, ___address, __pudp);	\
+	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
+						  ___address,		\
+						  ___address +		\
+							PUD_SIZE);	\
+	__young;							\
+})
+
 #define ptep_clear_young_notify(__vma, __address, __ptep)		\
 ({									\
 	int __young;							\
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 255275d5b73e..8ef358c386af 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -240,6 +240,20 @@ static inline int pmdp_clear_flush_young(struct vm_area_struct *vma,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
+#ifndef __HAVE_ARCH_PUDP_CLEAR_YOUNG_FLUSH
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+extern int pudp_clear_flush_young(struct vm_area_struct *vma,
+				  unsigned long address, pud_t *pudp);
+#else
+static inline int pudp_clear_flush_young(struct vm_area_struct *vma,
+				  unsigned long address, pud_t *pudp)
+{
+	BUILD_BUG();
+	return 0;
+}
+#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD  */
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				       unsigned long address,
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 3a6adfa70fb0..0af61dd193d2 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -206,6 +206,7 @@ struct page_vma_mapped_walk {
 	struct page *page;
 	struct vm_area_struct *vma;
 	unsigned long address;
+	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 	spinlock_t *ptl;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 5e77b269c330..d9d39ec06e21 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -145,9 +145,12 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	struct page *page = pvmw->page;
 	pgd_t *pgd;
 	p4d_t *p4d;
-	pud_t *pud;
+	pud_t pude;
 	pmd_t pmde;
 
+	if (!pvmw->pte && !pvmw->pmd && pvmw->pud)
+		return not_found(pvmw);
+
 	/* The only possible pmd mapping has been handled on last iteration */
 	if (pvmw->pmd && !pvmw->pte)
 		return not_found(pvmw);
@@ -174,10 +177,31 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	p4d = p4d_offset(pgd, pvmw->address);
 	if (!p4d_present(*p4d))
 		return false;
-	pud = pud_offset(p4d, pvmw->address);
-	if (!pud_present(*pud))
+	pvmw->pud = pud_offset(p4d, pvmw->address);
+
+	/*
+	 * Make sure the pud value isn't cached in a register by the
+	 * compiler and used as a stale value after we've observed a
+	 * subsequent update.
+	 */
+	pude = READ_ONCE(*pvmw->pud);
+	if (pud_trans_huge(pude)) {
+		pvmw->ptl = pud_lock(mm, pvmw->pud);
+		if (likely(pud_trans_huge(*pvmw->pud))) {
+			if (pvmw->flags & PVMW_MIGRATION)
+				return not_found(pvmw);
+			if (pud_page(*pvmw->pud) != page)
+				return not_found(pvmw);
+			return true;
+		} else {
+			/* THP pud was split under us: handle on pmd level */
+			spin_unlock(pvmw->ptl);
+			pvmw->ptl = NULL;
+		}
+	} else if (!pud_present(pude))
 		return false;
-	pvmw->pmd = pmd_offset(pud, pvmw->address);
+
+	pvmw->pmd = pmd_offset(pvmw->pud, pvmw->address);
 	/*
 	 * Make sure the pmd value isn't cached in a register by the
 	 * compiler and used as a stale value after we've observed a
@@ -213,6 +237,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	} else if (!pmd_present(pmde)) {
 		return false;
 	}
+
 	if (!map_pte(pvmw))
 		goto next_pte;
 	while (1) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 10195a2421cf..77cec0658b76 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -803,9 +803,15 @@ static bool page_referenced_one(struct page *page, struct vm_area_struct *vma,
 					referenced++;
 			}
 		} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
-			if (pmdp_clear_flush_young_notify(vma, address,
-						pvmw.pmd))
-				referenced++;
+			if (pvmw.pmd) {
+				if (pmdp_clear_flush_young_notify(vma, address,
+							pvmw.pmd))
+					referenced++;
+			} else if (pvmw.pud) {
+				if (pudp_clear_flush_young_notify(vma, address,
+							pvmw.pud))
+					referenced++;
+			}
 		} else {
 			/* unexpected pmd-mapped page? */
 			WARN_ON_ONCE(1);
-- 
2.28.0




* [RFC PATCH 06/16] mm: thp: add 1GB THP split_huge_pud_page() function.
  2020-09-02 18:06 [RFC PATCH 00/16] 1GB THP support on x86_64 Zi Yan
                   ` (4 preceding siblings ...)
  2020-09-02 18:06 ` [RFC PATCH 05/16] mm: thp: handling 1GB THP reference bit Zi Yan
@ 2020-09-02 18:06 ` Zi Yan
  2020-09-09 14:18   ` Kirill A. Shutemov
  2020-09-02 18:06 ` [RFC PATCH 07/16] mm: stats: make smap stats understand PUD THPs Zi Yan
                   ` (12 subsequent siblings)
  18 siblings, 1 reply; 82+ messages in thread
From: Zi Yan @ 2020-09-02 18:06 UTC (permalink / raw)
  To: linux-mm, Roman Gushchin
  Cc: Rik van Riel, Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel, Zi Yan

From: Zi Yan <ziy@nvidia.com>

This mimics the PMD-level THP split. In addition, PMDPageInPUD() is used to
support PMD-mapped PUD THPs. For the mapcount of a PMD-mapped PUD THP,
sub_compound_mapcount() is used, which reads
(PMDPageInPUD + 3).compound_mapcount (i.e. page[N*512 + 3] of the head page),
since each base page's _mapcount is already used for PTE mappings.
PagePUDDoubleMap() is used for both PUD-mapped and PMD-mapped PUD THPs.

The page_xxx_rmap() functions now take an extra page order parameter to
distinguish different THP sizes.
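
A minimal sketch of where the different mapcounts of a 1GB THP live (the helper
name below is made up; it mirrors sub_compound_mapcount_ptr() added in this
patch and assumes 512 base pages per PMD and 512 PMDs per PUD):

	/*
	 * head[1].compound_mapcount  - PUD mappings of the whole 1GB THP
	 * head[i]._mapcount          - PTE mappings of base page i
	 * The PMD mapcount of the N-th 2MB region (its PMDPageInPUD) sits
	 * three pages into that region:
	 */
	static inline atomic_t *pud_thp_pmd_mapcount_ptr(struct page *head, int n)
	{
		return &head[n * HPAGE_PMD_NR + 3].compound_mapcount;	/* page[N*512 + 3] */
	}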

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 arch/x86/include/asm/pgtable.h |  21 ++
 include/linux/huge_mm.h        |  31 +-
 include/linux/memcontrol.h     |   5 +
 include/linux/mm.h             |  25 +-
 include/linux/page-flags.h     |  47 +++
 include/linux/pgtable.h        |  17 ++
 include/linux/rmap.h           |   9 +-
 include/linux/swap.h           |   2 +
 include/linux/vm_event_item.h  |   4 +
 kernel/events/uprobes.c        |   4 +-
 mm/huge_memory.c               | 536 +++++++++++++++++++++++++++++++--
 mm/hugetlb.c                   |   4 +-
 mm/khugepaged.c                |   6 +-
 mm/ksm.c                       |   4 +-
 mm/memcontrol.c                |  13 +
 mm/memory.c                    |  18 +-
 mm/migrate.c                   |  10 +-
 mm/page_alloc.c                |  20 +-
 mm/pgtable-generic.c           |  11 +
 mm/rmap.c                      | 106 +++++--
 mm/swap.c                      |  31 ++
 mm/swapfile.c                  |   4 +-
 mm/userfaultfd.c               |   2 +-
 mm/util.c                      |  16 +-
 mm/vmstat.c                    |   4 +
 25 files changed, 852 insertions(+), 98 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 15334f5ba172..fe4600256bc7 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -630,6 +630,12 @@ static inline pmd_t pmd_mkinvalid(pmd_t pmd)
 		      __pgprot(pmd_flags(pmd) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
 }
 
+static inline pud_t pud_mknotpresent(pud_t pud)
+{
+	return pfn_pud(pud_pfn(pud),
+	      __pgprot(pud_flags(pud) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
+}
+
 static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
 
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
@@ -1246,6 +1252,21 @@ static inline p4d_t *user_to_kernel_p4dp(p4d_t *p4dp)
 }
 #endif /* CONFIG_PAGE_TABLE_ISOLATION */
 
+#ifndef pudp_establish
+#define pudp_establish pudp_establish
+static inline pud_t pudp_establish(struct vm_area_struct *vma,
+		unsigned long address, pud_t *pudp, pud_t pud)
+{
+	if (IS_ENABLED(CONFIG_SMP)) {
+		return xchg(pudp, pud);
+	} else {
+		pud_t old = *pudp;
+		*pudp = pud;
+		return old;
+	}
+}
+#endif
+
 /*
  * clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
  *
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 0c20a8ea6911..589e5af5a1c2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -227,17 +227,27 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
 		bool freeze, struct page *page);
 
+bool can_split_huge_pud_page(struct page *page, int *pextra_pins);
+int split_huge_pud_page_to_list(struct page *page, struct list_head *list);
+static inline int split_huge_pud_page(struct page *page)
+{
+	return split_huge_pud_page_to_list(page, NULL);
+}
 void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
-		unsigned long address);
+		unsigned long address, bool freeze, struct page *page);
 
 #define split_huge_pud(__vma, __pud, __address)				\
 	do {								\
 		pud_t *____pud = (__pud);				\
 		if (pud_trans_huge(*____pud)				\
 					|| pud_devmap(*____pud))	\
-			__split_huge_pud(__vma, __pud, __address);	\
+			__split_huge_pud(__vma, __pud, __address,	\
+						false, NULL);		\
 	}  while (0)
 
+void split_huge_pud_address(struct vm_area_struct *vma, unsigned long address,
+		bool freeze, struct page *page);
+
 extern int hugepage_madvise(struct vm_area_struct *vma,
 			    unsigned long *vm_flags, int advice);
 extern void vma_adjust_trans_huge(struct vm_area_struct *vma,
@@ -427,8 +437,25 @@ static inline void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 static inline void split_huge_pmd_address(struct vm_area_struct *vma,
 		unsigned long address, bool freeze, struct page *page) {}
 
+static inline bool
+can_split_huge_pud_page(struct page *page, int *pextra_pins)
+{
+	BUILD_BUG();
+	return false;
+}
+static inline int
+split_huge_pud_page_to_list(struct page *page, struct list_head *list)
+{
+	return 0;
+}
+static inline int split_huge_pud_page(struct page *page)
+{
+	return 0;
+}
 #define split_huge_pud(__vma, __pmd, __address)	\
 	do { } while (0)
+static inline void split_huge_pud_address(struct vm_area_struct *vma,
+		unsigned long address, bool freeze, struct page *page) {}
 
 static inline int hugepage_madvise(struct vm_area_struct *vma,
 				   unsigned long *vm_flags, int advice)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d0b036123c6a..3ccff298d4b2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -929,6 +929,7 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void mem_cgroup_split_huge_fixup(struct page *head);
+void mem_cgroup_split_huge_pud_fixup(struct page *head);
 #endif
 
 #else /* CONFIG_MEMCG */
@@ -1261,6 +1262,10 @@ static inline void mem_cgroup_split_huge_fixup(struct page *head)
 {
 }
 
+static inline void mem_cgroup_split_huge_pud_fixup(struct page *head)
+{
+}
+
 static inline void count_memcg_events(struct mem_cgroup *memcg,
 				      enum vm_event_item idx,
 				      unsigned long count)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index cb1ccf804404..8a85d96ab7e5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -797,6 +797,24 @@ static inline int compound_mapcount(struct page *page)
 	return head_compound_mapcount(page);
 }
 
+static inline unsigned int compound_order(struct page *page);
+static inline atomic_t *sub_compound_mapcount_ptr(struct page *page, int sub_level)
+{
+	struct page *head = compound_head(page);
+
+	VM_BUG_ON_PAGE(!PageCompound(page), page);
+	VM_BUG_ON_PAGE(compound_order(head) != HPAGE_PUD_ORDER, page);
+	VM_BUG_ON_PAGE((page - head) % HPAGE_PMD_NR, page);
+	VM_BUG_ON_PAGE(sub_level != 1, page);
+	return &page[2 + sub_level].compound_mapcount;
+}
+
+/* Only works for PUD pages */
+static inline int sub_compound_mapcount(struct page *page)
+{
+	return atomic_read(sub_compound_mapcount_ptr(page, 1)) + 1;
+}
+
 /*
  * The atomic page->_mapcount, starts from -1: so that transitions
  * both from it and to it can be tracked, using atomic_inc_and_test
@@ -889,13 +907,6 @@ static inline void destroy_compound_page(struct page *page)
 	compound_page_dtors[page[1].compound_dtor](page);
 }
 
-static inline unsigned int compound_order(struct page *page)
-{
-	if (!PageHead(page))
-		return 0;
-	return page[1].compound_order;
-}
-
 static inline bool hpage_pincount_available(struct page *page)
 {
 	/*
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index fbbb841a9346..cdca0165d2db 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -235,6 +235,9 @@ static inline void page_init_poison(struct page *page, size_t size)
  *
  * PF_SECOND:
  *     the page flag is stored in the first tail page.
+ *
+ * PF_THIRD:
+ *     the page flag is stored in the second tail page.
  */
 #define PF_POISONED_CHECK(page) ({					\
 		VM_BUG_ON_PGFLAGS(PagePoisoned(page), page);		\
@@ -253,6 +256,9 @@ static inline void page_init_poison(struct page *page, size_t size)
 #define PF_SECOND(page, enforce) ({					\
 		VM_BUG_ON_PGFLAGS(!PageHead(page), page);		\
 		PF_POISONED_CHECK(&page[1]); })
+#define PF_THIRD(page, enforce) ({					\
+		VM_BUG_ON_PGFLAGS(!PageHead(page), page);		\
+		PF_POISONED_CHECK(&page[2]); })
 
 /*
  * Macros to create function definitions for page flags
@@ -674,6 +680,29 @@ static inline int PageTransTail(struct page *page)
 	return PageTail(page);
 }
 
+#define HPAGE_PMD_SHIFT PMD_SHIFT
+#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
+#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
+
+#define HPAGE_PUD_SHIFT PUD_SHIFT
+#define HPAGE_PUD_ORDER (HPAGE_PUD_SHIFT-PAGE_SHIFT)
+#define HPAGE_PUD_NR (1<<HPAGE_PUD_ORDER)
+
+static inline unsigned int compound_order(struct page *page)
+{
+	if (!PageHead(page))
+		return 0;
+	return page[1].compound_order;
+}
+
+
+static inline int PMDPageInPUD(struct page *page)
+{
+	struct page *head = compound_head(page);
+	return (PageCompound(page) && compound_order(head) == HPAGE_PUD_ORDER &&
+		((page - head) % HPAGE_PMD_NR == 0));
+}
+
 /*
  * PageDoubleMap indicates that the compound page is mapped with PTEs as well
  * as PMDs.
@@ -689,13 +718,31 @@ static inline int PageTransTail(struct page *page)
  */
 PAGEFLAG(DoubleMap, double_map, PF_SECOND)
 	TESTSCFLAG(DoubleMap, double_map, PF_SECOND)
+/*
+ * PagePUDDoubleMap indicates that the compound page is mapped with PMDs as well
+ * as PUDs.
+ *
+ * This is required for optimization of rmap operations for THP: we can postpone
+ * per sub-PMD page mapcount accounting (and its overhead from atomic operations)
+ * until the first PUD split.
+ *
+ * For the page, PagePUDDoubleMap means the sub_compound_mapcount in all sub-PMD
+ * pages is offset up by one. This reference will go away with the last compound_mapcount.
+ *
+ * See also __split_huge_pud_locked() and page_remove_anon_compound_rmap().
+ */
+PAGEFLAG(PUDDoubleMap, double_map, PF_THIRD)
+	TESTSCFLAG(PUDDoubleMap, double_map, PF_THIRD)
 #else
 TESTPAGEFLAG_FALSE(TransHuge)
 TESTPAGEFLAG_FALSE(TransCompound)
 TESTPAGEFLAG_FALSE(TransCompoundMap)
 TESTPAGEFLAG_FALSE(TransTail)
+TESTPAGEFLAG_FALSE(PMDPageInPUD)
 PAGEFLAG_FALSE(DoubleMap)
 	TESTSCFLAG_FALSE(DoubleMap)
+PAGEFLAG_FALSE(PUDDoubleMap)
+	TESTSCFLAG_FALSE(PUDDoubleMap)
 #endif
 
 /*
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 8ef358c386af..7acf218a8879 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -505,6 +505,11 @@ extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 			    pmd_t *pmdp);
 #endif
 
+#ifndef __HAVE_ARCH_PUDP_INVALIDATE
+extern pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+			    pud_t *pudp);
+#endif
+
 #ifndef __HAVE_ARCH_PTE_SAME
 static inline int pte_same(pte_t pte_a, pte_t pte_b)
 {
@@ -1158,6 +1163,18 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 }
 #endif
 
+#ifndef pud_read_atomic
+static inline pud_t pud_read_atomic(pud_t *pudp)
+{
+	/*
+	 * Depend on the compiler for an atomic pud read. NOTE: this is
+	 * only going to work if the pudval_t isn't larger than
+	 * an unsigned long.
+	 */
+	return *pudp;
+}
+#endif
+
 #ifndef arch_needs_pgtable_deposit
 #define arch_needs_pgtable_deposit() (false)
 #endif
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 0af61dd193d2..c43da5919354 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -99,6 +99,7 @@ enum ttu_flags {
 	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
 					 * caller holds it */
 	TTU_SPLIT_FREEZE	= 0x100,		/* freeze pte under splitting thp */
+	TTU_SPLIT_HUGE_PUD	= 0x200,		/* split huge PUD if any */
 };
 
 #ifdef CONFIG_MMU
@@ -171,13 +172,13 @@ struct anon_vma *page_get_anon_vma(struct page *page);
  */
 void page_move_anon_rmap(struct page *, struct vm_area_struct *);
 void page_add_anon_rmap(struct page *, struct vm_area_struct *,
-		unsigned long, bool);
+		unsigned long, bool, int);
 void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
-			   unsigned long, int);
+			   unsigned long, int, int);
 void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
-		unsigned long, bool);
+		unsigned long, bool, int);
 void page_add_file_rmap(struct page *, bool);
-void page_remove_rmap(struct page *, bool);
+void page_remove_rmap(struct page *, bool, int);
 
 void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
 			    unsigned long);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 5c48713221fe..871c62211ecd 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -340,6 +340,8 @@ extern void lru_note_cost_page(struct page *);
 extern void lru_cache_add(struct page *);
 extern void lru_add_page_tail(struct page *page, struct page *page_tail,
 			 struct lruvec *lruvec, struct list_head *head);
+extern void lru_add_pud_page_tail(struct page *page, struct page *page_tail,
+			 struct lruvec *lruvec, struct list_head *head);
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index a3f1093a55bb..b336de64586c 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -96,6 +96,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_FAULT_FALLBACK_PUD,
 		THP_FAULT_FALLBACK_PUD_CHARGE,
 		THP_SPLIT_PUD,
+		THP_SPLIT_PUD_PAGE,
+		THP_SPLIT_PUD_PAGE_FAILED,
+		THP_ZERO_PUD_PAGE_ALLOC,
+		THP_ZERO_PUD_PAGE_ALLOC_FAILED,
 #endif
 		THP_ZERO_PAGE_ALLOC,
 		THP_ZERO_PAGE_ALLOC_FAILED,
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 0e18aaf23a7b..834b350a49f6 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -183,7 +183,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	if (new_page) {
 		get_page(new_page);
-		page_add_new_anon_rmap(new_page, vma, addr, false);
+		page_add_new_anon_rmap(new_page, vma, addr, false, 0);
 		lru_cache_add_inactive_or_unevictable(new_page, vma);
 	} else
 		/* no new page, just dec_mm_counter for old_page */
@@ -200,7 +200,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 		set_pte_at_notify(mm, addr, pvmw.pte,
 				  mk_pte(new_page, vma->vm_page_prot));
 
-	page_remove_rmap(old_page, false);
+	page_remove_rmap(old_page, false, 0);
 	if (!page_mapped(old_page))
 		try_to_free_swap(old_page);
 	page_vma_mapped_walk_done(&pvmw);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6da9b02501b7..398f1b52f789 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -618,7 +618,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
-		page_add_new_anon_rmap(page, vma, haddr, true);
+		page_add_new_anon_rmap(page, vma, haddr, true, HPAGE_PMD_ORDER);
 		lru_cache_add_inactive_or_unevictable(page, vma);
 		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
 		set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
@@ -991,7 +991,7 @@ static int __do_huge_pud_anonymous_page(struct vm_fault *vmf, struct page *page,
 
 		entry = mk_huge_pud(page, vma->vm_page_prot);
 		entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
-		page_add_new_anon_rmap(page, vma, haddr, true);
+		page_add_new_anon_rmap(page, vma, haddr, true, HPAGE_PUD_ORDER);
 		lru_cache_add_inactive_or_unevictable(page, vma);
 		pgtable_trans_huge_pud_deposit(vma->vm_mm, vmf->pud,
 				virt_to_page(pmd_pgtable));
@@ -1384,7 +1384,7 @@ vm_fault_t do_huge_pud_wp_page(struct vm_fault *vmf, pud_t orig_pud)
 	unlock_page(page);
 	spin_unlock(vmf->ptl);
 fallback:
-	__split_huge_pud(vma, vmf->pud, vmf->address);
+	__split_huge_pud(vma, vmf->pud, vmf->address, false, NULL);
 	return VM_FAULT_FALLBACK;
 }
 
@@ -1825,9 +1825,9 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 
 		if (pmd_present(orig_pmd)) {
 			page = pmd_page(orig_pmd);
-			page_remove_rmap(page, true);
+			page_remove_rmap(page, true, HPAGE_PMD_ORDER);
 			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
-			VM_BUG_ON_PAGE(!PageHead(page), page);
+			VM_BUG_ON_PAGE(!PageHead(page) && !PMDPageInPUD(page), page);
 		} else if (thp_migration_supported()) {
 			swp_entry_t entry;
 
@@ -2111,7 +2111,7 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 
 		if (pud_present(orig_pud)) {
 			page = pud_page(orig_pud);
-			page_remove_rmap(page, true);
+			page_remove_rmap(page, true, HPAGE_PUD_ORDER);
 			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
 			VM_BUG_ON_PAGE(!PageHead(page), page);
 		} else
@@ -2134,8 +2134,16 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 }
 
 static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
-		unsigned long haddr)
+		unsigned long haddr, bool freeze)
 {
+	struct mm_struct *mm = vma->vm_mm;
+	struct page *page;
+	pgtable_t pgtable;
+	pud_t _pud, old_pud;
+	bool young, write, dirty, soft_dirty;
+	unsigned long addr;
+	int i;
+
 	VM_BUG_ON(haddr & ~HPAGE_PUD_MASK);
 	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
 	VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PUD_SIZE, vma);
@@ -2143,23 +2151,141 @@ static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
 
 	count_vm_event(THP_SPLIT_PUD);
 
-	pudp_huge_clear_flush_notify(vma, haddr, pud);
+	if (!vma_is_anonymous(vma)) {
+		_pud = pudp_huge_clear_flush_notify(vma, haddr, pud);
+		/*
+		 * We are going to unmap this huge page. So
+		 * just go ahead and zap it
+		 */
+		if (arch_needs_pgtable_deposit())
+			zap_pud_deposited_table(mm, pud);
+		if (vma_is_dax(vma))
+			return;
+		page = pud_page(_pud);
+		if (!PageReferenced(page) && pud_young(_pud))
+			SetPageReferenced(page);
+		page_remove_rmap(page, true, HPAGE_PUD_ORDER);
+		put_page(page);
+		add_mm_counter(mm, MM_FILEPAGES, -HPAGE_PUD_NR);
+		return;
+	}
+
+	/* See the comment above pmdp_invalidate() in __split_huge_pmd_locked() */
+	old_pud = pudp_invalidate(vma, haddr, pud);
+
+	page = pud_page(old_pud);
+	VM_BUG_ON_PAGE(!page_count(page), page);
+	page_ref_add(page, (1<<(HPAGE_PUD_ORDER-HPAGE_PMD_ORDER)) - 1);
+	if (pud_dirty(old_pud))
+		SetPageDirty(page);
+	write = pud_write(old_pud);
+	young = pud_young(old_pud);
+	dirty = pud_dirty(old_pud);
+	soft_dirty = pud_soft_dirty(old_pud);
+
+	pgtable = pgtable_trans_huge_pud_withdraw(mm, pud);
+	pud_populate_with_pgtable(mm, &_pud, pgtable);
+
+	for (i = 0, addr = haddr; i < HPAGE_PUD_NR;
+		 i += HPAGE_PMD_NR, addr += PMD_SIZE) {
+		pmd_t entry, *pmd;
+		/*
+		 * Note that NUMA hinting access restrictions are not
+		 * transferred to avoid any possibility of altering
+		 * permissions across VMAs.
+		 */
+		if (freeze) {
+			swp_entry_t swp_entry;
+
+			swp_entry = make_migration_entry(page + i, write);
+			entry = swp_entry_to_pmd(swp_entry);
+			if (soft_dirty)
+				entry = pmd_swp_mksoft_dirty(entry);
+		} else {
+			entry = mk_huge_pmd(page + i, READ_ONCE(vma->vm_page_prot));
+			entry = maybe_pmd_mkwrite(entry, vma);
+			if (!write)
+				entry = pmd_wrprotect(entry);
+			if (!young)
+				entry = pmd_mkold(entry);
+			if (soft_dirty)
+				entry = pmd_mksoft_dirty(entry);
+		}
+		pmd = pmd_offset(&_pud, addr);
+		VM_BUG_ON(!pmd_none(*pmd));
+		set_pmd_at(mm, addr, pmd, entry);
+		/* distinguish between pud compound_mapcount and pmd compound_mapcount */
+		if (atomic_inc_and_test(sub_compound_mapcount_ptr(&page[i], 1))) {
+			/* first pmd-mapped pud page */
+			lock_page_memcg(page);
+			__inc_lruvec_page_state(page, NR_ANON_THPS);
+			unlock_page_memcg(page);
+		}
+	}
+
+	/*
+	 * Set PagePUDDoubleMap before dropping compound_mapcount to avoid
+	 * false-negative page_mapped().
+	 */
+	if (compound_mapcount(page) > 1 && !TestSetPagePUDDoubleMap(page)) {
+		for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR)
+		/* distinguish between pud compound_mapcount and pmd compound_mapcount */
+			atomic_inc(sub_compound_mapcount_ptr(&page[i], 1));
+	}
+
+	lock_page_memcg(page);
+	if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
+		/* Last compound_mapcount is gone. */
+		__dec_lruvec_page_state(page, NR_ANON_THPS_PUD);
+		if (TestClearPagePUDDoubleMap(page)) {
+			/* No need in mapcount reference anymore */
+			for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR)
+		/* distinguish between pud compound_mapcount and pmd compound_mapcount */
+				atomic_dec(sub_compound_mapcount_ptr(&page[i], 1));
+		}
+	}
+	unlock_page_memcg(page);
+
+	smp_wmb(); /* make pte visible before pmd */
+	pud_populate_with_pgtable(mm, pud, pgtable);
+
+	if (freeze) {
+		for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR) {
+			page_remove_rmap(page + i, true, HPAGE_PMD_ORDER);
+			put_page(page + i);
+		}
+	}
 }
 
 void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
-		unsigned long address)
+		unsigned long address, bool freeze, struct page *page)
 {
 	spinlock_t *ptl;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long haddr = address & HPAGE_PUD_MASK;
 	struct mmu_notifier_range range;
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
 				address & HPAGE_PUD_MASK,
 				(address & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
-	ptl = pud_lock(vma->vm_mm, pud);
-	if (unlikely(!pud_trans_huge(*pud) && !pud_devmap(*pud)))
+	ptl = pud_lock(mm, pud);
+
+	/*
+	 * If the caller asks us to set up migration entries, we need a page to
+	 * check the pud against. Otherwise we can end up replacing the wrong page.
+	 */
+	VM_BUG_ON(freeze && !page);
+	if (page && page != pud_page(*pud))
 		goto out;
-	__split_huge_pud_locked(vma, pud, range.start);
+
+	if (pud_trans_huge(*pud)) {
+		page = pud_page(*pud);
+		if (PageMlocked(page))
+			clear_page_mlock(page);
+	} else if (unlikely(!pud_devmap(*pud)))
+		goto out;
+	__split_huge_pud_locked(vma, pud, haddr, freeze);
 
 out:
 	spin_unlock(ptl);
@@ -2169,6 +2295,281 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 	 */
 	mmu_notifier_invalidate_range_only_end(&range);
 }
+
+void split_huge_pud_address(struct vm_area_struct *vma, unsigned long address,
+		bool freeze, struct page *page)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+
+	pgd = pgd_offset(vma->vm_mm, address);
+	if (!pgd_present(*pgd))
+		return;
+
+	p4d = p4d_offset(pgd, address);
+	if (!p4d_present(*p4d))
+		return;
+
+	pud = pud_offset(p4d, address);
+
+	__split_huge_pud(vma, pud, address, freeze, page);
+}
+
+static void unmap_pud_page(struct page *page)
+{
+	enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
+		TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PUD;
+	bool unmap_success;
+
+	VM_BUG_ON_PAGE(!PageHead(page), page);
+
+	if (PageAnon(page))
+		ttu_flags |= TTU_SPLIT_FREEZE;
+
+	unmap_success = try_to_unmap(page, ttu_flags);
+	VM_BUG_ON_PAGE(!unmap_success, page);
+}
+
+static void remap_pud_page(struct page *page)
+{
+	int i;
+
+	VM_BUG_ON(!PageTransHuge(page));
+	if (compound_order(page) == HPAGE_PUD_ORDER) {
+		remove_migration_ptes(page, page, true);
+	} else if (compound_order(page) == HPAGE_PMD_ORDER) {
+		for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR)
+			remove_migration_ptes(page + i, page + i, true);
+	} else
+		VM_BUG_ON_PAGE(1, page);
+}
+
+static void __split_huge_pud_page_tail(struct page *head, int tail,
+		struct lruvec *lruvec, struct list_head *list)
+{
+	struct page *page_tail = head + tail;
+
+	VM_BUG_ON_PAGE(page_ref_count(page_tail) != 0, page_tail);
+
+	/*
+	 * Clone page flags before unfreezing refcount.
+	 *
+	 * After successful get_page_unless_zero() might follow flags change,
+	 * for example lock_page() which set PG_waiters.
+	 */
+
+	page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+	page_tail->flags |= (head->flags &
+			((1L << PG_referenced) |
+			 (1L << PG_swapbacked) |
+			 (1L << PG_swapcache) |
+			 (1L << PG_mlocked) |
+			 (1L << PG_uptodate) |
+			 (1L << PG_active) |
+			 (1L << PG_locked) |
+			 (1L << PG_unevictable) |
+			 (1L << PG_dirty) |
+			 /* preserve THP */
+			 (1L << PG_head)));
+
+	/* ->mapping in first tail page is compound_mapcount */
+	VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
+			page_tail);
+	page_tail->mapping = head->mapping;
+	page_tail->index = head->index + tail;
+
+	/* Page flags also must be visible before we make the page PMD-compound. */
+	smp_wmb();
+
+	clear_compound_head(page_tail);
+	prep_compound_page(page_tail, HPAGE_PMD_ORDER);
+	prep_transhuge_page(page_tail);
+
+	/* Finally unfreeze refcount. Additional reference from page cache. */
+	page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) ||
+					  PageSwapCache(head)));
+
+	if (page_is_young(head))
+		set_page_young(page_tail);
+	if (page_is_idle(head))
+		set_page_idle(page_tail);
+
+	page_cpupid_xchg_last(page_tail, page_cpupid_last(head));
+	lru_add_pud_page_tail(head, page_tail, lruvec, list);
+}
+
+static void __split_huge_pud_page(struct page *page, struct list_head *list,
+		unsigned long flags)
+{
+	struct page *head = compound_head(page);
+	pg_data_t *pgdat = page_pgdat(head);
+	struct lruvec *lruvec;
+	int i;
+
+	lruvec = mem_cgroup_page_lruvec(head, pgdat);
+
+	/* complete memcg work before adding pages to LRU */
+	mem_cgroup_split_huge_pud_fixup(head);
+
+	/* no file-back page support yet */
+	VM_BUG_ON(!PageAnon(page));
+
+	for (i = HPAGE_PUD_NR - HPAGE_PMD_NR; i >= 1; i -= HPAGE_PMD_NR) {
+		__split_huge_pud_page_tail(head, i, lruvec, list);
+	}
+	/* reset head page order  */
+	prep_compound_page(head, HPAGE_PMD_ORDER);
+	prep_transhuge_page(head);
+
+	page_ref_inc(head);
+
+	spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+
+	remap_pud_page(head);
+
+	for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR) {
+		struct page *subpage = head + i;
+
+		if (subpage == page)
+			continue;
+		unlock_page(subpage);
+
+		/*
+		 * Subpages may be freed if there wasn't any mapping left,
+		 * e.g. if add_to_swap() is running on an LRU page that
+		 * had its mapping zapped. And freeing these pages
+		 * requires taking the lru_lock so we do the put_page
+		 * of the tail pages after the split is complete.
+		 */
+		put_page(subpage);
+	}
+}
+/* Racy check whether the huge page can be split */
+bool can_split_huge_pud_page(struct page *page, int *pextra_pins)
+{
+	int extra_pins;
+
+	VM_BUG_ON(!PageAnon(page));
+
+	extra_pins = PageSwapCache(page) ? HPAGE_PUD_NR : 0;
+
+	if (pextra_pins)
+		*pextra_pins = extra_pins;
+	return total_mapcount(page) == page_count(page) - extra_pins - 1;
+}
+
+/*
+ * This function splits a PUD-sized huge page into PMD-sized huge pages. @page can point
+ * to any subpage of the huge page to split. Split doesn't change the position of @page.
+ *
+ * The caller must hold a pin on @page, otherwise the split fails with -EBUSY.
+ * The huge page must be locked.
+ *
+ * If @list is null, tail pages will be added to LRU list, otherwise, to @list.
+ *
+ * Both head page and tail pages will inherit mapping, flags, and so on from
+ * the hugepage.
+ *
+ * The GUP pin and PG_locked are transferred to @page. The remaining subpages can be
+ * freed if they are not mapped.
+ *
+ * Returns 0 if the hugepage is split successfully.
+ * Returns -EBUSY if the page is pinned or if anon_vma disappeared from under
+ * us.
+ */
+int split_huge_pud_page_to_list(struct page *page, struct list_head *list)
+{
+	struct page *head = compound_head(page);
+	struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
+	struct deferred_split *ds_queue = get_deferred_split_queue(head);
+	struct anon_vma *anon_vma = NULL;
+	struct address_space *mapping = NULL;
+	int count, mapcount, extra_pins, ret;
+	bool mlocked;
+	unsigned long flags;
+
+	VM_BUG_ON_PAGE(is_huge_zero_page(page), page);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(!PageCompound(page), page);
+	VM_BUG_ON_PAGE(!PageAnon(page), page);
+
+	if (PageWriteback(page))
+		return -EBUSY;
+
+	/*
+	 * The caller does not necessarily hold an mmap_sem that would
+	 * prevent the anon_vma from disappearing, so we first take a
+	 * reference to it and then lock the anon_vma for write. This
+	 * is similar to page_lock_anon_vma_read except the write lock
+	 * is taken to serialise against parallel split or collapse
+	 * operations.
+	 */
+	anon_vma = page_get_anon_vma(head);
+	if (!anon_vma) {
+		ret = -EBUSY;
+		goto out;
+	}
+	mapping = NULL;
+	anon_vma_lock_write(anon_vma);
+	/*
+	 * Racy check if we can split the page, before unmap_pud_page() will
+	 * Racy check whether we can split the page, done before unmap_pud_page()
+	 * splits the PUDs.
+	if (!can_split_huge_pud_page(head, &extra_pins)) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
+	mlocked = PageMlocked(page);
+	unmap_pud_page(head);
+	VM_BUG_ON_PAGE(compound_mapcount(head), head);
+
+	/* Make sure the page is not on per-CPU pagevec as it takes pin */
+	if (mlocked)
+		lru_add_drain();
+
+	/* prevent PageLRU from going away from under us, and freeze lru stats */
+	spin_lock_irqsave(&pgdata->lru_lock, flags);
+
+	/* Prevent deferred_split_scan() touching ->_refcount */
+	spin_lock(&ds_queue->split_queue_lock);
+	count = page_count(head);
+	mapcount = total_mapcount(head);
+	if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
+		if (!list_empty(page_deferred_list(head))) {
+			ds_queue->split_queue_len--;
+			list_del(page_deferred_list(head));
+		}
+		if (mapping) {
+			__dec_node_page_state(page, NR_SHMEM_THPS);
+		}
+		spin_unlock(&ds_queue->split_queue_lock);
+		__split_huge_pud_page(page, list, flags);
+		ret = 0;
+	} else {
+		if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
+			pr_alert("total_mapcount: %u, page_count(): %u\n",
+					mapcount, count);
+			if (PageTail(page))
+				dump_page(head, NULL);
+			dump_page(page, "total_mapcount(head) > 0");
+		}
+		spin_unlock(&ds_queue->split_queue_lock);
+		spin_unlock_irqrestore(&pgdata->lru_lock, flags);
+		remap_pud_page(head);
+		ret = -EBUSY;
+	}
+
+out_unlock:
+	if (anon_vma) {
+		anon_vma_unlock_write(anon_vma);
+		put_anon_vma(anon_vma);
+	}
+out:
+	count_vm_event(!ret ? THP_SPLIT_PUD_PAGE : THP_SPLIT_PUD_PAGE_FAILED);
+	return ret;
+}
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
 static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
@@ -2209,7 +2610,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long haddr, bool freeze)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	struct page *page;
+	struct page *page, *head;
 	pgtable_t pgtable;
 	pmd_t old_pmd, _pmd;
 	bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
@@ -2239,7 +2640,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			set_page_dirty(page);
 		if (!PageReferenced(page) && pmd_young(_pmd))
 			SetPageReferenced(page);
-		page_remove_rmap(page, true);
+		page_remove_rmap(page, true, HPAGE_PMD_ORDER);
 		put_page(page);
 		add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR);
 		return;
@@ -2298,7 +2699,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		uffd_wp = pmd_uffd_wp(old_pmd);
 	}
 	VM_BUG_ON_PAGE(!page_count(page), page);
-	page_ref_add(page, HPAGE_PMD_NR - 1);
+	head = compound_head(page);
+	page_ref_add(head, HPAGE_PMD_NR - 1);
 
 	/*
 	 * Withdraw the table only after we mark the pmd entry invalid.
@@ -2344,14 +2746,24 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	/*
 	 * Set PG_double_map before dropping compound_mapcount to avoid
 	 * false-negative page_mapped().
+	 * Don't set it if the PUD page is mapped at PUD level, since
+	 * page_mapped() is true in that case.
 	 */
-	if (compound_mapcount(page) > 1 && !TestSetPageDoubleMap(page)) {
+	if (((PMDPageInPUD(page) &&
+		sub_compound_mapcount(page) >
+			(1 + PagePUDDoubleMap(compound_head(page)))) ||
+	    (!PMDPageInPUD(page) &&
+		compound_mapcount(page) > 1))
+		&& !TestSetPageDoubleMap(page)) {
 		for (i = 0; i < HPAGE_PMD_NR; i++)
 			atomic_inc(&page[i]._mapcount);
 	}
 
 	lock_page_memcg(page);
-	if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
+	if ((PMDPageInPUD(page) &&
+		atomic_add_negative(-1, sub_compound_mapcount_ptr(page, 1))) ||
+	    (!PMDPageInPUD(page) &&
+		atomic_add_negative(-1, compound_mapcount_ptr(page)))) {
 		/* Last compound_mapcount is gone. */
 		__dec_lruvec_page_state(page, NR_ANON_THPS);
 		if (TestClearPageDoubleMap(page)) {
@@ -2367,7 +2779,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 
 	if (freeze) {
 		for (i = 0; i < HPAGE_PMD_NR; i++) {
-			page_remove_rmap(page + i, false);
+			page_remove_rmap(page + i, false, 0);
 			put_page(page + i);
 		}
 	}
@@ -2478,6 +2890,11 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
 	 * previously contain an hugepage: check if we need to split
 	 * an huge pmd.
 	 */
+	if (start & ~HPAGE_PUD_MASK &&
+	    (start & HPAGE_PUD_MASK) >= vma->vm_start &&
+	    (start & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE <= vma->vm_end)
+		split_huge_pud_address(vma, start, false, NULL);
+
 	if (start & ~HPAGE_PMD_MASK &&
 	    (start & HPAGE_PMD_MASK) >= vma->vm_start &&
 	    (start & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= vma->vm_end)
@@ -2488,6 +2905,11 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
 	 * previously contain an hugepage: check if we need to split
 	 * an huge pmd.
 	 */
+	if (end & ~HPAGE_PUD_MASK &&
+	    (end & HPAGE_PUD_MASK) >= vma->vm_start &&
+	    (end & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE <= vma->vm_end)
+		split_huge_pud_address(vma, end, false, NULL);
+
 	if (end & ~HPAGE_PMD_MASK &&
 	    (end & HPAGE_PMD_MASK) >= vma->vm_start &&
 	    (end & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= vma->vm_end)
@@ -2502,6 +2924,11 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
 		struct vm_area_struct *next = vma->vm_next;
 		unsigned long nstart = next->vm_start;
 		nstart += adjust_next << PAGE_SHIFT;
+		if (nstart & ~HPAGE_PUD_MASK &&
+		    (nstart & HPAGE_PUD_MASK) >= next->vm_start &&
+		    (nstart & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE <= next->vm_end)
+			split_huge_pud_address(next, nstart, false, NULL);
+
 		if (nstart & ~HPAGE_PMD_MASK &&
 		    (nstart & HPAGE_PMD_MASK) >= next->vm_start &&
 		    (nstart & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= next->vm_end)
@@ -2691,12 +3118,23 @@ int total_mapcount(struct page *page)
 	if (PageHuge(page))
 		return compound;
 	ret = compound;
-	for (i = 0; i < HPAGE_PMD_NR; i++)
-		ret += atomic_read(&page[i]._mapcount) + 1;
+	/* if PMD, read all base pages; if PUD, also read sub_compound_mapcount() */
+	if (compound_order(page) == HPAGE_PMD_ORDER) {
+		for (i = 0; i < thp_nr_pages(page); i++)
+			ret += atomic_read(&page[i]._mapcount) + 1;
+	} else if (compound_order(page) == HPAGE_PUD_ORDER) {
+		for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR)
+			ret += sub_compound_mapcount(&page[i]);
+		for (i = 0; i < thp_nr_pages(page); i++)
+			ret += atomic_read(&page[i]._mapcount) + 1;
+	} else
+		VM_BUG_ON_PAGE(1, page);
 	/* File pages has compound_mapcount included in _mapcount */
+	/* both PUD and PMD have HPAGE_PMD_NR sub pages */
 	if (!PageAnon(page))
 		return ret - compound * HPAGE_PMD_NR;
-	if (PageDoubleMap(page))
+	/* both PUD and PMD have HPAGE_PMD_NR sub pages */
+	if (PagePUDDoubleMap(page) || PageDoubleMap(page))
 		ret -= HPAGE_PMD_NR;
 	return ret;
 }
@@ -2742,13 +3180,38 @@ int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
 	page = compound_head(page);
 
 	_total_mapcount = ret = 0;
-	for (i = 0; i < HPAGE_PMD_NR; i++) {
-		mapcount = atomic_read(&page[i]._mapcount) + 1;
-		ret = max(ret, mapcount);
-		_total_mapcount += mapcount;
-	}
-	if (PageDoubleMap(page)) {
+	/* if PMD, read all base pages; if PUD, also read sub_compound_mapcount() */
+	if (compound_order(page) == HPAGE_PMD_ORDER) {
+		for (i = 0; i < thp_nr_pages(page); i++) {
+			mapcount = atomic_read(&page[i]._mapcount) + 1;
+			ret = max(ret, mapcount);
+			_total_mapcount += mapcount;
+		}
+	} else if (compound_order(page) == HPAGE_PUD_ORDER) {
+		for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR) {
+			int j;
+
+			mapcount = sub_compound_mapcount(&page[i]);
+			ret = max(ret, mapcount);
+			_total_mapcount += mapcount;
+
+			/* Triple mapped at base page size */
+			for (j = 0; j < HPAGE_PMD_NR; j++) {
+				mapcount = atomic_read(&page[i + j]._mapcount) + 1;
+				ret = max(ret, mapcount);
+				_total_mapcount += mapcount;
+			}
+
+			if (PageDoubleMap(&page[i])) {
+				ret -= 1;
+				_total_mapcount -= HPAGE_PMD_NR;
+			}
+		}
+	} else
+		VM_BUG_ON_PAGE(1, page);
+	if (PageDoubleMap(page) || PagePUDDoubleMap(page)) {
 		ret -= 1;
+		/* both PUD and PMD have HPAGE_PMD_NR sub pages */
 		_total_mapcount -= HPAGE_PMD_NR;
 	}
 	mapcount = compound_mapcount(page);
@@ -2994,6 +3457,9 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
 	return READ_ONCE(ds_queue->split_queue_len);
 }
 
+#define deferred_list_entry(x) (compound_head(list_entry((void *)x, \
+					struct page, mapping)))
+
 static unsigned long deferred_split_scan(struct shrinker *shrink,
 		struct shrink_control *sc)
 {
@@ -3027,12 +3493,18 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 
 	list_for_each_safe(pos, next, &list) {
-		page = list_entry((void *)pos, struct page, mapping);
+		page = deferred_list_entry(pos);
 		if (!trylock_page(page))
 			goto next;
 		/* split_huge_page() removes page from list on success */
-		if (!split_huge_page(page))
-			split++;
+		if (compound_order(page) == HPAGE_PUD_ORDER) {
+			if (!split_huge_pud_page(page))
+				split++;
+		} else if (compound_order(page) == HPAGE_PMD_ORDER) {
+			if (!split_huge_page(page))
+				split++;
+		} else
+			VM_BUG_ON_PAGE(1, page);
 		unlock_page(page);
 next:
 		put_page(page);
@@ -3135,7 +3607,7 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 	if (pmd_soft_dirty(pmdval))
 		pmdswp = pmd_swp_mksoft_dirty(pmdswp);
 	set_pmd_at(mm, address, pvmw->pmd, pmdswp);
-	page_remove_rmap(page, true);
+	page_remove_rmap(page, true, HPAGE_PMD_ORDER);
 	put_page(page);
 }
 
@@ -3161,7 +3633,7 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 
 	flush_cache_range(vma, mmun_start, mmun_start + HPAGE_PMD_SIZE);
 	if (PageAnon(new))
-		page_add_anon_rmap(new, vma, mmun_start, true);
+		page_add_anon_rmap(new, vma, mmun_start, true, HPAGE_PMD_ORDER);
 	else
 		page_add_file_rmap(new, true);
 	set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 27a51b202d1f..4113d7b66fee 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3993,7 +3993,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			set_page_dirty(page);
 
 		hugetlb_count_sub(pages_per_huge_page(h), mm);
-		page_remove_rmap(page, true);
+		page_remove_rmap(page, true, huge_page_order(h));
 
 		spin_unlock(ptl);
 		tlb_remove_page_size(tlb, page, huge_page_size(h));
@@ -4218,7 +4218,7 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		mmu_notifier_invalidate_range(mm, range.start, range.end);
 		set_huge_pte_at(mm, haddr, ptep,
 				make_huge_pte(vma, new_page, 1));
-		page_remove_rmap(old_page, true);
+		page_remove_rmap(old_page, true, huge_page_order(h));
 		hugepage_add_new_anon_rmap(new_page, vma, haddr);
 		set_page_huge_active(new_page);
 		/* Make the old page be freed below */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e749e568e1ea..84ce39652282 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -762,7 +762,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
 			 * superfluous.
 			 */
 			pte_clear(vma->vm_mm, address, _pte);
-			page_remove_rmap(src_page, false);
+			page_remove_rmap(src_page, false, 0);
 			spin_unlock(ptl);
 			free_page_and_swap_cache(src_page);
 		}
@@ -1172,7 +1172,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
-	page_add_new_anon_rmap(new_page, vma, address, true);
+	page_add_new_anon_rmap(new_page, vma, address, true, HPAGE_PMD_ORDER);
 	lru_cache_add_inactive_or_unevictable(new_page, vma);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, address, pmd, _pmd);
@@ -1475,7 +1475,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 		if (pte_none(*pte))
 			continue;
 		page = vm_normal_page(vma, addr, *pte);
-		page_remove_rmap(page, false);
+		page_remove_rmap(page, false, HPAGE_PMD_ORDER);
 	}
 
 	pte_unmap_unlock(start_pte, ptl);
diff --git a/mm/ksm.c b/mm/ksm.c
index 0aa2247bddd7..d778b4d1b626 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1153,7 +1153,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	 */
 	if (!is_zero_pfn(page_to_pfn(kpage))) {
 		get_page(kpage);
-		page_add_anon_rmap(kpage, vma, addr, false);
+		page_add_anon_rmap(kpage, vma, addr, false, 0);
 		newpte = mk_pte(kpage, vma->vm_page_prot);
 	} else {
 		newpte = pte_mkspecial(pfn_pte(page_to_pfn(kpage),
@@ -1177,7 +1177,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	ptep_clear_flush(vma, addr, ptep);
 	set_pte_at_notify(mm, addr, ptep, newpte);
 
-	page_remove_rmap(page, false);
+	page_remove_rmap(page, false, 0);
 	if (!page_mapped(page))
 		try_to_free_swap(page);
 	put_page(page);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index dc892a3c4b17..5d5be3b7c739 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3232,6 +3232,19 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 		head[i].mem_cgroup = memcg;
 	}
 }
+
+void mem_cgroup_split_huge_pud_fixup(struct page *head)
+{
+	int i;
+
+	if (mem_cgroup_disabled())
+		return;
+
+	for (i = HPAGE_PMD_NR; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR)
+		head[i].mem_cgroup = head->mem_cgroup;
+
+	/*__mod_memcg_state(head->mem_cgroup, MEMCG_RSS_HUGE, -HPAGE_PUD_NR);*/
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #ifdef CONFIG_MEMCG_SWAP
diff --git a/mm/memory.c b/mm/memory.c
index b88587256bc1..184d8eb2d060 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1090,7 +1090,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 					mark_page_accessed(page);
 			}
 			rss[mm_counter(page)]--;
-			page_remove_rmap(page, false);
+			page_remove_rmap(page, false, 0);
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
 			if (unlikely(__tlb_remove_page(tlb, page))) {
@@ -1118,7 +1118,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 			rss[mm_counter(page)]--;
-			page_remove_rmap(page, false);
+			page_remove_rmap(page, false, 0);
 			put_page(page);
 			continue;
 		}
@@ -2725,7 +2725,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		 * thread doing COW.
 		 */
 		ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
-		page_add_new_anon_rmap(new_page, vma, vmf->address, false);
+		page_add_new_anon_rmap(new_page, vma, vmf->address, false, 0);
 		lru_cache_add_inactive_or_unevictable(new_page, vma);
 		/*
 		 * We call the notify macro here because, when using secondary
@@ -2757,7 +2757,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 			 * mapcount is visible. So transitively, TLBs to
 			 * old page will be flushed before it can be reused.
 			 */
-			page_remove_rmap(old_page, false);
+			page_remove_rmap(old_page, false, 0);
 		}
 
 		/* Free the old page.. */
@@ -3273,10 +3273,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 
 	/* ksm created a completely new copy */
 	if (unlikely(page != swapcache && swapcache)) {
-		page_add_new_anon_rmap(page, vma, vmf->address, false);
+		page_add_new_anon_rmap(page, vma, vmf->address, false, 0);
 		lru_cache_add_inactive_or_unevictable(page, vma);
 	} else {
-		do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
+		do_page_add_anon_rmap(page, vma, vmf->address, exclusive, 0);
 	}
 
 	swap_free(entry);
@@ -3420,7 +3420,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	}
 
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-	page_add_new_anon_rmap(page, vma, vmf->address, false);
+	page_add_new_anon_rmap(page, vma, vmf->address, false, 0);
 	lru_cache_add_inactive_or_unevictable(page, vma);
 setpte:
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
@@ -3678,7 +3678,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
 	/* copy-on-write page */
 	if (write && !(vma->vm_flags & VM_SHARED)) {
 		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-		page_add_new_anon_rmap(page, vma, vmf->address, false);
+		page_add_new_anon_rmap(page, vma, vmf->address, false, 0);
 		lru_cache_add_inactive_or_unevictable(page, vma);
 	} else {
 		inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
@@ -4155,7 +4155,7 @@ static vm_fault_t create_huge_pud(struct vm_fault *vmf)
 			return ret;
 	}
 	/* COW or write-notify not handled on PUD level: split pud.*/
-	__split_huge_pud(vmf->vma, vmf->pud, vmf->address);
+	split_huge_pud(vmf->vma, vmf->pud, vmf->address);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 	return VM_FAULT_FALLBACK;
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index 0b945c8031be..be0e80b32686 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -270,7 +270,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 			set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
 
 			if (PageAnon(new))
-				page_add_anon_rmap(new, vma, pvmw.address, false);
+				page_add_anon_rmap(new, vma, pvmw.address, false, 0);
 			else
 				page_add_file_rmap(new, false);
 		}
@@ -2194,7 +2194,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	 * new page and page_add_new_anon_rmap guarantee the copy is
 	 * visible before the pagetable update.
 	 */
-	page_add_anon_rmap(new_page, vma, start, true);
+	page_add_anon_rmap(new_page, vma, start, true, HPAGE_PMD_ORDER);
 	/*
 	 * At this point the pmd is numa/protnone (i.e. non present) and the TLB
 	 * has already been flushed globally.  So no TLB can be currently
@@ -2211,7 +2211,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 
 	page_ref_unfreeze(page, 2);
 	mlock_migrate_page(new_page, page);
-	page_remove_rmap(page, true);
+	page_remove_rmap(page, true, HPAGE_PMD_ORDER);
 	set_page_owner_migrate_reason(new_page, MR_NUMA_MISPLACED);
 
 	spin_unlock(ptl);
@@ -2455,7 +2455,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			 * drop page refcount. Page won't be freed, as we took
 			 * a reference just above.
 			 */
-			page_remove_rmap(page, false);
+			page_remove_rmap(page, false, 0);
 			put_page(page);
 
 			if (pte_present(pte))
@@ -2940,7 +2940,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 		goto unlock_abort;
 
 	inc_mm_counter(mm, MM_ANONPAGES);
-	page_add_new_anon_rmap(page, vma, addr, false);
+	page_add_new_anon_rmap(page, vma, addr, false, 0);
 	if (!is_zone_device_page(page))
 		lru_cache_add_inactive_or_unevictable(page, vma);
 	get_page(page);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 763acbed66f1..97a4c7e4a579 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -679,6 +679,9 @@ void prep_compound_page(struct page *page, unsigned int order)
 	atomic_set(compound_mapcount_ptr(page), -1);
 	if (hpage_pincount_available(page))
 		atomic_set(compound_pincount_ptr(page), 0);
+	if (order == HPAGE_PUD_ORDER)
+		for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR)
+			atomic_set(sub_compound_mapcount_ptr(&page[i], 1), -1);
 }
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
@@ -1132,6 +1135,15 @@ static int free_tail_pages_check(struct page *head_page, struct page *page)
 		 */
 		break;
 	default:
+		/* sub_compound_mapcount is stored here */
+		if (compound_order(head_page) == HPAGE_PUD_ORDER &&
+			(page - head_page) % HPAGE_PMD_NR == 3) {
+			if (unlikely(atomic_read(&page->compound_mapcount) != -1)) {
+				pr_err("sub_compound_mapcount: %d\n", atomic_read(&page->compound_mapcount) + 1);
+				bad_page(page, "nonzero sub_compound_mapcount");
+			}
+			break;
+		}
 		if (page->mapping != TAIL_MAPPING) {
 			bad_page(page, "corrupted mapping in tail page");
 			goto out;
@@ -1183,8 +1195,14 @@ static __always_inline bool free_pages_prepare(struct page *page,
 
 		VM_BUG_ON_PAGE(compound && compound_order(page) != order, page);
 
-		if (compound)
+		if (compound) {
 			ClearPageDoubleMap(page);
+			if (order == HPAGE_PUD_ORDER) {
+				ClearPagePUDDoubleMap(page);
+				for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR)
+					ClearPageDoubleMap(&page[i]);
+			}
+		}
 		for (i = 1; i < (1 << order); i++) {
 			if (compound)
 				bad += free_tail_pages_check(page, page + i);
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index ef218b0f5d74..a8529afc55e5 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -245,6 +245,17 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 }
 #endif
 
+#ifndef __HAVE_ARCH_PUDP_INVALIDATE
+pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+		     pud_t *pudp)
+{
+	pud_t old = pudp_establish(vma, address, pudp, pud_mknotpresent(*pudp));
+
+	flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
+	return old;
+}
+#endif
+
 #ifndef pmdp_collapse_flush
 pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 			  pmd_t *pmdp)
diff --git a/mm/rmap.c b/mm/rmap.c
index 77cec0658b76..0bbaaa891b3c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1108,9 +1108,9 @@ static void __page_check_anon_rmap(struct page *page,
  * (but PageKsm is never downgraded to PageAnon).
  */
 void page_add_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address, bool compound)
+	struct vm_area_struct *vma, unsigned long address, bool compound, int order)
 {
-	do_page_add_anon_rmap(page, vma, address, compound ? RMAP_COMPOUND : 0);
+	do_page_add_anon_rmap(page, vma, address, compound ? RMAP_COMPOUND : 0, order);
 }
 
 /*
@@ -1119,7 +1119,7 @@ void page_add_anon_rmap(struct page *page,
  * Everybody else should continue to use page_add_anon_rmap above.
  */
 void do_page_add_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address, int flags)
+	struct vm_area_struct *vma, unsigned long address, int flags, int order)
 {
 	bool compound = flags & RMAP_COMPOUND;
 	bool first;
@@ -1130,10 +1130,21 @@ void do_page_add_anon_rmap(struct page *page,
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
 
 	if (compound) {
-		atomic_t *mapcount;
+		atomic_t *mapcount = NULL;
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-		mapcount = compound_mapcount_ptr(page);
+		if (compound_order(page) == HPAGE_PUD_ORDER) {
+			if (order == HPAGE_PUD_ORDER) {
+				mapcount = compound_mapcount_ptr(page);
+			} else if (order == HPAGE_PMD_ORDER) {
+				VM_BUG_ON(!PMDPageInPUD(page));
+				mapcount = sub_compound_mapcount_ptr(page, 1);
+			} else
+				VM_BUG_ON(1);
+		} else if (compound_order(page) == HPAGE_PMD_ORDER) {
+			mapcount = compound_mapcount_ptr(page);
+		} else
+			VM_BUG_ON(1);
 		first = atomic_inc_and_test(mapcount);
 	} else {
 		first = atomic_inc_and_test(&page->_mapcount);
@@ -1148,7 +1159,7 @@ void do_page_add_anon_rmap(struct page *page,
 		 * disabled.
 		 */
 		if (compound) {
-			if (nr == HPAGE_PMD_NR)
+			if (order == HPAGE_PMD_ORDER)
 				__inc_lruvec_page_state(page, NR_ANON_THPS);
 			else
 				__inc_lruvec_page_state(page, NR_ANON_THPS_PUD);
@@ -1181,7 +1192,7 @@ void do_page_add_anon_rmap(struct page *page,
  * Page does not have to be locked.
  */
 void page_add_new_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address, bool compound)
+	struct vm_area_struct *vma, unsigned long address, bool compound, int order)
 {
 	int nr = compound ? thp_nr_pages(page) : 1;
 
@@ -1194,10 +1205,15 @@ void page_add_new_anon_rmap(struct page *page,
 		if (hpage_pincount_available(page))
 			atomic_set(compound_pincount_ptr(page), 0);
 
-		if (nr == HPAGE_PMD_NR)
-			__inc_lruvec_page_state(page, NR_ANON_THPS);
-		else
+		if (order == HPAGE_PUD_ORDER) {
+			VM_BUG_ON(compound_order(page) != HPAGE_PUD_ORDER);
+			/* Anon THP always mapped first with PMD */
 			__inc_lruvec_page_state(page, NR_ANON_THPS_PUD);
+		} else if (order == HPAGE_PMD_ORDER) {
+			VM_BUG_ON(compound_order(page) != HPAGE_PMD_ORDER);
+			__inc_lruvec_page_state(page, NR_ANON_THPS);
+		} else
+			VM_BUG_ON(1);
 	} else {
 		/* Anon THP always mapped first with PMD */
 		VM_BUG_ON_PAGE(PageTransCompound(page), page);
@@ -1289,12 +1305,40 @@ static void page_remove_file_rmap(struct page *page, bool compound)
 		clear_page_mlock(page);
 }
 
-static void page_remove_anon_compound_rmap(struct page *page)
+static void page_remove_anon_compound_rmap(struct page *page, int order)
 {
-	int i, nr;
-
-	if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
-		return;
+	int i, nr = 0;
+	struct page *head = compound_head(page);
+
+	if (compound_order(head) == HPAGE_PUD_ORDER) {
+		if (order == HPAGE_PMD_ORDER) {
+			VM_BUG_ON(!PMDPageInPUD(page));
+			if (atomic_add_negative(-1, sub_compound_mapcount_ptr(page, 1))) {
+				if (TestClearPageDoubleMap(page)) {
+					/*
+					 * Subpages can be mapped with PTEs too. Check how many of
+					 * them are still mapped.
+					 */
+					for (i = 0; i < thp_nr_pages(head); i++) {
+						if (atomic_add_negative(-1, &head[i]._mapcount))
+							nr++;
+					}
+				}
+				__dec_node_page_state(page, NR_ANON_THPS);
+			}
+			nr += HPAGE_PMD_NR;
+			__mod_node_page_state(page_pgdat(head), NR_ANON_MAPPED, -nr);
+			return;
+		} else {
+			VM_BUG_ON(order != HPAGE_PUD_ORDER);
+			if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
+				return;
+		}
+	} else if (compound_order(head) == HPAGE_PMD_ORDER) {
+		if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
+			return;
+	} else
+		VM_BUG_ON_PAGE(1, page);
 
 	/* Hugepages are not counted in NR_ANON_PAGES for now. */
 	if (unlikely(PageHuge(page)))
@@ -1303,12 +1347,26 @@ static void page_remove_anon_compound_rmap(struct page *page)
 	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
 		return;
 
-	if (thp_nr_pages(page) == HPAGE_PMD_NR)
+	if (order == HPAGE_PMD_ORDER)
 		__dec_lruvec_page_state(page, NR_ANON_THPS);
-	else
+	else if (order == HPAGE_PUD_ORDER)
 		__dec_lruvec_page_state(page, NR_ANON_THPS_PUD);
+	else
+		VM_BUG_ON(1);
 
-	if (TestClearPageDoubleMap(page)) {
+	/* PMD-mapped PUD THP is handled above */
+	if (TestClearPagePUDDoubleMap(head)) {
+		VM_BUG_ON(!(compound_order(head) == HPAGE_PUD_ORDER || head == page));
+		/*
+		 * Subpages can be mapped with PMDs too. Check how many of
+		 * them are still mapped.
+		 */
+		for (i = 0, nr = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR) {
+			if (atomic_add_negative(-1, sub_compound_mapcount_ptr(&head[i], 1)))
+				nr += HPAGE_PMD_NR;
+		}
+	} else if (TestClearPageDoubleMap(head)) {
+		VM_BUG_ON(compound_order(head) != HPAGE_PMD_ORDER);
 		/*
 		 * Subpages can be mapped with PTEs too. Check how many of
 		 * them are still mapped.
@@ -1332,8 +1390,10 @@ static void page_remove_anon_compound_rmap(struct page *page)
 	if (unlikely(PageMlocked(page)))
 		clear_page_mlock(page);
 
-	if (nr)
-		__mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr);
+	if (nr) {
+		__mod_lruvec_page_state(head, NR_ANON_MAPPED, -nr);
+		deferred_split_huge_page(head);
+	}
 }
 
 /**
@@ -1343,7 +1403,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
  *
  * The caller needs to hold the pte lock.
  */
-void page_remove_rmap(struct page *page, bool compound)
+void page_remove_rmap(struct page *page, bool compound, int order)
 {
 	lock_page_memcg(page);
 
@@ -1353,7 +1413,7 @@ void page_remove_rmap(struct page *page, bool compound)
 	}
 
 	if (compound) {
-		page_remove_anon_compound_rmap(page);
+		page_remove_anon_compound_rmap(page, order);
 		goto out;
 	}
 
@@ -1734,7 +1794,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		 *
 		 * See Documentation/vm/mmu_notifier.rst
 		 */
-		page_remove_rmap(subpage, PageHuge(page));
+		page_remove_rmap(subpage, PageHuge(page), 0);
 		put_page(page);
 	}
 
diff --git a/mm/swap.c b/mm/swap.c
index 999a84dbe12c..b70631c71171 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -964,6 +964,37 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
 					  page_lru(page_tail));
 	}
 }
+
+/* used by __split_huge_pud_page_tail() */
+void lru_add_pud_page_tail(struct page *page, struct page *page_tail,
+		       struct lruvec *lruvec, struct list_head *list)
+{
+	VM_BUG_ON_PAGE(!PageHead(page), page);
+	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
+	VM_BUG_ON(NR_CPUS != 1 &&
+		  !spin_is_locked(&lruvec_pgdat(lruvec)->lru_lock));
+
+	if (!list)
+		SetPageLRU(page_tail);
+
+	if (likely(PageLRU(page)))
+		list_add_tail(&page_tail->lru, &page->lru);
+	else if (list) {
+		/* page reclaim is reclaiming a huge page */
+		get_page(page_tail);
+		list_add_tail(&page_tail->lru, list);
+	} else {
+		/*
+		 * Head page has not yet been counted, as an hpage,
+		 * so we must account for each subpage individually.
+		 *
+		 * Put page_tail on the list at the correct position
+		 * so they all end up in order.
+		 */
+		add_page_to_lru_list_tail(page_tail, lruvec,
+					  page_lru(page_tail));
+	}
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e3f771c2ad83..285edbcb5e22 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1921,9 +1921,9 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	set_pte_at(vma->vm_mm, addr, pte,
 		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	if (page == swapcache) {
-		page_add_anon_rmap(page, vma, addr, false);
+		page_add_anon_rmap(page, vma, addr, false, 0);
 	} else { /* ksm created a completely new copy */
-		page_add_new_anon_rmap(page, vma, addr, false);
+		page_add_new_anon_rmap(page, vma, addr, false, 0);
 		lru_cache_add_inactive_or_unevictable(page, vma);
 	}
 	swap_free(entry);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 9a3d451402d7..9b31d9beaa46 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -122,7 +122,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 		goto out_release_uncharge_unlock;
 
 	inc_mm_counter(dst_mm, MM_ANONPAGES);
-	page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
+	page_add_new_anon_rmap(page, dst_vma, dst_addr, false, 0);
 	lru_cache_add_inactive_or_unevictable(page, dst_vma);
 
 	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
diff --git a/mm/util.c b/mm/util.c
index bb902f5a6582..410f1ca0932a 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -713,17 +713,27 @@ struct address_space *page_mapping_file(struct page *page)
 int __page_mapcount(struct page *page)
 {
 	int ret;
+	struct page *head = compound_head(page);
 
+	/* base page mapping */
 	ret = atomic_read(&page->_mapcount) + 1;
+
+	/* PMD-level mapping inside a PUD THP */
+	if (compound_order(head) == HPAGE_PUD_ORDER) {
+		struct page *sub_compound_page = head +
+			(((page - head) / HPAGE_PMD_NR) * HPAGE_PMD_NR);
+
+		ret += sub_compound_mapcount(sub_compound_page);
+	}
 	/*
 	 * For file THP page->_mapcount contains total number of mapping
 	 * of the page: no need to look into compound_mapcount.
 	 */
 	if (!PageAnon(page) && !PageHuge(page))
 		return ret;
-	page = compound_head(page);
-	ret += atomic_read(compound_mapcount_ptr(page)) + 1;
-	if (PageDoubleMap(page))
+	/* highest compound mapping */
+	ret += atomic_read(compound_mapcount_ptr(head)) + 1;
+	if (PageDoubleMap(head))
 		ret--;
 	return ret;
 }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 3a01212b652c..dc7c2cec9102 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1330,6 +1330,10 @@ const char * const vmstat_text[] = {
 	"thp_fault_fallback_pud",
 	"thp_fault_fallback_pud_charge",
 	"thp_split_pud",
+	"thp_split_pud_page",
+	"thp_split_pud_page_failed",
+	"thp_zero_pud_page_alloc",
+	"thp_zero_pud_page_alloc_failed",
 #endif
 	"thp_zero_page_alloc",
 	"thp_zero_page_alloc_failed",
-- 
2.28.0
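
For reference, the sub_compound_mapcount_ptr()/PMDPageInPUD layout used above
can be modelled in a few lines of standalone userspace C. This is only an
illustration, not part of the patch; it assumes the x86_64 value
HPAGE_PMD_NR == 512 used throughout the series and mirrors the
"PMD-aligned base page + 3" slot chosen by sub_compound_mapcount_ptr():

/* Print which tail page would hold the sub-compound mapcount for a few
 * subpage indexes of a 1GB THP.
 */
#include <stdio.h>

#define HPAGE_PMD_NR	512UL
#define HPAGE_PUD_NR	(512UL * 512UL)

/* Mirrors sub_compound_mapcount_ptr(&head[pmd_base], 1):
 * the counter lives in page[pmd_base + 2 + 1].compound_mapcount.
 */
static unsigned long sub_compound_mapcount_slot(unsigned long subpage)
{
	unsigned long pmd_base = (subpage / HPAGE_PMD_NR) * HPAGE_PMD_NR;

	return pmd_base + 3;
}

int main(void)
{
	unsigned long idx[] = { 0, 1, 511, 512, 1000, HPAGE_PUD_NR - 1 };
	unsigned long i;

	for (i = 0; i < sizeof(idx) / sizeof(idx[0]); i++)
		printf("subpage %6lu -> counter in tail page %6lu\n",
		       idx[i], sub_compound_mapcount_slot(idx[i]));
	return 0;
}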




* [RFC PATCH 07/16] mm: stats: make smap stats understand PUD THPs.
  2020-09-02 18:06 [RFC PATCH 00/16] 1GB THP support on x86_64 Zi Yan
                   ` (5 preceding siblings ...)
  2020-09-02 18:06 ` [RFC PATCH 06/16] mm: thp: add 1GB THP split_huge_pud_page() function Zi Yan
@ 2020-09-02 18:06 ` Zi Yan
  2020-09-02 18:06 ` [RFC PATCH 08/16] mm: page_vma_walk: teach it about PMD-mapped PUD THP Zi Yan
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 82+ messages in thread
From: Zi Yan @ 2020-09-02 18:06 UTC (permalink / raw)
  To: linux-mm, Roman Gushchin
  Cc: Rik van Riel, Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 fs/proc/task_mmu.c | 63 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 58 insertions(+), 5 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 7fc9b3cc48d3..2ff80a9c8b57 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -430,10 +430,9 @@ static void smaps_page_accumulate(struct mem_size_stats *mss,
 }
 
 static void smaps_account(struct mem_size_stats *mss, struct page *page,
-		bool compound, bool young, bool dirty, bool locked)
+		unsigned long size, bool young, bool dirty, bool locked)
 {
-	int i, nr = compound ? compound_nr(page) : 1;
-	unsigned long size = nr * PAGE_SIZE;
+	int i, nr = size / PAGE_SIZE;
 
 	/*
 	 * First accumulate quantities that depend only on |size| and the type
@@ -536,7 +535,7 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 	if (!page)
 		return;
 
-	smaps_account(mss, page, false, pte_young(*pte), pte_dirty(*pte), locked);
+	smaps_account(mss, page, PAGE_SIZE, pte_young(*pte), pte_dirty(*pte), locked);
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -567,8 +566,44 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
 		/* pass */;
 	else
 		mss->file_thp += HPAGE_PMD_SIZE;
-	smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd), locked);
+	smaps_account(mss, page, HPAGE_PMD_SIZE, pmd_young(*pmd),
+		      pmd_dirty(*pmd), locked);
 }
+
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+static void smaps_pud_entry(pud_t *pud, unsigned long addr,
+		struct mm_walk *walk)
+{
+	struct mem_size_stats *mss = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	bool locked = !!(vma->vm_flags & VM_LOCKED);
+	struct page *page = NULL;
+
+	if (pud_present(*pud)) {
+		/* FOLL_DUMP will return -EFAULT on huge zero page */
+		page = follow_trans_huge_pud(vma, addr, pud, FOLL_DUMP);
+	}
+	if (IS_ERR_OR_NULL(page))
+		return;
+	if (PageAnon(page))
+		mss->anonymous_thp += HPAGE_PUD_SIZE;
+	else if (PageSwapBacked(page))
+		mss->shmem_thp += HPAGE_PUD_SIZE;
+	else if (is_zone_device_page(page))
+		/* pass */;
+	else
+		mss->file_thp += HPAGE_PUD_SIZE;
+	smaps_account(mss, page, HPAGE_PUD_SIZE, pud_young(*pud),
+		      pud_dirty(*pud), locked);
+}
+#else
+static void smaps_pud_entry(pud_t *pud, unsigned long addr,
+		struct mm_walk *walk)
+{
+}
+#endif
+
+
 #else
 static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
 		struct mm_walk *walk)
@@ -576,6 +611,23 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
 }
 #endif
 
+static int smaps_pud_range(pud_t *pud, unsigned long addr, unsigned long end,
+			   struct mm_walk *walk)
+{
+	struct vm_area_struct *vma = walk->vma;
+	spinlock_t *ptl;
+
+	ptl = pud_trans_huge_lock(pud, vma);
+	if (ptl) {
+		smaps_pud_entry(pud, addr, walk);
+		spin_unlock(ptl);
+		walk->action = ACTION_CONTINUE;
+	}
+
+	cond_resched();
+	return 0;
+}
+
 static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 			   struct mm_walk *walk)
 {
@@ -713,6 +765,7 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
 #endif /* HUGETLB_PAGE */
 
 static const struct mm_walk_ops smaps_walk_ops = {
+	.pud_entry		= smaps_pud_range,
 	.pmd_entry		= smaps_pte_range,
 	.hugetlb_entry		= smaps_hugetlb_range,
 };
-- 
2.28.0



^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [RFC PATCH 08/16] mm: page_vma_walk: teach it about PMD-mapped PUD THP.
  2020-09-02 18:06 [RFC PATCH 00/16] 1GB THP support on x86_64 Zi Yan
                   ` (6 preceding siblings ...)
  2020-09-02 18:06 ` [RFC PATCH 07/16] mm: stats: make smap stats understand PUD THPs Zi Yan
@ 2020-09-02 18:06 ` Zi Yan
  2020-09-02 18:06 ` [RFC PATCH 09/16] mm: thp: 1GB THP support in try_to_unmap() Zi Yan
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 82+ messages in thread
From: Zi Yan @ 2020-09-02 18:06 UTC (permalink / raw)
  To: linux-mm, Roman Gushchin
  Cc: Rik van Riel, Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel, Zi Yan

From: Zi Yan <ziy@nvidia.com>

We now have PMD-mapped and PTE-mapped PUD THPs, so page_vma_mapped_walk()
needs to handle both of them properly.
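
A rough sketch of what a rmap walker sees after this change (illustration
only, mirroring the try_to_unmap_one() changes later in this series; none
of this code is part of the patch):

	struct page_vma_mapped_walk pvmw = {
		.page = page,
		.vma = vma,
		.address = address,
	};

	while (page_vma_mapped_walk(&pvmw)) {
		if (pvmw.pte) {
			/* PTE-mapped: one base page of the THP */
		} else if (pvmw.pmd) {
			/* PMD-mapped PUD THP: one 2MB sub-compound page */
		} else if (pvmw.pud) {
			/* PUD-mapped: the whole 1GB THP */
		}
	}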

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/page_vma_mapped.c | 116 ++++++++++++++++++++++++++++++-------------
 1 file changed, 82 insertions(+), 34 deletions(-)

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index d9d39ec06e21..549e296287fd 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -52,6 +52,22 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
 	return true;
 }
 
+static bool map_pmd(struct page_vma_mapped_walk *pvmw)
+{
+	pmd_t pmde;
+
+	pvmw->pmd = pmd_offset(pvmw->pud, pvmw->address);
+	pmde = READ_ONCE(*pvmw->pmd);
+	if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
+		pvmw->ptl = pmd_lock(pvmw->vma->vm_mm, pvmw->pmd);
+		return true;
+	} else if (!pmd_present(pmde))
+		return false;
+
+	pvmw->ptl = pmd_lock(pvmw->vma->vm_mm, pvmw->pmd);
+	return true;
+}
+
 static inline bool pfn_is_match(struct page *page, unsigned long pfn)
 {
 	unsigned long page_pfn = page_to_pfn(page);
@@ -115,6 +131,38 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 	return pfn_is_match(pvmw->page, pfn);
 }
 
+/* 0: not mapped, 1: the PMD maps pvmw->page, 2: fall back to the pte level */
+static int check_pmd(struct page_vma_mapped_walk *pvmw)
+{
+	unsigned long pfn;
+
+	if (likely(pmd_trans_huge(*pvmw->pmd))) {
+		if (pvmw->flags & PVMW_MIGRATION)
+			return 0;
+		pfn = pmd_pfn(*pvmw->pmd);
+		if (!pfn_is_match(pvmw->page, pfn))
+			return 0;
+		return 1;
+	} else if (!pmd_present(*pvmw->pmd)) {
+		if (thp_migration_supported()) {
+			if (!(pvmw->flags & PVMW_MIGRATION))
+				return 0;
+			if (is_migration_entry(pmd_to_swp_entry(*pvmw->pmd))) {
+				swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd);
+
+				pfn = migration_entry_to_pfn(entry);
+				if (!pfn_is_match(pvmw->page, pfn))
+					return 0;
+				return 1;
+			}
+		}
+		return 0;
+	}
+	/* THP pmd was split under us: handle on pte level */
+	spin_unlock(pvmw->ptl);
+	pvmw->ptl = NULL;
+	return 2;
+}
 /**
  * page_vma_mapped_walk - check if @pvmw->page is mapped in @pvmw->vma at
  * @pvmw->address
@@ -146,14 +194,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t pude;
-	pmd_t pmde;
+	int pmd_res;
 
 	if (!pvmw->pte && !pvmw->pmd && pvmw->pud)
 		return not_found(pvmw);
 
 	/* The only possible pmd mapping has been handled on last iteration */
 	if (pvmw->pmd && !pvmw->pte)
-		return not_found(pvmw);
+		goto next_pmd;
 
 	if (pvmw->pte)
 		goto next_pte;
@@ -201,43 +249,43 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	} else if (!pud_present(pude))
 		return false;
 
-	pvmw->pmd = pmd_offset(pvmw->pud, pvmw->address);
-	/*
-	 * Make sure the pmd value isn't cached in a register by the
-	 * compiler and used as a stale value after we've observed a
-	 * subsequent update.
-	 */
-	pmde = READ_ONCE(*pvmw->pmd);
-	if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
-		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
-		if (likely(pmd_trans_huge(*pvmw->pmd))) {
-			if (pvmw->flags & PVMW_MIGRATION)
-				return not_found(pvmw);
-			if (pmd_page(*pvmw->pmd) != page)
-				return not_found(pvmw);
+	if (!map_pmd(pvmw))
+		goto next_pmd;
+	/* pmd locked after map_pmd  */
+	while (1) {
+		pmd_res = check_pmd(pvmw);
+		if (pmd_res == 1) /* pmd_page */
 			return true;
-		} else if (!pmd_present(*pvmw->pmd)) {
-			if (thp_migration_supported()) {
-				if (!(pvmw->flags & PVMW_MIGRATION))
-					return not_found(pvmw);
-				if (is_migration_entry(pmd_to_swp_entry(*pvmw->pmd))) {
-					swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd);
-
-					if (migration_entry_to_page(entry) != page)
-						return not_found(pvmw);
-					return true;
+		else if (pmd_res == 2) /* pmd entry  */
+			goto pte_level;
+next_pmd:
+		/* Only PMD-mapped PUD THP has next pmd  */
+		if (!(PageTransHuge(pvmw->page) && compound_order(pvmw->page) == HPAGE_PUD_ORDER))
+			return not_found(pvmw);
+		do {
+			pvmw->address += HPAGE_PMD_SIZE;
+			if (pvmw->address >= pvmw->vma->vm_end ||
+			    pvmw->address >=
+					__vma_address(pvmw->page, pvmw->vma) +
+					thp_nr_pages(pvmw->page) * PAGE_SIZE)
+				return not_found(pvmw);
+			/* Did we cross page table boundary? */
+			if (pvmw->address % PUD_SIZE == 0) {
+				if (pvmw->ptl) {
+					spin_unlock(pvmw->ptl);
+					pvmw->ptl = NULL;
 				}
+				goto restart;
+			} else {
+				pvmw->pmd++;
 			}
-			return not_found(pvmw);
-		} else {
-			/* THP pmd was split under us: handle on pte level */
-			spin_unlock(pvmw->ptl);
-			pvmw->ptl = NULL;
-		}
-	} else if (!pmd_present(pmde)) {
-		return false;
+		} while (pmd_none(*pvmw->pmd));
+
+		if (!pvmw->ptl)
+			pvmw->ptl = pmd_lock(mm, pvmw->pmd);
 	}
 
+pte_level:
 	if (!map_pte(pvmw))
 		goto next_pte;
 	while (1) {
-- 
2.28.0



^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [RFC PATCH 09/16] mm: thp: 1GB THP support in try_to_unmap().
  2020-09-02 18:06 [RFC PATCH 00/16] 1GB THP support on x86_64 Zi Yan
                   ` (7 preceding siblings ...)
  2020-09-02 18:06 ` [RFC PATCH 08/16] mm: page_vma_walk: teach it about PMD-mapped PUD THP Zi Yan
@ 2020-09-02 18:06 ` Zi Yan
  2020-09-02 18:06 ` [RFC PATCH 10/16] mm: thp: split 1GB THPs at page reclaim Zi Yan
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 82+ messages in thread
From: Zi Yan @ 2020-09-02 18:06 UTC (permalink / raw)
  To: linux-mm, Roman Gushchin
  Cc: Rik van Riel, Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Unmap the subpages of different-sized THPs properly in try_to_unmap(),
covering PTE-, PMD-, and PUD-level mappings.
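
For callers, the main visible change is the new TTU_SPLIT_HUGE_PUD flag;
a sketch of how reclaim is expected to use it (taken from the vmscan
change later in this series, not part of this patch):

	enum ttu_flags flags = ttu_flags | TTU_BATCH_FLUSH;

	if (unlikely(PageTransHuge(page))) {
		if (compound_order(page) == HPAGE_PMD_ORDER)
			flags |= TTU_SPLIT_HUGE_PMD;
		else if (compound_order(page) == HPAGE_PUD_ORDER)
			flags |= TTU_SPLIT_HUGE_PUD;
	}
	try_to_unmap(page, flags);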

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/migrate.c |   2 +-
 mm/rmap.c    | 159 +++++++++++++++++++++++++++++++++++++--------------
 2 files changed, 116 insertions(+), 45 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index be0e80b32686..df069a55722e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -225,7 +225,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
 		/* PMD-mapped THP migration entry */
-		if (!pvmw.pte) {
+		if (!pvmw.pte && pvmw.pmd) {
 			VM_BUG_ON_PAGE(PageHuge(page) || !PageTransCompound(page), page);
 			remove_migration_pmd(&pvmw, new);
 			continue;
diff --git a/mm/rmap.c b/mm/rmap.c
index 0bbaaa891b3c..6c788abdb0b9 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1123,6 +1123,7 @@ void do_page_add_anon_rmap(struct page *page,
 {
 	bool compound = flags & RMAP_COMPOUND;
 	bool first;
+	struct page *head = compound_head(page);
 
 	if (unlikely(PageKsm(page)))
 		lock_page_memcg(page);
@@ -1132,8 +1133,8 @@ void do_page_add_anon_rmap(struct page *page,
 	if (compound) {
 		atomic_t *mapcount = NULL;
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-		if (compound_order(page) == HPAGE_PUD_ORDER) {
+		VM_BUG_ON_PAGE(!PMDPageInPUD(page) && !PageTransHuge(page), page);
+		if (compound_order(head) == HPAGE_PUD_ORDER) {
 			if (order == HPAGE_PUD_ORDER) {
 				mapcount = compound_mapcount_ptr(page);
 			} else if (order == HPAGE_PMD_ORDER) {
@@ -1141,7 +1142,7 @@ void do_page_add_anon_rmap(struct page *page,
 				mapcount = sub_compound_mapcount_ptr(page, 1);
 			} else
 				VM_BUG_ON(1);
-		} else if (compound_order(page) == HPAGE_PMD_ORDER) {
+		} else if (compound_order(head) == HPAGE_PMD_ORDER) {
 			mapcount = compound_mapcount_ptr(page);
 		} else
 			VM_BUG_ON(1);
@@ -1151,7 +1152,8 @@ void do_page_add_anon_rmap(struct page *page,
 	}
 
 	if (first) {
-		int nr = compound ? thp_nr_pages(page) : 1;
+		/* number of base pages covered by a mapping of this order */
+		int nr = 1 << order;
 		/*
 		 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
 		 * these counters are not modified in interrupt context, and
@@ -1460,10 +1462,13 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		.address = address,
 	};
 	pte_t pteval;
-	struct page *subpage;
+	pmd_t pmdval;
+	pud_t pudval;
+	struct page *subpage = NULL;
 	bool ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
+	int order = 0;
 
 	/* munlock has nothing to gain from examining un-locked vmas */
 	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
@@ -1473,6 +1478,11 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	    is_zone_device_page(page) && !is_device_private_page(page))
 		return true;
 
+	if (flags & TTU_SPLIT_HUGE_PUD) {
+		split_huge_pud_address(vma, address,
+				flags & TTU_SPLIT_FREEZE, page);
+	}
+
 	if (flags & TTU_SPLIT_HUGE_PMD) {
 		split_huge_pmd_address(vma, address,
 				flags & TTU_SPLIT_FREEZE, page);
@@ -1505,7 +1515,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	while (page_vma_mapped_walk(&pvmw)) {
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
 		/* PMD-mapped THP migration entry */
-		if (!pvmw.pte && (flags & TTU_MIGRATION)) {
+		if (!pvmw.pte && pvmw.pmd && (flags & TTU_MIGRATION)) {
 			VM_BUG_ON_PAGE(PageHuge(page) || !PageTransCompound(page), page);
 
 			set_pmd_migration_entry(&pvmw, page);
@@ -1537,9 +1547,18 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		}
 
 		/* Unexpected PMD-mapped THP? */
-		VM_BUG_ON_PAGE(!pvmw.pte, page);
 
-		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
+		if (pvmw.pte) {
+			subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
+			order = 0;
+		} else if (!pvmw.pte && pvmw.pmd) {
+			subpage = page - page_to_pfn(page) + pmd_pfn(*pvmw.pmd);
+			order = HPAGE_PMD_ORDER;
+		} else if (!pvmw.pte && !pvmw.pmd && pvmw.pud) {
+			subpage = page - page_to_pfn(page) + pud_pfn(*pvmw.pud);
+			order = HPAGE_PUD_ORDER;
+		}
+		VM_BUG_ON(!subpage);
 		address = pvmw.address;
 
 		if (PageHuge(page)) {
@@ -1617,16 +1636,26 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		}
 
 		if (!(flags & TTU_IGNORE_ACCESS)) {
-			if (ptep_clear_flush_young_notify(vma, address,
-						pvmw.pte)) {
-				ret = false;
-				page_vma_mapped_walk_done(&pvmw);
-				break;
+			if ((pvmw.pte &&
+				 ptep_clear_flush_young_notify(vma, address, pvmw.pte)) ||
+				((!pvmw.pte && pvmw.pmd) &&
+				 pmdp_clear_flush_young_notify(vma, address, pvmw.pmd)) ||
+				((!pvmw.pte && !pvmw.pmd && pvmw.pud) &&
+				 pudp_clear_flush_young_notify(vma, address, pvmw.pud))
+				) {
+				ret = false;
+				page_vma_mapped_walk_done(&pvmw);
+				break;
 			}
 		}
 
 		/* Nuke the page table entry. */
-		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+		if (pvmw.pte)
+			flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+		else if (!pvmw.pte && pvmw.pmd)
+			flush_cache_page(vma, address, pmd_pfn(*pvmw.pmd));
+		else if (!pvmw.pte && !pvmw.pmd && pvmw.pud)
+			flush_cache_page(vma, address, pud_pfn(*pvmw.pud));
 		if (should_defer_flush(mm, flags)) {
 			/*
 			 * We clear the PTE but do not flush so potentially
@@ -1636,16 +1665,34 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			 * transition on a cached TLB entry is written through
 			 * and traps if the PTE is unmapped.
 			 */
-			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
+			if (pvmw.pte) {
+				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
+
+				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
+			} else if (!pvmw.pte && pvmw.pmd) {
+				pmdval = pmdp_huge_get_and_clear(mm, address, pvmw.pmd);
 
-			set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
+				set_tlb_ubc_flush_pending(mm, pmd_dirty(pmdval));
+			} else if (!pvmw.pte && !pvmw.pmd && pvmw.pud) {
+				pudval = pudp_huge_get_and_clear(mm, address, pvmw.pud);
+
+				set_tlb_ubc_flush_pending(mm, pud_dirty(pudval));
+			}
 		} else {
-			pteval = ptep_clear_flush(vma, address, pvmw.pte);
+			if (pvmw.pte)
+				pteval = ptep_clear_flush(vma, address, pvmw.pte);
+			else if (!pvmw.pte && pvmw.pmd)
+				pmdval = pmdp_huge_clear_flush(vma, address, pvmw.pmd);
+			else if (!pvmw.pte && !pvmw.pmd && pvmw.pud)
+				pudval = pudp_huge_clear_flush(vma, address, pvmw.pud);
 		}
 
 		/* Move the dirty bit to the page. Now the pte is gone. */
-		if (pte_dirty(pteval))
-			set_page_dirty(page);
+		if ((pvmw.pte && pte_dirty(pteval)) ||
+		    ((!pvmw.pte && pvmw.pmd) && pmd_dirty(pmdval)) ||
+		    ((!pvmw.pte && !pvmw.pmd && pvmw.pud) &&
+		     pud_dirty(pudval)))
+			set_page_dirty(page);
 
 		/* Update high watermark before we lower rss */
 		update_hiwater_rss(mm);
@@ -1680,35 +1727,59 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		} else if (IS_ENABLED(CONFIG_MIGRATION) &&
 				(flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
 			swp_entry_t entry;
-			pte_t swp_pte;
 
-			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
-				set_pte_at(mm, address, pvmw.pte, pteval);
-				ret = false;
-				page_vma_mapped_walk_done(&pvmw);
-				break;
-			}
+			if (pvmw.pte) {
+				pte_t swp_pte;
 
-			/*
-			 * Store the pfn of the page in a special migration
-			 * pte. do_swap_page() will wait until the migration
-			 * pte is removed and then restart fault handling.
-			 */
-			entry = make_migration_entry(subpage,
-					pte_write(pteval));
-			swp_pte = swp_entry_to_pte(entry);
-			if (pte_soft_dirty(pteval))
-				swp_pte = pte_swp_mksoft_dirty(swp_pte);
-			if (pte_uffd_wp(pteval))
-				swp_pte = pte_swp_mkuffd_wp(swp_pte);
-			set_pte_at(mm, address, pvmw.pte, swp_pte);
-			/*
-			 * No need to invalidate here it will synchronize on
-			 * against the special swap migration pte.
-			 */
+				if (arch_unmap_one(mm, vma, address, pteval) < 0) {
+					set_pte_at(mm, address, pvmw.pte, pteval);
+					ret = false;
+					page_vma_mapped_walk_done(&pvmw);
+					break;
+				}
+
+				/*
+				 * Store the pfn of the page in a special migration
+				 * pte. do_swap_page() will wait until the migration
+				 * pte is removed and then restart fault handling.
+				 */
+				entry = make_migration_entry(subpage,
+						pte_write(pteval));
+				swp_pte = swp_entry_to_pte(entry);
+				if (pte_soft_dirty(pteval))
+					swp_pte = pte_swp_mksoft_dirty(swp_pte);
+				if (pte_uffd_wp(pteval))
+					swp_pte = pte_swp_mkuffd_wp(swp_pte);
+				set_pte_at(mm, address, pvmw.pte, swp_pte);
+				/*
+				 * No need to invalidate here it will synchronize on
+				 * against the special swap migration pte.
+				 */
+			} else if (!pvmw.pte && pvmw.pmd) {
+				pmd_t swp_pmd;
+				/*
+				 * Store the pfn of the page in a special migration
+				 * pte. do_swap_page() will wait until the migration
+				 * pte is removed and then restart fault handling.
+				 */
+				entry = make_migration_entry(subpage,
+						pmd_write(pmdval));
+				swp_pmd = swp_entry_to_pmd(entry);
+				if (pmd_soft_dirty(pmdval))
+					swp_pmd = pmd_swp_mksoft_dirty(swp_pmd);
+				set_pmd_at(mm, address, pvmw.pmd, swp_pmd);
+				/*
+				 * No need to invalidate here it will synchronize on
+				 * against the special swap migration pte.
+				 */
+			} else if (!pvmw.pte && !pvmw.pmd && pvmw.pud) {
+				VM_BUG_ON(1);
+			}
 		} else if (PageAnon(page)) {
 			swp_entry_t entry = { .val = page_private(subpage) };
 			pte_t swp_pte;
+
+			VM_BUG_ON(!pvmw.pte);
 			/*
 			 * Store the swap location in the pte.
 			 * See handle_pte_fault() ...
@@ -1794,7 +1865,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		 *
 		 * See Documentation/vm/mmu_notifier.rst
 		 */
-		page_remove_rmap(subpage, PageHuge(page), 0);
+		page_remove_rmap(subpage, PageHuge(page) || order >= HPAGE_PMD_ORDER, order);
 		put_page(page);
 	}
 
-- 
2.28.0



^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [RFC PATCH 10/16] mm: thp: split 1GB THPs at page reclaim.
  2020-09-02 18:06 [RFC PATCH 00/16] 1GB THP support on x86_64 Zi Yan
                   ` (8 preceding siblings ...)
  2020-09-02 18:06 ` [RFC PATCH 09/16] mm: thp: 1GB THP support in try_to_unmap() Zi Yan
@ 2020-09-02 18:06 ` Zi Yan
  2020-09-02 18:06 ` [RFC PATCH 11/16] mm: thp: 1GB THP follow_p*d_page() support Zi Yan
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 82+ messages in thread
From: Zi Yan @ 2020-09-02 18:06 UTC (permalink / raw)
  To: linux-mm, Roman Gushchin
  Cc: Rik van Riel, Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel, Zi Yan

From: Zi Yan <ziy@nvidia.com>

We cannot swap out 1GB THPs, so split them before swapping them out.
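
A condensed view of the resulting flow in shrink_page_list() for an
anonymous 1GB THP (a sketch that assumes split_huge_pud_page_to_list()
leaves PMD-order pages on @page_list, as the earlier split patch does):

	/* before add_to_swap(): 1GB THPs cannot get a swap slot */
	if (compound_order(page) == HPAGE_PUD_ORDER) {
		if (split_huge_pud_page_to_list(page, page_list))
			goto activate_locked;	/* cannot split, keep it */
		/* @page is now a PMD-order THP; the existing 2MB THP
		 * swap path (CONFIG_THP_SWAP) takes over from here */
		nr_pages = HPAGE_PMD_NR;
	}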

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/swap_slots.c |  2 ++
 mm/vmscan.c     | 58 +++++++++++++++++++++++++++++++++++++------------
 2 files changed, 46 insertions(+), 14 deletions(-)

diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 3e6453573a89..65b8742a0446 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -312,6 +312,8 @@ swp_entry_t get_swap_page(struct page *page)
 	entry.val = 0;
 
 	if (PageTransHuge(page)) {
+		if (compound_order(page) == HPAGE_PUD_ORDER)
+			return entry;
 		if (IS_ENABLED(CONFIG_THP_SWAP))
 			get_swap_pages(1, &entry, HPAGE_PMD_NR);
 		goto out;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 99e1796eb833..617d15a041f8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1240,23 +1240,49 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 				if (!(sc->gfp_mask & __GFP_IO))
 					goto keep_locked;
 				if (PageTransHuge(page)) {
-					/* cannot split THP, skip it */
-					if (!can_split_huge_page(page, NULL))
-						goto activate_locked;
-					/*
-					 * Split pages without a PMD map right
-					 * away. Chances are some or all of the
-					 * tail pages can be freed without IO.
-					 */
-					if (!compound_mapcount(page) &&
-					    split_huge_page_to_list(page,
-								    page_list))
+					if (compound_order(page) == HPAGE_PUD_ORDER) {
+						/* cannot split THP, skip it */
+						if (!can_split_huge_pud_page(page, NULL))
+							goto activate_locked;
+						/*
+						 * Split pages without a PUD map right
+						 * away. Chances are some or all of the
+						 * tail pages can be freed without IO.
+						 */
+						if (!compound_mapcount(page) &&
+							split_huge_pud_page_to_list(page,
+										page_list))
+							goto activate_locked;
+					}
+					if (compound_order(page) == HPAGE_PMD_ORDER) {
+						/* cannot split THP, skip it */
+						if (!can_split_huge_page(page, NULL))
+							goto activate_locked;
+						/*
+						 * Split pages without a PMD map right
+						 * away. Chances are some or all of the
+						 * tail pages can be freed without IO.
+						 */
+						if (!compound_mapcount(page) &&
+							split_huge_page_to_list(page,
+										page_list))
+							goto activate_locked;
+					}
+				}
+				/* Split PUD THPs before swapping */
+				if (compound_order(page) == HPAGE_PUD_ORDER) {
+					if (split_huge_pud_page_to_list(page, page_list))
 						goto activate_locked;
+					else {
+						sc->nr_scanned -= (nr_pages - HPAGE_PMD_NR);
+						nr_pages = HPAGE_PMD_NR;
+					}
 				}
 				if (!add_to_swap(page)) {
 					if (!PageTransHuge(page))
 						goto activate_locked_split;
 					/* Fallback to swap normal pages */
+					VM_BUG_ON_PAGE(compound_order(page) != HPAGE_PMD_ORDER, page);
 					if (split_huge_page_to_list(page,
 								    page_list))
 						goto activate_locked;
@@ -1273,6 +1299,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 				mapping = page_mapping(page);
 			}
 		} else if (unlikely(PageTransHuge(page))) {
+			VM_BUG_ON_PAGE(compound_order(page) != HPAGE_PMD_ORDER, page);
 			/* Split file THP */
 			if (split_huge_page_to_list(page, page_list))
 				goto keep_locked;
@@ -1298,9 +1325,12 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 			enum ttu_flags flags = ttu_flags | TTU_BATCH_FLUSH;
 			bool was_swapbacked = PageSwapBacked(page);
 
-			if (unlikely(PageTransHuge(page)))
-				flags |= TTU_SPLIT_HUGE_PMD;
-
+			if (unlikely(PageTransHuge(page))) {
+				if (compound_order(page) == HPAGE_PMD_ORDER)
+					flags |= TTU_SPLIT_HUGE_PMD;
+				else if (compound_order(page) == HPAGE_PUD_ORDER)
+					flags |= TTU_SPLIT_HUGE_PUD;
+			}
 			if (!try_to_unmap(page, flags)) {
 				stat->nr_unmap_fail += nr_pages;
 				if (!was_swapbacked && PageSwapBacked(page))
-- 
2.28.0



^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [RFC PATCH 11/16] mm: thp: 1GB THP follow_p*d_page() support.
  2020-09-02 18:06 [RFC PATCH 00/16] 1GB THP support on x86_64 Zi Yan
                   ` (9 preceding siblings ...)
  2020-09-02 18:06 ` [RFC PATCH 10/16] mm: thp: split 1GB THPs at page reclaim Zi Yan
@ 2020-09-02 18:06 ` Zi Yan
  2020-09-02 18:06 ` [RFC PATCH 12/16] mm: support 1GB THP pagemap support Zi Yan
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 82+ messages in thread
From: Zi Yan @ 2020-09-02 18:06 UTC (permalink / raw)
  To: linux-mm, Roman Gushchin
  Cc: Rik van Riel, Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Add follow_page support for 1GB THPs.
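
For context, a sketch of what the new PUD path hands back to the generic
GUP loop (names taken from the hunks below; treat it as an illustration,
not the exact code):

	/* inside follow_pud_mask(), for a present huge pud */
	page = follow_trans_huge_pud(vma, address, pud, flags);
	/* i.e. pud_page(*pud) + ((address & ~HPAGE_PUD_MASK) >> PAGE_SHIFT) */
	ctx->page_mask = HPAGE_PUD_NR - 1;

ctx->page_mask tells __get_user_pages() how many base pages the returned
mapping covers, so the walk can step over a whole 1GB mapping at once when
it does not need to record every subpage.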

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/huge_mm.h | 11 +++++++
 mm/gup.c                | 60 ++++++++++++++++++++++++++++++++-
 mm/huge_memory.c        | 73 ++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 142 insertions(+), 2 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 589e5af5a1c2..c7bc40c4a5e2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -20,6 +20,10 @@ extern int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 extern void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud);
 extern int do_huge_pud_anonymous_page(struct vm_fault *vmf);
 extern vm_fault_t do_huge_pud_wp_page(struct vm_fault *vmf, pud_t orig_pud);
+extern struct page *follow_trans_huge_pud(struct vm_area_struct *vma,
+					  unsigned long addr,
+					  pud_t *pud,
+					  unsigned int flags);
 #else
 static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
 {
@@ -32,6 +36,13 @@ extern vm_fault_t do_huge_pud_wp_page(struct vm_fault *vmf, pud_t orig_pud)
 {
 	return VM_FAULT_FALLBACK;
 }
+static inline struct page *follow_trans_huge_pud(struct vm_area_struct *vma,
+					  unsigned long addr,
+					  pud_t *pud,
+					  unsigned int flags)
+{
+	return NULL;
+}
 #endif
 
 extern vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd);
diff --git a/mm/gup.c b/mm/gup.c
index bd883a112724..4b32ae3c5fa2 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -698,10 +698,68 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
 		if (page)
 			return page;
 	}
+
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+	if (likely(!pud_trans_huge(*pud))) {
+		if (unlikely(pud_bad(*pud)))
+			return no_page_table(vma, flags);
+		return follow_pmd_mask(vma, address, pud, flags, ctx);
+	}
+
+	ptl = pud_lock(mm, pud);
+
+	if (unlikely(!pud_trans_huge(*pud))) {
+		spin_unlock(ptl);
+		if (unlikely(pud_bad(*pud)))
+			return no_page_table(vma, flags);
+		return follow_pmd_mask(vma, address, pud, flags, ctx);
+	}
+
+	if (flags & FOLL_SPLIT) {
+		int ret;
+		pmd_t *pmd = NULL;
+
+		page = pud_page(*pud);
+		if (is_huge_zero_page(page)) {
+
+			spin_unlock(ptl);
+			ret = 0;
+			split_huge_pud(vma, pud, address);
+			pmd = pmd_offset(pud, address);
+			split_huge_pmd(vma, pmd, address);
+			if (pmd_trans_unstable(pmd))
+				ret = -EBUSY;
+		} else {
+			get_page(page);
+			spin_unlock(ptl);
+			lock_page(page);
+			ret = split_huge_pud_page(page);
+			if (!ret)
+				ret = split_huge_page(page);
+			else {
+				unlock_page(page);
+				put_page(page);
+				goto out;
+			}
+			unlock_page(page);
+			put_page(page);
+			if (pud_none(*pud))
+				return no_page_table(vma, flags);
+			pmd = pmd_offset(pud, address);
+		}
+out:
+		return ret ? ERR_PTR(ret) :
+			follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
+	}
+	page = follow_trans_huge_pud(vma, address, pud, flags);
+	spin_unlock(ptl);
+	ctx->page_mask = HPAGE_PUD_NR - 1;
+	return page;
+#else
 	if (unlikely(pud_bad(*pud)))
 		return no_page_table(vma, flags);
-
 	return follow_pmd_mask(vma, address, pud, flags, ctx);
+#endif
 }
 
 static struct page *follow_p4d_mask(struct vm_area_struct *vma,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 398f1b52f789..e209c2dfc5b7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1259,6 +1259,77 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 	return page;
 }
 
+/*
+ * FOLL_FORCE can write to even unwritable pmd's, but only
+ * after we've gone through a COW cycle and they are dirty.
+ */
+static inline bool can_follow_write_pud(pud_t pud, unsigned int flags)
+{
+	return pud_write(pud) ||
+	       ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pud_dirty(pud));
+}
+
+struct page *follow_trans_huge_pud(struct vm_area_struct *vma,
+				   unsigned long addr,
+				   pud_t *pud,
+				   unsigned int flags)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct page *page = NULL;
+
+	assert_spin_locked(pud_lockptr(mm, pud));
+
+	if (flags & FOLL_WRITE && !can_follow_write_pud(*pud, flags))
+		goto out;
+
+	/* Avoid dumping huge zero page */
+	if ((flags & FOLL_DUMP) && is_huge_zero_pud(*pud))
+		return ERR_PTR(-EFAULT);
+
+	/* Full NUMA hinting faults to serialise migration in fault paths */
+	/* TODO: also check pud_protnone() here once it exists */
+	if (flags & FOLL_NUMA)
+		goto out;
+
+	page = pud_page(*pud);
+	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+	if (flags & FOLL_TOUCH)
+		touch_pud(vma, addr, pud, flags);
+	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
+		/*
+		 * We don't mlock() pte-mapped THPs. This way we can avoid
+		 * leaking mlocked pages into non-VM_LOCKED VMAs.
+		 *
+		 * For anon THP:
+		 *
+		 * We do the same thing as PMD-level THP.
+		 *
+		 * For file THP:
+		 *
+		 * No support yet.
+		 *
+		 */
+
+		if (PageAnon(page) && compound_mapcount(page) != 1)
+			goto skip_mlock;
+		if (PagePUDDoubleMap(page) || !page->mapping)
+			goto skip_mlock;
+		if (!trylock_page(page))
+			goto skip_mlock;
+		lru_add_drain();
+		if (page->mapping && !PagePUDDoubleMap(page))
+			mlock_vma_page(page);
+		unlock_page(page);
+	}
+skip_mlock:
+	page += (addr & ~HPAGE_PUD_MASK) >> PAGE_SHIFT;
+	VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
+	if (flags & FOLL_GET)
+		get_page(page);
+
+out:
+	return page;
+}
 int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
 		  struct vm_area_struct *vma)
@@ -1501,7 +1572,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 		goto out;
 
 	page = pmd_page(*pmd);
-	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page) && !PMDPageInPUD(page), page);
 
 	if (!try_grab_page(page, flags))
 		return ERR_PTR(-ENOMEM);
-- 
2.28.0



^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [RFC PATCH 12/16] mm: support 1GB THP pagemap support.
  2020-09-02 18:06 [RFC PATCH 00/16] 1GB THP support on x86_64 Zi Yan
                   ` (10 preceding siblings ...)
  2020-09-02 18:06 ` [RFC PATCH 11/16] mm: thp: 1GB THP follow_p*d_page() support Zi Yan
@ 2020-09-02 18:06 ` Zi Yan
  2020-09-02 18:06 ` [RFC PATCH 13/16] mm: thp: add a knob to enable/disable 1GB THPs Zi Yan
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 82+ messages in thread
From: Zi Yan @ 2020-09-02 18:06 UTC (permalink / raw)
  To: linux-mm, Roman Gushchin
  Cc: Rik van Riel, Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Report PUD-mapped THP entries properly in /proc/pid/pagemap.
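
A hypothetical userspace check (not part of the patch) that reads
/proc/self/pagemap for a buffer; with this patch, a PUD-mapped range
reports present entries with consecutive PFNs instead of being skipped:

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	static int dump_pagemap(void *addr, size_t len)
	{
		int fd = open("/proc/self/pagemap", O_RDONLY);
		long psize = sysconf(_SC_PAGESIZE);
		uint64_t entry;
		size_t i;

		if (fd < 0)
			return -1;
		for (i = 0; i < len / psize; i++) {
			off_t off = ((uintptr_t)addr / psize + i) * sizeof(entry);

			if (pread(fd, &entry, sizeof(entry), off) != sizeof(entry))
				break;
			/* bit 63: present; bits 0-54: PFN (zero w/o CAP_SYS_ADMIN) */
			printf("page %zu present=%d pfn=0x%llx\n", i,
			       (int)(entry >> 63),
			       (unsigned long long)(entry & ((1ULL << 55) - 1)));
		}
		close(fd);
		return 0;
	}

	int main(void)
	{
		long psize = sysconf(_SC_PAGESIZE);
		char *buf = mmap(NULL, 16 * psize, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED)
			return 1;
		buf[0] = 1;	/* fault at least the first page in */
		return dump_pagemap(buf, 16 * psize);
	}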

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 fs/proc/task_mmu.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 2ff80a9c8b57..7254c7ecf659 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1557,6 +1557,64 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 	return err;
 }
 
+static int pagemap_pud_range(pud_t *pudp, unsigned long addr, unsigned long end,
+			     struct mm_walk *walk)
+{
+	struct vm_area_struct *vma = walk->vma;
+	struct pagemapread *pm = walk->private;
+	spinlock_t *ptl;
+	int err = 0;
+
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+	ptl = pud_trans_huge_lock(pudp, vma);
+	if (ptl) {
+		u64 flags = 0, frame = 0;
+		pud_t pud = *pudp;
+		struct page *page = NULL;
+
+		if (vma->vm_flags & VM_SOFTDIRTY)
+			flags |= PM_SOFT_DIRTY;
+
+		if (pud_present(pud)) {
+			page = pud_page(pud);
+
+			flags |= PM_PRESENT;
+			if (pud_soft_dirty(pud))
+				flags |= PM_SOFT_DIRTY;
+			if (pm->show_pfn)
+				frame = pud_pfn(pud) +
+					((addr & ~PUD_MASK) >> PAGE_SHIFT);
+		}
+
+		if (page && page_mapcount(page) == 1)
+			flags |= PM_MMAP_EXCLUSIVE;
+
+		for (; addr != end; addr += PAGE_SIZE) {
+			pagemap_entry_t pme = make_pme(frame, flags);
+
+			err = add_to_pagemap(addr, &pme, pm);
+			if (err)
+				break;
+			if (pm->show_pfn) {
+				if (flags & PM_PRESENT)
+					frame++;
+				else if (flags & PM_SWAP)
+					frame += (1 << MAX_SWAPFILES_SHIFT);
+			}
+		}
+		spin_unlock(ptl);
+		walk->action = ACTION_CONTINUE;
+		return err;
+	}
+
+	if (pud_trans_unstable(pudp)) {
+		walk->action = ACTION_AGAIN;
+		return 0;
+	}
+#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
+	return err;
+}
+
 #ifdef CONFIG_HUGETLB_PAGE
 /* This function walks within one hugetlb entry in the single call */
 static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
@@ -1607,6 +1665,7 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
 #endif /* HUGETLB_PAGE */
 
 static const struct mm_walk_ops pagemap_ops = {
+	.pud_entry	= pagemap_pud_range,
 	.pmd_entry	= pagemap_pmd_range,
 	.pte_hole	= pagemap_pte_hole,
 	.hugetlb_entry	= pagemap_hugetlb_range,
-- 
2.28.0



^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [RFC PATCH 13/16] mm: thp: add a knob to enable/disable 1GB THPs.
  2020-09-02 18:06 [RFC PATCH 00/16] 1GB THP support on x86_64 Zi Yan
                   ` (11 preceding siblings ...)
  2020-09-02 18:06 ` [RFC PATCH 12/16] mm: support 1GB THP pagemap support Zi Yan
@ 2020-09-02 18:06 ` Zi Yan
  2020-09-02 18:06 ` [RFC PATCH 14/16] mm: page_alloc: >=MAX_ORDER pages allocation an deallocation Zi Yan
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 82+ messages in thread
From: Zi Yan @ 2020-09-02 18:06 UTC (permalink / raw)
  To: linux-mm, Roman Gushchin
  Cc: Rik van Riel, Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel, Zi Yan

From: Zi Yan <ziy@nvidia.com>

The knob only controls whether new 1GB THPs can be created; it does not
affect existing 1GB THPs. It works like the existing enabled knob for
2MB THPs.
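
A hypothetical userspace sketch for the "madvise" setting of the new
/sys/kernel/mm/transparent_hugepage/enabled_1gb knob (path taken from the
sysfs attribute added below): MADV_HUGEPAGE sets VM_HUGEPAGE, which is
what transparent_pud_hugepage_enabled() checks. Whether a 1GB THP is
actually faulted in still depends on the CMA reservation added later in
the series.

	#include <sys/mman.h>
	#include <stddef.h>

	#define SZ_1G	(1UL << 30)

	static void *map_1gb_thp_candidate(void)
	{
		/* over-allocate so a 1GB-aligned start can be picked;
		 * the unused head/tail is left mapped for brevity */
		size_t len = 2 * SZ_1G;
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		char *aligned;

		if (p == MAP_FAILED)
			return NULL;
		aligned = (char *)(((unsigned long)p + SZ_1G - 1) & ~(SZ_1G - 1));
		madvise(aligned, SZ_1G, MADV_HUGEPAGE);
		aligned[0] = 1;		/* first touch takes the PUD fault path */
		return aligned;
	}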

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/huge_mm.h | 14 ++++++++++++++
 mm/huge_memory.c        | 40 ++++++++++++++++++++++++++++++++++++++++
 mm/memory.c             |  2 +-
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c7bc40c4a5e2..3bf8d8a09f08 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -119,6 +119,8 @@ enum transparent_hugepage_flag {
 #ifdef CONFIG_DEBUG_VM
 	TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
 #endif
+	TRANSPARENT_PUD_HUGEPAGE_FLAG,
+	TRANSPARENT_PUD_HUGEPAGE_REQ_MADV_FLAG,
 };
 
 struct kobject;
@@ -184,6 +186,18 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
 }
 
 bool transparent_hugepage_enabled(struct vm_area_struct *vma);
+static inline bool transparent_pud_hugepage_enabled(struct vm_area_struct *vma)
+{
+	if (transparent_hugepage_enabled(vma)) {
+		if (transparent_hugepage_flags & (1 << TRANSPARENT_PUD_HUGEPAGE_FLAG))
+			return true;
+		if (transparent_hugepage_flags &
+					(1 << TRANSPARENT_PUD_HUGEPAGE_REQ_MADV_FLAG))
+			return !!(vma->vm_flags & VM_HUGEPAGE);
+	}
+
+	return false;
+}
 
 #define HPAGE_CACHE_INDEX_MASK (HPAGE_PMD_NR - 1)
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e209c2dfc5b7..e1440a13da63 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -49,9 +49,11 @@
 unsigned long transparent_hugepage_flags __read_mostly =
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS
 	(1<<TRANSPARENT_HUGEPAGE_FLAG)|
+	(1<<TRANSPARENT_PUD_HUGEPAGE_FLAG)|
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE_MADVISE
 	(1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)|
+	(1<<TRANSPARENT_PUD_HUGEPAGE_REQ_MADV_FLAG)|
 #endif
 	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG)|
 	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
@@ -199,6 +201,43 @@ static ssize_t enabled_store(struct kobject *kobj,
 static struct kobj_attribute enabled_attr =
 	__ATTR(enabled, 0644, enabled_show, enabled_store);
 
+static ssize_t enabled_1gb_show(struct kobject *kobj,
+			    struct kobj_attribute *attr, char *buf)
+{
+	if (test_bit(TRANSPARENT_PUD_HUGEPAGE_FLAG, &transparent_hugepage_flags))
+		return sprintf(buf, "[always] madvise never\n");
+	else if (test_bit(TRANSPARENT_PUD_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags))
+		return sprintf(buf, "always [madvise] never\n");
+	else
+		return sprintf(buf, "always madvise [never]\n");
+}
+
+static ssize_t enabled_1gb_store(struct kobject *kobj,
+			     struct kobj_attribute *attr,
+			     const char *buf, size_t count)
+{
+	ssize_t ret = count;
+
+	if (!memcmp("always", buf,
+		    min(sizeof("always")-1, count))) {
+		clear_bit(TRANSPARENT_PUD_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+		set_bit(TRANSPARENT_PUD_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+	} else if (!memcmp("madvise", buf,
+			   min(sizeof("madvise")-1, count))) {
+		clear_bit(TRANSPARENT_PUD_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+		set_bit(TRANSPARENT_PUD_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+	} else if (!memcmp("never", buf,
+			   min(sizeof("never")-1, count))) {
+		clear_bit(TRANSPARENT_PUD_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_PUD_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+	} else
+		ret = -EINVAL;
+
+	return ret;
+}
+static struct kobj_attribute enabled_1gb_attr =
+	__ATTR(enabled_1gb, 0644, enabled_1gb_show, enabled_1gb_store);
+
 ssize_t single_hugepage_flag_show(struct kobject *kobj,
 				struct kobj_attribute *attr, char *buf,
 				enum transparent_hugepage_flag flag)
@@ -305,6 +344,7 @@ static struct kobj_attribute hpage_pmd_size_attr =
 
 static struct attribute *hugepage_attr[] = {
 	&enabled_attr.attr,
+	&enabled_1gb_attr.attr,
 	&defrag_attr.attr,
 	&use_zero_page_attr.attr,
 	&hpage_pmd_size_attr.attr,
diff --git a/mm/memory.c b/mm/memory.c
index 184d8eb2d060..518f29a5903e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4305,7 +4305,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 	if (!vmf.pud)
 		return VM_FAULT_OOM;
 retry_pud:
-	if (pud_none(*vmf.pud) && __transparent_hugepage_enabled(vma)) {
+	if (pud_none(*vmf.pud) && transparent_pud_hugepage_enabled(vma)) {
 		ret = create_huge_pud(&vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
-- 
2.28.0



^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [RFC PATCH 14/16] mm: page_alloc: >=MAX_ORDER pages allocation an deallocation.
  2020-09-02 18:06 [RFC PATCH 00/16] 1GB THP support on x86_64 Zi Yan
                   ` (12 preceding siblings ...)
  2020-09-02 18:06 ` [RFC PATCH 13/16] mm: thp: add a knob to enable/disable 1GB THPs Zi Yan
@ 2020-09-02 18:06 ` Zi Yan
  2020-09-02 18:06 ` [RFC PATCH 15/16] hugetlb: cma: move cma reserve function to cma.c Zi Yan
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 82+ messages in thread
From: Zi Yan @ 2020-09-02 18:06 UTC (permalink / raw)
  To: linux-mm, Roman Gushchin
  Cc: Rik van Riel, Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Use alloc_contig_pages() for allocation and destroy_compound_gigantic_page()
plus free_contig_range() for deallocation, so 1GB THPs can be created and
destroyed without changing MAX_ORDER.
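
A condensed sketch of the pairing this patch sets up (kernel context,
simplified from the hunks below; error handling omitted):

	/* allocation side (mempolicy.c): orders above MAX_ORDER bypass
	 * the buddy allocator and come from alloc_contig_pages() */
	page = alloc_contig_pages(1UL << order, gfp, nid, NULL);
	if (page && (gfp & __GFP_COMP))
		prep_compound_page(page, order);

	/* free side (__free_pages_ok()): tear down the compound page and
	 * hand the pfn range back via free_contig_range() */
	if (order >= MAX_ORDER) {
		destroy_compound_gigantic_page(page, order);
		free_contig_range(page_to_pfn(page), 1 << order);
	}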

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/hugetlb.c    | 22 ----------------------
 mm/internal.h   |  2 ++
 mm/mempolicy.c  | 15 ++++++++++++++-
 mm/page_alloc.c | 33 ++++++++++++++++++++++++++++-----
 4 files changed, 44 insertions(+), 28 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4113d7b66fee..d5357778b026 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1211,26 +1211,6 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
 		nr_nodes--)
 
 #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
-static void destroy_compound_gigantic_page(struct page *page,
-					unsigned int order)
-{
-	int i;
-	int nr_pages = 1 << order;
-	struct page *p = page + 1;
-
-	atomic_set(compound_mapcount_ptr(page), 0);
-	if (hpage_pincount_available(page))
-		atomic_set(compound_pincount_ptr(page), 0);
-
-	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
-		clear_compound_head(p);
-		set_page_refcounted(p);
-	}
-
-	set_compound_order(page, 0);
-	__ClearPageHead(page);
-}
-
 static void free_gigantic_page(struct page *page, unsigned int order)
 {
 	/*
@@ -1288,8 +1268,6 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 	return NULL;
 }
 static inline void free_gigantic_page(struct page *page, unsigned int order) { }
-static inline void destroy_compound_gigantic_page(struct page *page,
-						unsigned int order) { }
 #endif
 
 static void update_and_free_page(struct hstate *h, struct page *page)
diff --git a/mm/internal.h b/mm/internal.h
index 10c677655912..520fd9b5e18a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -620,4 +620,6 @@ struct migration_target_control {
 	gfp_t gfp_mask;
 };
 
+void destroy_compound_gigantic_page(struct page *page,
+					unsigned int order);
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index eddbe4e56c73..4bae089e7a89 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2138,7 +2138,12 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
 {
 	struct page *page;
 
-	page = __alloc_pages(gfp, order, nid);
+	if (order > MAX_ORDER) {
+		page = alloc_contig_pages(1UL<<order, gfp, nid, NULL);
+		if (page && (gfp & __GFP_COMP))
+			prep_compound_page(page, order);
+	} else
+		page = __alloc_pages(gfp, order, nid);
 	/* skip NUMA_INTERLEAVE_HIT counter update if numa stats is disabled */
 	if (!static_branch_likely(&vm_numa_stat_key))
 		return page;
@@ -2212,6 +2217,14 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		nmask = policy_nodemask(gfp, pol);
 		if (!nmask || node_isset(hpage_node, *nmask)) {
 			mpol_cond_put(pol);
+
+			if (order > MAX_ORDER) {
+				page = alloc_contig_pages(1UL<<order, gfp,
+							  hpage_node, NULL);
+				if (page && (gfp & __GFP_COMP))
+					prep_compound_page(page, order);
+				goto out;
+			}
 			/*
 			 * First, try to allocate THP only on local node, but
 			 * don't reclaim unnecessarily, just compact.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 97a4c7e4a579..8a8b241508f7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1480,6 +1480,24 @@ void __meminit reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
 	}
 }
 
+void destroy_compound_gigantic_page(struct page *page,
+					unsigned int order)
+{
+	int i;
+	int nr_pages = 1 << order;
+	struct page *p = page + 1;
+
+	atomic_set(compound_mapcount_ptr(page), 0);
+	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
+		clear_compound_head(p);
+		set_page_refcounted(p);
+	}
+
+	set_compound_order(page, 0);
+	__ClearPageHead(page);
+	set_page_refcounted(page);
+}
+
 static void __free_pages_ok(struct page *page, unsigned int order)
 {
 	unsigned long flags;
@@ -1489,11 +1507,16 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	if (!free_pages_prepare(page, order, true))
 		return;
 
-	migratetype = get_pfnblock_migratetype(page, pfn);
-	local_irq_save(flags);
-	__count_vm_events(PGFREE, 1 << order);
-	free_one_page(page_zone(page), page, pfn, order, migratetype);
-	local_irq_restore(flags);
+	if (order >= MAX_ORDER) {
+		destroy_compound_gigantic_page(page, order);
+		free_contig_range(page_to_pfn(page), 1 << order);
+	} else {
+		migratetype = get_pfnblock_migratetype(page, pfn);
+		local_irq_save(flags);
+		__count_vm_events(PGFREE, 1 << order);
+		free_one_page(page_zone(page), page, pfn, order, migratetype);
+		local_irq_restore(flags);
+	}
 }
 
 void __free_pages_core(struct page *page, unsigned int order)
-- 
2.28.0



^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [RFC PATCH 15/16] hugetlb: cma: move cma reserve function to cma.c.
  2020-09-02 18:06 [RFC PATCH 00/16] 1GB THP support on x86_64 Zi Yan
                   ` (13 preceding siblings ...)
  2020-09-02 18:06 ` [RFC PATCH 14/16] mm: page_alloc: >=MAX_ORDER pages allocation an deallocation Zi Yan
@ 2020-09-02 18:06 ` Zi Yan
  2020-09-02 18:06 ` [RFC PATCH 16/16] mm: thp: use cma reservation for pud thp allocation Zi Yan
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 82+ messages in thread
From: Zi Yan @ 2020-09-02 18:06 UTC (permalink / raw)
  To: linux-mm, Roman Gushchin
  Cc: Rik van Riel, Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel, Zi Yan

From: Zi Yan <ziy@nvidia.com>

The CMA reservation code will be used by other allocations, such as the
1GB THP allocation in the upcoming commit.
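
For example, with the cma_reserve() logic added below, booting with
hugepage_cma=4G on a machine with two online nodes reserves up to 2GB of
CMA on each node; on x86 the reservation is done by
hugepage_cma_reserve(PUD_SHIFT - PAGE_SHIFT) when the CPU has
X86_FEATURE_GBPAGES.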

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 .../admin-guide/kernel-parameters.txt         |  2 +-
 arch/arm64/mm/hugetlbpage.c                   |  2 +-
 arch/powerpc/mm/hugetlbpage.c                 |  2 +-
 arch/x86/kernel/setup.c                       |  8 +-
 include/linux/cma.h                           | 15 ++++
 include/linux/hugetlb.h                       | 12 ---
 mm/cma.c                                      | 88 +++++++++++++++++++
 mm/hugetlb.c                                  | 88 ++-----------------
 8 files changed, 118 insertions(+), 99 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 68fee5e034ca..600668ee0ac7 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1507,7 +1507,7 @@
 	hpet_mmap=	[X86, HPET_MMAP] Allow userspace to mmap HPET
 			registers.  Default set by CONFIG_HPET_MMAP_DEFAULT.
 
-	hugetlb_cma=	[HW] The size of a cma area used for allocation
+	hugepage_cma=	[HW] The size of a cma area used for allocation
 			of gigantic hugepages.
 			Format: nn[KMGTPE]
 
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 55ecf6de9ff7..8a3ad7eaae49 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -52,7 +52,7 @@ void __init arm64_hugetlb_cma_reserve(void)
 	 * breaking this assumption.
 	 */
 	WARN_ON(order <= MAX_ORDER);
-	hugetlb_cma_reserve(order);
+	hugepage_cma_reserve(order);
 }
 #endif /* CONFIG_CMA */
 
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 26292544630f..d608e58cb69b 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -699,6 +699,6 @@ void __init gigantic_hugetlb_cma_reserve(void)
 
 	if (order) {
 		VM_WARN_ON(order < MAX_ORDER);
-		hugetlb_cma_reserve(order);
+		hugepage_cma_reserve(order);
 	}
 }
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 52e83ba607b3..93c8fbdff972 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -16,7 +16,7 @@
 #include <linux/pci.h>
 #include <linux/root_dev.h>
 #include <linux/sfi.h>
-#include <linux/hugetlb.h>
+#include <linux/cma.h>
 #include <linux/tboot.h>
 #include <linux/usb/xhci-dbgp.h>
 
@@ -640,7 +640,7 @@ static void __init trim_snb_memory(void)
 	 * already been reserved.
 	 */
 	memblock_reserve(0, 1<<20);
-	
+
 	for (i = 0; i < ARRAY_SIZE(bad_pages); i++) {
 		if (memblock_reserve(bad_pages[i], PAGE_SIZE))
 			printk(KERN_WARNING "failed to reserve 0x%08lx\n",
@@ -732,7 +732,7 @@ static void __init trim_low_memory_range(void)
 {
 	memblock_reserve(0, ALIGN(reserve_low, PAGE_SIZE));
 }
-	
+
 /*
  * Dump out kernel offset information on panic.
  */
@@ -1142,7 +1142,7 @@ void __init setup_arch(char **cmdline_p)
 	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
 
 	if (boot_cpu_has(X86_FEATURE_GBPAGES))
-		hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
+		hugepage_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
 
 	/*
 	 * Reserve memory for crash kernel after SRAT is parsed so that it
diff --git a/include/linux/cma.h b/include/linux/cma.h
index 6ff79fefd01f..abcf7ab712f9 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -47,4 +47,19 @@ extern struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
 extern bool cma_release(struct cma *cma, const struct page *pages, unsigned int count);
 
 extern int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data);
+
+extern void cma_reserve(int min_order, unsigned long requested_size,
+			const char *name, struct cma *cma_struct[MAX_NUMNODES]);
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+extern void __init hugepage_cma_reserve(int order);
+extern void __init hugepage_cma_check(void);
+#else
+static inline void __init hugepage_cma_check(void)
+{
+}
+static inline void __init hugepage_cma_reserve(int order)
+{
+}
+#endif
+
 #endif
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d5cc5f802dd4..087d13a1dc24 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -935,16 +935,4 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
 	return ptl;
 }
 
-#if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
-extern void __init hugetlb_cma_reserve(int order);
-extern void __init hugetlb_cma_check(void);
-#else
-static inline __init void hugetlb_cma_reserve(int order)
-{
-}
-static inline __init void hugetlb_cma_check(void)
-{
-}
-#endif
-
 #endif /* _LINUX_HUGETLB_H */
diff --git a/mm/cma.c b/mm/cma.c
index 7f415d7cda9f..aa3a17d8a191 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -37,6 +37,10 @@
 #include "cma.h"
 
 struct cma cma_areas[MAX_CMA_AREAS];
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+struct cma *hugepage_cma[MAX_NUMNODES];
+#endif
+unsigned long hugepage_cma_size __initdata;
 unsigned cma_area_count;
 static DEFINE_MUTEX(cma_mutex);
 
@@ -541,3 +545,87 @@ int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data)
 
 	return 0;
 }
+
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+/*
+ * cma_reserve() - reserve CMA for gigantic pages on nodes with memory
+ *
+ * must be called after free_area_init() that updates N_MEMORY via node_set_state().
+ * cma_reserve() scans over N_MEMORY nodemask and hence expects the platforms
+ * to have initialized N_MEMORY state.
+ */
+void __init cma_reserve(int min_order, unsigned long requested_size, const char *name,
+		 struct cma *cma_struct[MAX_NUMNODES])
+{
+	unsigned long size, reserved, per_node;
+	int nid;
+
+	if (!requested_size)
+		return;
+
+	if (requested_size < (PAGE_SIZE << min_order)) {
+		pr_warn("%s_cma: cma area should be at least %lu MiB\n",
+			name, (PAGE_SIZE << min_order) / SZ_1M);
+		return;
+	}
+
+	/*
+	 * If 3 GB area is requested on a machine with 4 numa nodes,
+	 * let's allocate 1 GB on first three nodes and ignore the last one.
+	 */
+	per_node = DIV_ROUND_UP(requested_size, nr_online_nodes);
+	pr_info("%s_cma: reserve %lu MiB, up to %lu MiB per node\n",
+		name, requested_size / SZ_1M, per_node / SZ_1M);
+
+	reserved = 0;
+	for_each_node_state(nid, N_ONLINE) {
+		int res;
+		char node_name[20];
+
+		size = min(per_node, requested_size - reserved);
+		size = round_up(size, PAGE_SIZE << min_order);
+
+		snprintf(node_name, 20, "%s%d", name, nid);
+		res = cma_declare_contiguous_nid(0, size, 0,
+						 PAGE_SIZE << min_order,
+						 0, false, node_name,
+						 &cma_struct[nid], nid);
+		if (res) {
+			pr_warn("%s_cma: reservation failed: err %d, node %d",
+				name, res, nid);
+			continue;
+		}
+
+		reserved += size;
+		pr_info("%s_cma: reserved %lu MiB on node %d\n",
+			name, size / SZ_1M, nid);
+
+		if (reserved >= requested_size)
+			break;
+	}
+}
+
+static bool hugepage_cma_reserve_called __initdata;
+
+static int __init cmdline_parse_hugepage_cma(char *p)
+{
+	hugepage_cma_size = memparse(p, &p);
+	return 0;
+}
+
+early_param("hugepage_cma", cmdline_parse_hugepage_cma);
+
+void __init hugepage_cma_reserve(int order)
+{
+	hugepage_cma_reserve_called = true;
+	cma_reserve(order, hugepage_cma_size, "hugepage", hugepage_cma);
+}
+
+void __init hugepage_cma_check(void)
+{
+	if (!hugepage_cma_size || hugepage_cma_reserve_called)
+		return;
+
+	pr_warn("hugepage_cma: the option isn't supported by current arch\n");
+}
+#endif
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d5357778b026..6685cad879d0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -48,9 +48,9 @@ unsigned int default_hstate_idx;
 struct hstate hstates[HUGE_MAX_HSTATE];
 
 #ifdef CONFIG_CMA
-static struct cma *hugetlb_cma[MAX_NUMNODES];
+extern struct cma *hugepage_cma[MAX_NUMNODES];
 #endif
-static unsigned long hugetlb_cma_size __initdata;
+extern unsigned long hugepage_cma_size __initdata;
 
 /*
  * Minimum page order among possible hugepage sizes, set to a proper value
@@ -1218,7 +1218,7 @@ static void free_gigantic_page(struct page *page, unsigned int order)
 	 * cma_release() returns false.
 	 */
 #ifdef CONFIG_CMA
-	if (cma_release(hugetlb_cma[page_to_nid(page)], page, 1 << order))
+	if (cma_release(hugepage_cma[page_to_nid(page)], page, 1 << order))
 		return;
 #endif
 
@@ -1237,10 +1237,10 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 		int node;
 
 		for_each_node_mask(node, *nodemask) {
-			if (!hugetlb_cma[node])
+			if (!hugepage_cma[node])
 				continue;
 
-			page = cma_alloc(hugetlb_cma[node], nr_pages,
+			page = cma_alloc(hugepage_cma[node], nr_pages,
 					 huge_page_order(h), true);
 			if (page)
 				return page;
@@ -2532,8 +2532,8 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 
 	for (i = 0; i < h->max_huge_pages; ++i) {
 		if (hstate_is_gigantic(h)) {
-			if (hugetlb_cma_size) {
-				pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n");
+			if (hugepage_cma_size) {
+				pr_warn_once("HugeTLB: hugepage_cma is enabled, skip boot time allocation\n");
 				break;
 			}
 			if (!alloc_bootmem_huge_page(h))
@@ -3209,7 +3209,7 @@ static int __init hugetlb_init(void)
 		}
 	}
 
-	hugetlb_cma_check();
+	hugepage_cma_check();
 	hugetlb_init_hstates();
 	gather_bootmem_prealloc();
 	report_hugepages();
@@ -5622,75 +5622,3 @@ void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
 		spin_unlock(&hugetlb_lock);
 	}
 }
-
-#ifdef CONFIG_CMA
-static bool cma_reserve_called __initdata;
-
-static int __init cmdline_parse_hugetlb_cma(char *p)
-{
-	hugetlb_cma_size = memparse(p, &p);
-	return 0;
-}
-
-early_param("hugetlb_cma", cmdline_parse_hugetlb_cma);
-
-void __init hugetlb_cma_reserve(int order)
-{
-	unsigned long size, reserved, per_node;
-	int nid;
-
-	cma_reserve_called = true;
-
-	if (!hugetlb_cma_size)
-		return;
-
-	if (hugetlb_cma_size < (PAGE_SIZE << order)) {
-		pr_warn("hugetlb_cma: cma area should be at least %lu MiB\n",
-			(PAGE_SIZE << order) / SZ_1M);
-		return;
-	}
-
-	/*
-	 * If 3 GB area is requested on a machine with 4 numa nodes,
-	 * let's allocate 1 GB on first three nodes and ignore the last one.
-	 */
-	per_node = DIV_ROUND_UP(hugetlb_cma_size, nr_online_nodes);
-	pr_info("hugetlb_cma: reserve %lu MiB, up to %lu MiB per node\n",
-		hugetlb_cma_size / SZ_1M, per_node / SZ_1M);
-
-	reserved = 0;
-	for_each_node_state(nid, N_ONLINE) {
-		int res;
-		char name[20];
-
-		size = min(per_node, hugetlb_cma_size - reserved);
-		size = round_up(size, PAGE_SIZE << order);
-
-		snprintf(name, 20, "hugetlb%d", nid);
-		res = cma_declare_contiguous_nid(0, size, 0, PAGE_SIZE << order,
-						 0, false, name,
-						 &hugetlb_cma[nid], nid);
-		if (res) {
-			pr_warn("hugetlb_cma: reservation failed: err %d, node %d",
-				res, nid);
-			continue;
-		}
-
-		reserved += size;
-		pr_info("hugetlb_cma: reserved %lu MiB on node %d\n",
-			size / SZ_1M, nid);
-
-		if (reserved >= hugetlb_cma_size)
-			break;
-	}
-}
-
-void __init hugetlb_cma_check(void)
-{
-	if (!hugetlb_cma_size || cma_reserve_called)
-		return;
-
-	pr_warn("hugetlb_cma: the option isn't supported by current arch\n");
-}
-
-#endif /* CONFIG_CMA */
-- 
2.28.0



^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [RFC PATCH 16/16] mm: thp: use cma reservation for pud thp allocation.
  2020-09-02 18:06 [RFC PATCH 00/16] 1GB THP support on x86_64 Zi Yan
                   ` (14 preceding siblings ...)
  2020-09-02 18:06 ` [RFC PATCH 15/16] hugetlb: cma: move cma reserve function to cma.c Zi Yan
@ 2020-09-02 18:06 ` Zi Yan
  2020-09-02 18:40 ` [RFC PATCH 00/16] 1GB THP support on x86_64 Jason Gunthorpe
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 82+ messages in thread
From: Zi Yan @ 2020-09-02 18:06 UTC (permalink / raw)
  To: linux-mm, Roman Gushchin
  Cc: Rik van Riel, Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Share the hugepage_cma reservation with hugetlb for PUD THP allocation.
The reserved CMA regions can still be used for movable page allocations.

During a 1GB page split, all subpages are cleared from the CMA bitmap,
since they are no longer part of a 1GB page and will be freed via the
normal page free path instead of cma_release().
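
A minimal sketch of what the two new helpers are assumed to boil down to,
based on the hugetlb cma_alloc()/cma_release() usage moved in the previous
patch (the real definitions are in the huge_memory.c hunks of this patch):

	struct page *alloc_thp_pud_page(int nid)
	{
		/* 1GB worth of base pages from the shared CMA area */
		return cma_alloc(hugepage_cma[nid], HPAGE_PUD_NR,
				 HPAGE_PUD_ORDER, true);
	}

	bool free_thp_pud_page(struct page *page, int order)
	{
		return cma_release(hugepage_cma[page_to_nid(page)],
				   page, 1 << order);
	}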

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/cma.h     |  3 +++
 include/linux/huge_mm.h | 10 ++++++++++
 mm/cma.c                | 31 +++++++++++++++++++++++++++++++
 mm/huge_memory.c        | 30 ++++++++++++++++++++++++++++++
 mm/mempolicy.c          | 12 +++++++++---
 mm/page_alloc.c         |  3 ++-
 6 files changed, 85 insertions(+), 4 deletions(-)

diff --git a/include/linux/cma.h b/include/linux/cma.h
index abcf7ab712f9..b765d19e4052 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -46,6 +46,9 @@ extern struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
 			      bool no_warn);
 extern bool cma_release(struct cma *cma, const struct page *pages, unsigned int count);
 
+extern bool cma_clear_bitmap_if_in_range(struct cma *cma, const struct page *page,
+					unsigned int count);
+
 extern int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data);
 
 extern void cma_reserve(int min_order, unsigned long requested_size,
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3bf8d8a09f08..5a45877055bb 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -24,6 +24,8 @@ extern struct page *follow_trans_huge_pud(struct vm_area_struct *vma,
 					  unsigned long addr,
 					  pud_t *pud,
 					  unsigned int flags);
+extern struct page *alloc_thp_pud_page(int nid);
+extern bool free_thp_pud_page(struct page *page, int order);
 #else
 static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
 {
@@ -43,6 +45,14 @@ struct page *follow_trans_huge_pud(struct vm_area_struct *vma,
 {
 	return NULL;
 }
+static inline struct page *alloc_thp_pud_page(int nid)
+{
+	return NULL;
+}
+static inline bool free_thp_pud_page(struct page *page, int order)
+{
+	return false;
+}
 #endif
 
 extern vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd);
diff --git a/mm/cma.c b/mm/cma.c
index aa3a17d8a191..3f721b8f7ccd 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -532,6 +532,37 @@ bool cma_release(struct cma *cma, const struct page *pages, unsigned int count)
 	return true;
 }
 
+/**
+ * cma_clear_bitmap_if_in_range() - clear the bitmap for a given page range
+ * @cma:   Contiguous memory region for which the allocation was performed.
+ * @pages: Allocated pages.
+ * @count: Number of allocated pages.
+ *
+ * This function clears the CMA bitmap covering memory allocated by cma_alloc().
+ * It returns false when the provided pages do not belong to the contiguous
+ * area, and true otherwise.
+ */
+bool cma_clear_bitmap_if_in_range(struct cma *cma, const struct page *pages,
+				  unsigned int count)
+{
+	unsigned long pfn;
+
+	if (!cma || !pages)
+		return false;
+
+	pfn = page_to_pfn(pages);
+
+	if (pfn < cma->base_pfn || pfn >= cma->base_pfn + cma->count)
+		return false;
+
+	if (pfn + count > cma->base_pfn + cma->count)
+		return false;
+
+	cma_clear_bitmap(cma, pfn, count);
+
+	return true;
+}
+
 int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data)
 {
 	int i;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e1440a13da63..2020b843fd97 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -33,6 +33,7 @@
 #include <linux/oom.h>
 #include <linux/numa.h>
 #include <linux/page_owner.h>
+#include <linux/cma.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -64,6 +65,10 @@ static struct shrinker deferred_split_shrinker;
 static atomic_t huge_zero_refcount;
 struct page *huge_zero_page __read_mostly;
 
+#ifdef CONFIG_CMA
+extern struct cma *hugepage_cma[MAX_NUMNODES];
+#endif
+
 bool transparent_hugepage_enabled(struct vm_area_struct *vma)
 {
 	/* The addr is used to check if the vma size fits */
@@ -2526,6 +2531,13 @@ static void __split_huge_pud_page(struct page *page, struct list_head *list,
 	/* no file-back page support yet */
 	VM_BUG_ON(!PageAnon(page));
 
+	/* Clear the CMA bitmap: split subpages are freed via the normal path. */
+	if (IS_ENABLED(CONFIG_CMA)) {
+		struct cma *cma = hugepage_cma[page_to_nid(head)];
+		bool cleared = cma_clear_bitmap_if_in_range(cma, head, thp_nr_pages(head));
+		VM_BUG_ON(!cleared);
+	}
+
 	for (i = HPAGE_PUD_NR - HPAGE_PMD_NR; i >= 1; i -= HPAGE_PMD_NR) {
 		__split_huge_pud_page_tail(head, i, lruvec, list);
 	}
@@ -3753,3 +3765,21 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 	update_mmu_cache_pmd(vma, address, pvmw->pmd);
 }
 #endif
+
+struct page *alloc_thp_pud_page(int nid)
+{
+	struct page *page = NULL;
+#ifdef CONFIG_CMA
+	page = cma_alloc(hugepage_cma[nid], HPAGE_PUD_NR, HPAGE_PUD_ORDER, true);
+#endif
+	return page;
+}
+
+bool free_thp_pud_page(struct page *page, int order)
+{
+	bool ret = false;
+#ifdef CONFIG_CMA
+	ret = cma_release(hugepage_cma[page_to_nid(page)], page, 1<<order);
+#endif
+	return ret;
+}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4bae089e7a89..82b496922196 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2139,7 +2139,10 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
 	struct page *page;
 
 	if (order > MAX_ORDER) {
-		page = alloc_contig_pages(1UL<<order, gfp, nid, NULL);
+		page = (order == HPAGE_PUD_ORDER) ?
+			alloc_thp_pud_page(nid) : NULL;
+		if (!page)
+			page = alloc_contig_pages(1UL<<order, gfp, nid, NULL);
 		if (page && (gfp & __GFP_COMP))
 			prep_compound_page(page, order);
 	} else
@@ -2219,8 +2222,11 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 			mpol_cond_put(pol);
 
 			if (order > MAX_ORDER) {
-				page = alloc_contig_pages(1UL<<order, gfp,
-							  hpage_node, NULL);
+				page = (order == HPAGE_PUD_ORDER) ?
+					alloc_thp_pud_page(hpage_node) : NULL;
+				if (!page)
+					page = alloc_contig_pages(1UL<<order,
+							gfp, hpage_node, NULL);
 				if (page && (gfp & __GFP_COMP))
 					prep_compound_page(page, order);
 				goto out;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8a8b241508f7..eff307b4dc57 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1509,7 +1509,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 
 	if (order >= MAX_ORDER) {
 		destroy_compound_gigantic_page(page, order);
-		free_contig_range(page_to_pfn(page), 1 << order);
+		if (!free_thp_pud_page(page, order))
+			free_contig_range(page_to_pfn(page), 1 << order);
 	} else {
 		migratetype = get_pfnblock_migratetype(page, pfn);
 		local_irq_save(flags);
-- 
2.28.0



^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-02 18:06 [RFC PATCH 00/16] 1GB THP support on x86_64 Zi Yan
                   ` (15 preceding siblings ...)
  2020-09-02 18:06 ` [RFC PATCH 16/16] mm: thp: use cma reservation for pud thp allocation Zi Yan
@ 2020-09-02 18:40 ` Jason Gunthorpe
  2020-09-02 18:45   ` Zi Yan
  2020-09-03  7:32 ` Michal Hocko
  2020-09-03 14:23 ` Kirill A. Shutemov
  18 siblings, 1 reply; 82+ messages in thread
From: Jason Gunthorpe @ 2020-09-02 18:40 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Roman Gushchin, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Hi all,
> 
> This patchset adds support for 1GB THP on x86_64. It is on top of
> v5.9-rc2-mmots-2020-08-25-21-13.
> 
> 1GB THP is more flexible for reducing translation overhead and increasing the
> performance of applications with large memory footprint without application
> changes compared to hugetlb.
> 
> Design
> =======
> 
> 1GB THP implementation looks similar to exiting THP code except some new designs
> for the additional page table level.
> 
> 1. Page table deposit and withdraw using a new pagechain data structure:
>    instead of one PTE page table page, 1GB THP requires 513 page table pages
>    (one PMD page table page and 512 PTE page table pages) to be deposited
>    at the page allocaiton time, so that we can split the page later. Currently,
>    the page table deposit is using ->lru, thus only one page can be deposited.
>    A new pagechain data structure is added to enable multi-page deposit.
> 
> 2. Triple mapped 1GB THP : 1GB THP can be mapped by a combination of PUD, PMD,
>    and PTE entries. Mixing PUD an PTE mapping can be achieved with existing
>    PageDoubleMap mechanism. To add PMD mapping, PMDPageInPUD and
>    sub_compound_mapcount are introduced. PMDPageInPUD is the 512-aligned base
>    page in a 1GB THP and sub_compound_mapcount counts the PMD mapping by using
>    page[N*512 + 3].compound_mapcount.
> 
> 3. Using CMA allocaiton for 1GB THP: instead of bump MAX_ORDER, it is more sane
>    to use something less intrusive. So all 1GB THPs are allocated from reserved
>    CMA areas shared with hugetlb. At page splitting time, the bitmap for the 1GB
>    THP is cleared as the resulting pages can be freed via normal page free path.
>    We can fall back to alloc_contig_pages for 1GB THP if necessary.
> 
> 
> Patch Organization
> =======
> 
> Patch 01 adds the new pagechain data structure.
> 
> Patch 02 to 13 adds 1GB THP support in variable places.
> 
> Patch 14 tries to use alloc_contig_pages for 1GB THP allocaiton.
> 
> Patch 15 moves hugetlb_cma reservation to cma.c and rename it to hugepage_cma.
> 
> Patch 16 use hugepage_cma reservation for 1GB THP allocation.
> 
> 
> Any suggestions and comments are welcome.
> 
> 
> Zi Yan (16):
>   mm: add pagechain container for storing multiple pages.
>   mm: thp: 1GB anonymous page implementation.
>   mm: proc: add 1GB THP kpageflag.
>   mm: thp: 1GB THP copy on write implementation.
>   mm: thp: handling 1GB THP reference bit.
>   mm: thp: add 1GB THP split_huge_pud_page() function.
>   mm: stats: make smap stats understand PUD THPs.
>   mm: page_vma_walk: teach it about PMD-mapped PUD THP.
>   mm: thp: 1GB THP support in try_to_unmap().
>   mm: thp: split 1GB THPs at page reclaim.
>   mm: thp: 1GB THP follow_p*d_page() support.
>   mm: support 1GB THP pagemap support.
>   mm: thp: add a knob to enable/disable 1GB THPs.
>   mm: page_alloc: >=MAX_ORDER pages allocation an deallocation.
>   hugetlb: cma: move cma reserve function to cma.c.
>   mm: thp: use cma reservation for pud thp allocation.

Surprised this doesn't touch mm/pagewalk.c ?

Jason


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-02 18:40 ` [RFC PATCH 00/16] 1GB THP support on x86_64 Jason Gunthorpe
@ 2020-09-02 18:45   ` Zi Yan
  2020-09-02 18:48     ` Jason Gunthorpe
  0 siblings, 1 reply; 82+ messages in thread
From: Zi Yan @ 2020-09-02 18:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, Roman Gushchin, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3558 bytes --]

On 2 Sep 2020, at 14:40, Jason Gunthorpe wrote:

> On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> Hi all,
>>
>> This patchset adds support for 1GB THP on x86_64. It is on top of
>> v5.9-rc2-mmots-2020-08-25-21-13.
>>
>> 1GB THP is more flexible for reducing translation overhead and increasing the
>> performance of applications with large memory footprint without application
>> changes compared to hugetlb.
>>
>> Design
>> =======
>>
>> 1GB THP implementation looks similar to exiting THP code except some new designs
>> for the additional page table level.
>>
>> 1. Page table deposit and withdraw using a new pagechain data structure:
>>    instead of one PTE page table page, 1GB THP requires 513 page table pages
>>    (one PMD page table page and 512 PTE page table pages) to be deposited
>>    at the page allocaiton time, so that we can split the page later. Currently,
>>    the page table deposit is using ->lru, thus only one page can be deposited.
>>    A new pagechain data structure is added to enable multi-page deposit.
>>
>> 2. Triple mapped 1GB THP : 1GB THP can be mapped by a combination of PUD, PMD,
>>    and PTE entries. Mixing PUD an PTE mapping can be achieved with existing
>>    PageDoubleMap mechanism. To add PMD mapping, PMDPageInPUD and
>>    sub_compound_mapcount are introduced. PMDPageInPUD is the 512-aligned base
>>    page in a 1GB THP and sub_compound_mapcount counts the PMD mapping by using
>>    page[N*512 + 3].compound_mapcount.
>>
>> 3. Using CMA allocaiton for 1GB THP: instead of bump MAX_ORDER, it is more sane
>>    to use something less intrusive. So all 1GB THPs are allocated from reserved
>>    CMA areas shared with hugetlb. At page splitting time, the bitmap for the 1GB
>>    THP is cleared as the resulting pages can be freed via normal page free path.
>>    We can fall back to alloc_contig_pages for 1GB THP if necessary.
>>
>>
>> Patch Organization
>> =======
>>
>> Patch 01 adds the new pagechain data structure.
>>
>> Patch 02 to 13 adds 1GB THP support in variable places.
>>
>> Patch 14 tries to use alloc_contig_pages for 1GB THP allocaiton.
>>
>> Patch 15 moves hugetlb_cma reservation to cma.c and rename it to hugepage_cma.
>>
>> Patch 16 use hugepage_cma reservation for 1GB THP allocation.
>>
>>
>> Any suggestions and comments are welcome.
>>
>>
>> Zi Yan (16):
>>   mm: add pagechain container for storing multiple pages.
>>   mm: thp: 1GB anonymous page implementation.
>>   mm: proc: add 1GB THP kpageflag.
>>   mm: thp: 1GB THP copy on write implementation.
>>   mm: thp: handling 1GB THP reference bit.
>>   mm: thp: add 1GB THP split_huge_pud_page() function.
>>   mm: stats: make smap stats understand PUD THPs.
>>   mm: page_vma_walk: teach it about PMD-mapped PUD THP.
>>   mm: thp: 1GB THP support in try_to_unmap().
>>   mm: thp: split 1GB THPs at page reclaim.
>>   mm: thp: 1GB THP follow_p*d_page() support.
>>   mm: support 1GB THP pagemap support.
>>   mm: thp: add a knob to enable/disable 1GB THPs.
>>   mm: page_alloc: >=MAX_ORDER pages allocation an deallocation.
>>   hugetlb: cma: move cma reserve function to cma.c.
>>   mm: thp: use cma reservation for pud thp allocation.
>
> Surprised this doesn't touch mm/pagewalk.c ?

1GB PUD page support is already present for DAX purposes, so the code is there
in mm/pagewalk.c. I only needed to supply ops->pud_entry when using
the functions in mm/pagewalk.c. :)
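
For reference, a minimal sketch of what supplying ops->pud_entry can look like
(the callback and the counter are invented for illustration; only struct
mm_walk_ops, walk_page_range() and pud_trans_huge() are existing interfaces):

/* Count PUD-mapped THPs in a range; illustrative only. */
static int count_pud_thp(pud_t *pud, unsigned long addr,
			 unsigned long next, struct mm_walk *walk)
{
	unsigned long *nr_pud_thps = walk->private;

	if (pud_trans_huge(*pud))
		(*nr_pud_thps)++;

	return 0;	/* keep walking */
}

static const struct mm_walk_ops pud_thp_walk_ops = {
	.pud_entry	= count_pud_thp,
};

/*
 * Usage, with the mmap lock held for read:
 *	unsigned long nr = 0;
 *	walk_page_range(mm, start, end, &pud_thp_walk_ops, &nr);
 */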

—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-02 18:45   ` Zi Yan
@ 2020-09-02 18:48     ` Jason Gunthorpe
  2020-09-02 19:05       ` Zi Yan
  0 siblings, 1 reply; 82+ messages in thread
From: Jason Gunthorpe @ 2020-09-02 18:48 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Roman Gushchin, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

On Wed, Sep 02, 2020 at 02:45:37PM -0400, Zi Yan wrote:

> > Surprised this doesn't touch mm/pagewalk.c ?
> 
> 1GB PUD page support is present for DAX purpose, so the code is there
> in mm/pagewalk.c already. I only needed to supply ops->pud_entry when using
> the functions in mm/pagewalk.c. :)

Yes, but doesn't this change what is possible under the mmap_sem
without the page table locks?

ie I would expect some thing like pmd_trans_unstable() to be required
as well for lockless walkers. (and I don't think the pmd code is 100%
right either)

Jason



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-02 18:48     ` Jason Gunthorpe
@ 2020-09-02 19:05       ` Zi Yan
  2020-09-02 19:57         ` Jason Gunthorpe
  0 siblings, 1 reply; 82+ messages in thread
From: Zi Yan @ 2020-09-02 19:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, Roman Gushchin, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1481 bytes --]

On 2 Sep 2020, at 14:48, Jason Gunthorpe wrote:

> On Wed, Sep 02, 2020 at 02:45:37PM -0400, Zi Yan wrote:
>
>>> Surprised this doesn't touch mm/pagewalk.c ?
>>
>> 1GB PUD page support is present for DAX purpose, so the code is there
>> in mm/pagewalk.c already. I only needed to supply ops->pud_entry when using
>> the functions in mm/pagewalk.c. :)
>
> Yes, but doesn't this change what is possible under the mmap_sem
> without the page table locks?
>
> ie I would expect some thing like pmd_trans_unstable() to be required
> as well for lockless walkers. (and I don't think the pmd code is 100%
> right either)
>

Right. I missed that. Thanks for pointing it out.
The code would look like this, right?

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e81640d9f177..4fe6ce4a92eb 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -152,10 +152,11 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
                    !(ops->pmd_entry || ops->pte_entry))
                        continue;

-               if (walk->vma)
+               if (walk->vma) {
                        split_huge_pud(walk->vma, pud, addr);
-               if (pud_none(*pud))
-                       goto again;
+                       if (pud_trans_unstable(pud))
+                               goto again;
+               }

                err = walk_pmd_range(pud, addr, next, walk);
                if (err)


—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-02 19:05       ` Zi Yan
@ 2020-09-02 19:57         ` Jason Gunthorpe
  2020-09-02 20:29           ` Zi Yan
  0 siblings, 1 reply; 82+ messages in thread
From: Jason Gunthorpe @ 2020-09-02 19:57 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Roman Gushchin, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

On Wed, Sep 02, 2020 at 03:05:39PM -0400, Zi Yan wrote:
> On 2 Sep 2020, at 14:48, Jason Gunthorpe wrote:
> 
> > On Wed, Sep 02, 2020 at 02:45:37PM -0400, Zi Yan wrote:
> >
> >>> Surprised this doesn't touch mm/pagewalk.c ?
> >>
> >> 1GB PUD page support is present for DAX purpose, so the code is there
> >> in mm/pagewalk.c already. I only needed to supply ops->pud_entry when using
> >> the functions in mm/pagewalk.c. :)
> >
> > Yes, but doesn't this change what is possible under the mmap_sem
> > without the page table locks?
> >
> > ie I would expect some thing like pmd_trans_unstable() to be required
> > as well for lockless walkers. (and I don't think the pmd code is 100%
> > right either)
> >
> 
> Right. I missed that. Thanks for pointing it out.
> The code like this, right?

Technically all those *pud's are racy too; the design here with the
_unstable function call always seemed weird. I strongly suspect it
should mirror how get_user_pages_fast() works for lockless walking.

Jason


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 01/16] mm: add pagechain container for storing multiple pages.
  2020-09-02 18:06 ` [RFC PATCH 01/16] mm: add pagechain container for storing multiple pages Zi Yan
@ 2020-09-02 20:29   ` Randy Dunlap
  2020-09-02 20:48     ` Zi Yan
  2020-09-03  3:15   ` Matthew Wilcox
  2020-09-07 12:22   ` Kirill A. Shutemov
  2 siblings, 1 reply; 82+ messages in thread
From: Randy Dunlap @ 2020-09-02 20:29 UTC (permalink / raw)
  To: Zi Yan, linux-mm, Roman Gushchin
  Cc: Rik van Riel, Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel

On 9/2/20 11:06 AM, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> When depositing page table pages for 1GB THPs, we need 512 PTE pages +
> 1 PMD page. Instead of counting and depositing 513 pages, we can use the
> PMD page as a leader page and chain the rest 512 PTE pages with ->lru.
> This, however, prevents us depositing PMD pages with ->lru, which is
> currently used by depositing PTE pages for 2MB THPs. So add a new
> pagechain container for PMD pages.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  include/linux/pagechain.h | 73 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 73 insertions(+)
>  create mode 100644 include/linux/pagechain.h
> 
> diff --git a/include/linux/pagechain.h b/include/linux/pagechain.h
> new file mode 100644
> index 000000000000..be536142b413
> --- /dev/null
> +++ b/include/linux/pagechain.h
> @@ -0,0 +1,73 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * include/linux/pagechain.h
> + *
> + * In many places it is efficient to batch an operation up against multiple
> + * pages. A pagechain is a multipage container which is used for that.
> + */
> +
> +#ifndef _LINUX_PAGECHAIN_H
> +#define _LINUX_PAGECHAIN_H
> +
> +#include <linux/slab.h>
> +
> +/* 14 pointers + two long's align the pagechain structure to a power of two */
> +#define PAGECHAIN_SIZE	13

OK, I'll bite.  I see neither 14 pointers nor 2 longs below.
Is the comment out of date or am I just confuzed?

Update: struct list_head is 2 pointers, so I see 15 pointers & one unsigned int.
Where are the 2 longs?

> +
> +struct page;
> +
> +struct pagechain {
> +	struct list_head list;
> +	unsigned int nr;
> +	struct page *pages[PAGECHAIN_SIZE];
> +};

thanks.

-- 
~Randy


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-02 19:57         ` Jason Gunthorpe
@ 2020-09-02 20:29           ` Zi Yan
  2020-09-03 16:40             ` Jason Gunthorpe
  0 siblings, 1 reply; 82+ messages in thread
From: Zi Yan @ 2020-09-02 20:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, Roman Gushchin, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1310 bytes --]

On 2 Sep 2020, at 15:57, Jason Gunthorpe wrote:

> On Wed, Sep 02, 2020 at 03:05:39PM -0400, Zi Yan wrote:
>> On 2 Sep 2020, at 14:48, Jason Gunthorpe wrote:
>>
>>> On Wed, Sep 02, 2020 at 02:45:37PM -0400, Zi Yan wrote:
>>>
>>>>> Surprised this doesn't touch mm/pagewalk.c ?
>>>>
>>>> 1GB PUD page support is present for DAX purpose, so the code is there
>>>> in mm/pagewalk.c already. I only needed to supply ops->pud_entry when using
>>>> the functions in mm/pagewalk.c. :)
>>>
>>> Yes, but doesn't this change what is possible under the mmap_sem
>>> without the page table locks?
>>>
>>> ie I would expect some thing like pmd_trans_unstable() to be required
>>> as well for lockless walkers. (and I don't think the pmd code is 100%
>>> right either)
>>>
>>
>> Right. I missed that. Thanks for pointing it out.
>> The code like this, right?
>
> Technically all those *pud's are racy too, the design here with the
> _unstable function call always seemed weird. I strongly suspect it
> should mirror how get_user_pages_fast works for lockless walking

You mean READ_ONCE on the page table entry pointer first, then use the value
for the rest of the loop? I am not quite familiar with this racy-check
part of the code and am happy to hear more about it.


—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 01/16] mm: add pagechain container for storing multiple pages.
  2020-09-02 20:29   ` Randy Dunlap
@ 2020-09-02 20:48     ` Zi Yan
  0 siblings, 0 replies; 82+ messages in thread
From: Zi Yan @ 2020-09-02 20:48 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: linux-mm, Roman Gushchin, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2012 bytes --]

On 2 Sep 2020, at 16:29, Randy Dunlap wrote:

> On 9/2/20 11:06 AM, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> When depositing page table pages for 1GB THPs, we need 512 PTE pages +
>> 1 PMD page. Instead of counting and depositing 513 pages, we can use the
>> PMD page as a leader page and chain the rest 512 PTE pages with ->lru.
>> This, however, prevents us depositing PMD pages with ->lru, which is
>> currently used by depositing PTE pages for 2MB THPs. So add a new
>> pagechain container for PMD pages.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>  include/linux/pagechain.h | 73 +++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 73 insertions(+)
>>  create mode 100644 include/linux/pagechain.h
>>
>> diff --git a/include/linux/pagechain.h b/include/linux/pagechain.h
>> new file mode 100644
>> index 000000000000..be536142b413
>> --- /dev/null
>> +++ b/include/linux/pagechain.h
>> @@ -0,0 +1,73 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/*
>> + * include/linux/pagechain.h
>> + *
>> + * In many places it is efficient to batch an operation up against multiple
>> + * pages. A pagechain is a multipage container which is used for that.
>> + */
>> +
>> +#ifndef _LINUX_PAGECHAIN_H
>> +#define _LINUX_PAGECHAIN_H
>> +
>> +#include <linux/slab.h>
>> +
>> +/* 14 pointers + two long's align the pagechain structure to a power of two */
>> +#define PAGECHAIN_SIZE	13
>
> OK, I'll bite.  I see neither 14 pointers nor 2 longs below.
> Is the comment out of date or am I just confuzed?
>
> Update: struct list_head is 2 pointers, so I see 15 pointers & one unsigned int.
> Where are the 2 longs?

My bad. Will change this to:

/* 15 pointers + one long align the pagechain structure to a power of two */
#define PAGECHAIN_SIZE  13

struct page;

struct pagechain {
    struct list_head list;
    unsigned long nr;
    struct page *pages[PAGECHAIN_SIZE];
};
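
For what it's worth, a compile-time check along these lines (just a sketch,
assuming 64-bit pointers: 16 + 8 + 13 * 8 == 128 bytes) would keep the comment
and the define from drifting apart again:

#include <linux/build_bug.h>

/* 2 list pointers + 1 long + 13 page pointers == 128 bytes on 64-bit. */
static_assert(sizeof(struct pagechain) == 128);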


Thanks for checking.

—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 01/16] mm: add pagechain container for storing multiple pages.
  2020-09-02 18:06 ` [RFC PATCH 01/16] mm: add pagechain container for storing multiple pages Zi Yan
  2020-09-02 20:29   ` Randy Dunlap
@ 2020-09-03  3:15   ` Matthew Wilcox
  2020-09-07 12:22   ` Kirill A. Shutemov
  2 siblings, 0 replies; 82+ messages in thread
From: Matthew Wilcox @ 2020-09-03  3:15 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Roman Gushchin, Rik van Riel, Kirill A . Shutemov,
	Shakeel Butt, Yang Shi, David Nellans, linux-kernel

On Wed, Sep 02, 2020 at 02:06:13PM -0400, Zi Yan wrote:
> When depositing page table pages for 1GB THPs, we need 512 PTE pages +
> 1 PMD page. Instead of counting and depositing 513 pages, we can use the
> PMD page as a leader page and chain the rest 512 PTE pages with ->lru.
> This, however, prevents us depositing PMD pages with ->lru, which is
> currently used by depositing PTE pages for 2MB THPs. So add a new
> pagechain container for PMD pages.

But you've allocated a page for the PMD table.  Why can't you use that
4kB to store pointers to the 512 PTE tables?

You could also use an existing data structure like the XArray (although
not a pagevec).
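
To make the first suggestion concrete, here is a sketch of reusing the deposited
PMD table page as the pointer array (the two helpers are invented for
illustration; page_address() is the only real interface used, and
PTRS_PER_PMD * sizeof(struct page *) is exactly 4kB on 64-bit):

/*
 * While deposited, the PMD table page is not a live page table, so its 4kB
 * can hold the pointers to the 512 deposited PTE table pages directly.
 */
static void deposit_pud_pgtables(struct page *pmd_pgtable,
				 struct page **pte_pgtables)
{
	struct page **slots = page_address(pmd_pgtable);
	int i;

	for (i = 0; i < PTRS_PER_PMD; i++)
		slots[i] = pte_pgtables[i];
}

static struct page *withdraw_pud_pgtable(struct page *pmd_pgtable, int i)
{
	struct page **slots = page_address(pmd_pgtable);

	return slots[i];
}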



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-02 18:06 [RFC PATCH 00/16] 1GB THP support on x86_64 Zi Yan
                   ` (16 preceding siblings ...)
  2020-09-02 18:40 ` [RFC PATCH 00/16] 1GB THP support on x86_64 Jason Gunthorpe
@ 2020-09-03  7:32 ` Michal Hocko
  2020-09-03 16:25   ` Roman Gushchin
  2020-09-03 14:23 ` Kirill A. Shutemov
  18 siblings, 1 reply; 82+ messages in thread
From: Michal Hocko @ 2020-09-03  7:32 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Roman Gushchin, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

On Wed 02-09-20 14:06:12, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Hi all,
> 
> This patchset adds support for 1GB THP on x86_64. It is on top of
> v5.9-rc2-mmots-2020-08-25-21-13.
> 
> 1GB THP is more flexible for reducing translation overhead and increasing the
> performance of applications with large memory footprint without application
> changes compared to hugetlb.

Please be more specific about usecases. This better have some strong
ones because THP code is complex enough already to add on top solely
based on a generic TLB pressure easing.

> Design
> =======
> 
> 1GB THP implementation looks similar to exiting THP code except some new designs
> for the additional page table level.
> 
> 1. Page table deposit and withdraw using a new pagechain data structure:
>    instead of one PTE page table page, 1GB THP requires 513 page table pages
>    (one PMD page table page and 512 PTE page table pages) to be deposited
>    at the page allocaiton time, so that we can split the page later. Currently,
>    the page table deposit is using ->lru, thus only one page can be deposited.
>    A new pagechain data structure is added to enable multi-page deposit.
> 
> 2. Triple mapped 1GB THP : 1GB THP can be mapped by a combination of PUD, PMD,
>    and PTE entries. Mixing PUD an PTE mapping can be achieved with existing
>    PageDoubleMap mechanism. To add PMD mapping, PMDPageInPUD and
>    sub_compound_mapcount are introduced. PMDPageInPUD is the 512-aligned base
>    page in a 1GB THP and sub_compound_mapcount counts the PMD mapping by using
>    page[N*512 + 3].compound_mapcount.
> 
> 3. Using CMA allocaiton for 1GB THP: instead of bump MAX_ORDER, it is more sane
>    to use something less intrusive. So all 1GB THPs are allocated from reserved
>    CMA areas shared with hugetlb. At page splitting time, the bitmap for the 1GB
>    THP is cleared as the resulting pages can be freed via normal page free path.
>    We can fall back to alloc_contig_pages for 1GB THP if necessary.

Do those pages get instantiated during the page fault or only via
khugepaged? This is an important design detail because then we have to
think carefully about how automatic we want this to be. Memory
overhead can be quite large with 2MB THPs already. Also what about the
allocation overhead? Do you have any numbers?

Maybe all these details are described in the patchset but the cover
letter should contain all that information. It doesn't make much sense
to dig into details in a patchset this large without having an idea how
feasible this is.

Thanks.
 
> Patch Organization
> =======
> 
> Patch 01 adds the new pagechain data structure.
> 
> Patch 02 to 13 adds 1GB THP support in variable places.
> 
> Patch 14 tries to use alloc_contig_pages for 1GB THP allocaiton.
> 
> Patch 15 moves hugetlb_cma reservation to cma.c and rename it to hugepage_cma.
> 
> Patch 16 use hugepage_cma reservation for 1GB THP allocation.
> 
> 
> Any suggestions and comments are welcome.
> 
> 
> Zi Yan (16):
>   mm: add pagechain container for storing multiple pages.
>   mm: thp: 1GB anonymous page implementation.
>   mm: proc: add 1GB THP kpageflag.
>   mm: thp: 1GB THP copy on write implementation.
>   mm: thp: handling 1GB THP reference bit.
>   mm: thp: add 1GB THP split_huge_pud_page() function.
>   mm: stats: make smap stats understand PUD THPs.
>   mm: page_vma_walk: teach it about PMD-mapped PUD THP.
>   mm: thp: 1GB THP support in try_to_unmap().
>   mm: thp: split 1GB THPs at page reclaim.
>   mm: thp: 1GB THP follow_p*d_page() support.
>   mm: support 1GB THP pagemap support.
>   mm: thp: add a knob to enable/disable 1GB THPs.
>   mm: page_alloc: >=MAX_ORDER pages allocation an deallocation.
>   hugetlb: cma: move cma reserve function to cma.c.
>   mm: thp: use cma reservation for pud thp allocation.
> 
>  .../admin-guide/kernel-parameters.txt         |   2 +-
>  arch/arm64/mm/hugetlbpage.c                   |   2 +-
>  arch/powerpc/mm/hugetlbpage.c                 |   2 +-
>  arch/x86/include/asm/pgalloc.h                |  68 ++
>  arch/x86/include/asm/pgtable.h                |  26 +
>  arch/x86/kernel/setup.c                       |   8 +-
>  arch/x86/mm/pgtable.c                         |  38 +
>  drivers/base/node.c                           |   3 +
>  fs/proc/meminfo.c                             |   2 +
>  fs/proc/page.c                                |   2 +
>  fs/proc/task_mmu.c                            | 122 ++-
>  include/linux/cma.h                           |  18 +
>  include/linux/huge_mm.h                       |  84 +-
>  include/linux/hugetlb.h                       |  12 -
>  include/linux/memcontrol.h                    |   5 +
>  include/linux/mm.h                            |  29 +-
>  include/linux/mm_types.h                      |   1 +
>  include/linux/mmu_notifier.h                  |  13 +
>  include/linux/mmzone.h                        |   1 +
>  include/linux/page-flags.h                    |  47 +
>  include/linux/pagechain.h                     |  73 ++
>  include/linux/pgtable.h                       |  34 +
>  include/linux/rmap.h                          |  10 +-
>  include/linux/swap.h                          |   2 +
>  include/linux/vm_event_item.h                 |   7 +
>  include/uapi/linux/kernel-page-flags.h        |   2 +
>  kernel/events/uprobes.c                       |   4 +-
>  kernel/fork.c                                 |   5 +
>  mm/cma.c                                      | 119 +++
>  mm/gup.c                                      |  60 +-
>  mm/huge_memory.c                              | 939 +++++++++++++++++-
>  mm/hugetlb.c                                  | 114 +--
>  mm/internal.h                                 |   2 +
>  mm/khugepaged.c                               |   6 +-
>  mm/ksm.c                                      |   4 +-
>  mm/memcontrol.c                               |  13 +
>  mm/memory.c                                   |  51 +-
>  mm/mempolicy.c                                |  21 +-
>  mm/migrate.c                                  |  12 +-
>  mm/page_alloc.c                               |  57 +-
>  mm/page_vma_mapped.c                          | 129 ++-
>  mm/pgtable-generic.c                          |  56 ++
>  mm/rmap.c                                     | 289 ++++--
>  mm/swap.c                                     |  31 +
>  mm/swap_slots.c                               |   2 +
>  mm/swapfile.c                                 |   8 +-
>  mm/userfaultfd.c                              |   2 +-
>  mm/util.c                                     |  16 +-
>  mm/vmscan.c                                   |  58 +-
>  mm/vmstat.c                                   |   8 +
>  50 files changed, 2270 insertions(+), 349 deletions(-)
>  create mode 100644 include/linux/pagechain.h
> 
> --
> 2.28.0
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-02 18:06 [RFC PATCH 00/16] 1GB THP support on x86_64 Zi Yan
                   ` (17 preceding siblings ...)
  2020-09-03  7:32 ` Michal Hocko
@ 2020-09-03 14:23 ` Kirill A. Shutemov
  2020-09-03 16:30   ` Roman Gushchin
  18 siblings, 1 reply; 82+ messages in thread
From: Kirill A. Shutemov @ 2020-09-03 14:23 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Roman Gushchin, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Hi all,
> 
> This patchset adds support for 1GB THP on x86_64. It is on top of
> v5.9-rc2-mmots-2020-08-25-21-13.
> 
> 1GB THP is more flexible for reducing translation overhead and increasing the
> performance of applications with large memory footprint without application
> changes compared to hugetlb.

This statement needs a lot of justification. I don't see 1GB THP as viable
for any workload. Opportunistic 1GB allocation is a very questionable
strategy.

> Design
> =======
> 
> 1GB THP implementation looks similar to exiting THP code except some new designs
> for the additional page table level.
> 
> 1. Page table deposit and withdraw using a new pagechain data structure:
>    instead of one PTE page table page, 1GB THP requires 513 page table pages
>    (one PMD page table page and 512 PTE page table pages) to be deposited
>    at the page allocaiton time, so that we can split the page later. Currently,
>    the page table deposit is using ->lru, thus only one page can be deposited.

False. The current code can deposit an arbitrary number of page tables.

What can be a problem for you is that these page tables are tied to the
struct page of the PMD page table.

>    A new pagechain data structure is added to enable multi-page deposit.
> 
> 2. Triple mapped 1GB THP : 1GB THP can be mapped by a combination of PUD, PMD,
>    and PTE entries. Mixing PUD an PTE mapping can be achieved with existing
>    PageDoubleMap mechanism. To add PMD mapping, PMDPageInPUD and
>    sub_compound_mapcount are introduced. PMDPageInPUD is the 512-aligned base
>    page in a 1GB THP and sub_compound_mapcount counts the PMD mapping by using
>    page[N*512 + 3].compound_mapcount.

I had a hard time reasoning about DoubleMap vs. rmap. Good for you if you
get it right.

> 3. Using CMA allocaiton for 1GB THP: instead of bump MAX_ORDER, it is more sane
>    to use something less intrusive. So all 1GB THPs are allocated from reserved
>    CMA areas shared with hugetlb. At page splitting time, the bitmap for the 1GB
>    THP is cleared as the resulting pages can be freed via normal page free path.
>    We can fall back to alloc_contig_pages for 1GB THP if necessary.
> 

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-03  7:32 ` Michal Hocko
@ 2020-09-03 16:25   ` Roman Gushchin
  2020-09-03 16:50     ` Jason Gunthorpe
                       ` (2 more replies)
  0 siblings, 3 replies; 82+ messages in thread
From: Roman Gushchin @ 2020-09-03 16:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Zi Yan, linux-mm, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

On Thu, Sep 03, 2020 at 09:32:54AM +0200, Michal Hocko wrote:
> On Wed 02-09-20 14:06:12, Zi Yan wrote:
> > From: Zi Yan <ziy@nvidia.com>
> > 
> > Hi all,
> > 
> > This patchset adds support for 1GB THP on x86_64. It is on top of
> > v5.9-rc2-mmots-2020-08-25-21-13.
> > 
> > 1GB THP is more flexible for reducing translation overhead and increasing the
> > performance of applications with large memory footprint without application
> > changes compared to hugetlb.
> 
> Please be more specific about usecases. This better have some strong
> ones because THP code is complex enough already to add on top solely
> based on a generic TLB pressure easing.

Hello, Michal!

We at Facebook are using 1 GB hugetlbfs pages and are getting noticeable
performance wins on some workloads.

Historically we allocated gigantic pages at boot time, but recently moved
to a CMA-based dynamic approach. Still, the hugetlbfs interface requires more
management than we would like to do. 1 GB THP seems to be a better alternative,
so I definitely see it as a very useful feature.

Given the cost of an allocation, I'm slightly skeptical about an automatic
heuristics-based approach, but if an application can explicitly mark target areas
with madvise(), I don't see why it wouldn't work.

In our case we'd like to have a reliable way to get 1 GB THPs at some point
(usually at the start of an application), and transparently destroy them on
the application exit.
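
From the application side the flow could presumably stay as simple as today's
madvise() usage; a userspace sketch (assuming MADV_HUGEPAGE were taught to
cover 1 GB mappings, and that the region must be PUD-aligned):

#include <stdint.h>
#include <sys/mman.h>

#define GB (1UL << 30)

/* Map `size` bytes (a multiple of 1 GB) at a 1 GB-aligned address. */
static void *map_gigantic(size_t size)
{
	size_t len = size + GB;	/* over-allocate so we can align */
	char *raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	uintptr_t start;

	if (raw == MAP_FAILED)
		return NULL;

	start = ((uintptr_t)raw + GB - 1) & ~(GB - 1);
	/* Trim the unaligned head and tail of the over-allocation. */
	if (start > (uintptr_t)raw)
		munmap(raw, start - (uintptr_t)raw);
	if ((uintptr_t)raw + len > start + size)
		munmap((void *)(start + size),
		       (uintptr_t)raw + len - (start + size));

	madvise((void *)start, size, MADV_HUGEPAGE);
	return (void *)start;
}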

Once the patchset is in relatively good shape, I'll be happy to give
it a test in our environment and share results.

Thanks!

> 
> > Design
> > =======
> > 
> > 1GB THP implementation looks similar to exiting THP code except some new designs
> > for the additional page table level.
> > 
> > 1. Page table deposit and withdraw using a new pagechain data structure:
> >    instead of one PTE page table page, 1GB THP requires 513 page table pages
> >    (one PMD page table page and 512 PTE page table pages) to be deposited
> >    at the page allocaiton time, so that we can split the page later. Currently,
> >    the page table deposit is using ->lru, thus only one page can be deposited.
> >    A new pagechain data structure is added to enable multi-page deposit.
> > 
> > 2. Triple mapped 1GB THP : 1GB THP can be mapped by a combination of PUD, PMD,
> >    and PTE entries. Mixing PUD an PTE mapping can be achieved with existing
> >    PageDoubleMap mechanism. To add PMD mapping, PMDPageInPUD and
> >    sub_compound_mapcount are introduced. PMDPageInPUD is the 512-aligned base
> >    page in a 1GB THP and sub_compound_mapcount counts the PMD mapping by using
> >    page[N*512 + 3].compound_mapcount.
> > 
> > 3. Using CMA allocaiton for 1GB THP: instead of bump MAX_ORDER, it is more sane
> >    to use something less intrusive. So all 1GB THPs are allocated from reserved
> >    CMA areas shared with hugetlb. At page splitting time, the bitmap for the 1GB
> >    THP is cleared as the resulting pages can be freed via normal page free path.
> >    We can fall back to alloc_contig_pages for 1GB THP if necessary.
> 
> Do those pages get instantiated during the page fault or only via
> khugepaged? This is an important design detail because then we have to
> think carefully about how much automatic we want this to be. Memory
> overhead can be quite large with 2MB THPs already. Also what about the
> allocation overhead? Do you have any numbers?
> 
> Maybe all these details are described in the patcheset but the cover
> letter should contain all that information. It doesn't make much sense
> to dig into details in a patchset this large without having an idea how
> feasible this is.
> 
> Thanks.
>  
> > Patch Organization
> > =======
> > 
> > Patch 01 adds the new pagechain data structure.
> > 
> > Patch 02 to 13 adds 1GB THP support in variable places.
> > 
> > Patch 14 tries to use alloc_contig_pages for 1GB THP allocaiton.
> > 
> > Patch 15 moves hugetlb_cma reservation to cma.c and rename it to hugepage_cma.
> > 
> > Patch 16 use hugepage_cma reservation for 1GB THP allocation.
> > 
> > 
> > Any suggestions and comments are welcome.
> > 
> > 
> > Zi Yan (16):
> >   mm: add pagechain container for storing multiple pages.
> >   mm: thp: 1GB anonymous page implementation.
> >   mm: proc: add 1GB THP kpageflag.
> >   mm: thp: 1GB THP copy on write implementation.
> >   mm: thp: handling 1GB THP reference bit.
> >   mm: thp: add 1GB THP split_huge_pud_page() function.
> >   mm: stats: make smap stats understand PUD THPs.
> >   mm: page_vma_walk: teach it about PMD-mapped PUD THP.
> >   mm: thp: 1GB THP support in try_to_unmap().
> >   mm: thp: split 1GB THPs at page reclaim.
> >   mm: thp: 1GB THP follow_p*d_page() support.
> >   mm: support 1GB THP pagemap support.
> >   mm: thp: add a knob to enable/disable 1GB THPs.
> >   mm: page_alloc: >=MAX_ORDER pages allocation an deallocation.
> >   hugetlb: cma: move cma reserve function to cma.c.
> >   mm: thp: use cma reservation for pud thp allocation.
> > 
> >  .../admin-guide/kernel-parameters.txt         |   2 +-
> >  arch/arm64/mm/hugetlbpage.c                   |   2 +-
> >  arch/powerpc/mm/hugetlbpage.c                 |   2 +-
> >  arch/x86/include/asm/pgalloc.h                |  68 ++
> >  arch/x86/include/asm/pgtable.h                |  26 +
> >  arch/x86/kernel/setup.c                       |   8 +-
> >  arch/x86/mm/pgtable.c                         |  38 +
> >  drivers/base/node.c                           |   3 +
> >  fs/proc/meminfo.c                             |   2 +
> >  fs/proc/page.c                                |   2 +
> >  fs/proc/task_mmu.c                            | 122 ++-
> >  include/linux/cma.h                           |  18 +
> >  include/linux/huge_mm.h                       |  84 +-
> >  include/linux/hugetlb.h                       |  12 -
> >  include/linux/memcontrol.h                    |   5 +
> >  include/linux/mm.h                            |  29 +-
> >  include/linux/mm_types.h                      |   1 +
> >  include/linux/mmu_notifier.h                  |  13 +
> >  include/linux/mmzone.h                        |   1 +
> >  include/linux/page-flags.h                    |  47 +
> >  include/linux/pagechain.h                     |  73 ++
> >  include/linux/pgtable.h                       |  34 +
> >  include/linux/rmap.h                          |  10 +-
> >  include/linux/swap.h                          |   2 +
> >  include/linux/vm_event_item.h                 |   7 +
> >  include/uapi/linux/kernel-page-flags.h        |   2 +
> >  kernel/events/uprobes.c                       |   4 +-
> >  kernel/fork.c                                 |   5 +
> >  mm/cma.c                                      | 119 +++
> >  mm/gup.c                                      |  60 +-
> >  mm/huge_memory.c                              | 939 +++++++++++++++++-
> >  mm/hugetlb.c                                  | 114 +--
> >  mm/internal.h                                 |   2 +
> >  mm/khugepaged.c                               |   6 +-
> >  mm/ksm.c                                      |   4 +-
> >  mm/memcontrol.c                               |  13 +
> >  mm/memory.c                                   |  51 +-
> >  mm/mempolicy.c                                |  21 +-
> >  mm/migrate.c                                  |  12 +-
> >  mm/page_alloc.c                               |  57 +-
> >  mm/page_vma_mapped.c                          | 129 ++-
> >  mm/pgtable-generic.c                          |  56 ++
> >  mm/rmap.c                                     | 289 ++++--
> >  mm/swap.c                                     |  31 +
> >  mm/swap_slots.c                               |   2 +
> >  mm/swapfile.c                                 |   8 +-
> >  mm/userfaultfd.c                              |   2 +-
> >  mm/util.c                                     |  16 +-
> >  mm/vmscan.c                                   |  58 +-
> >  mm/vmstat.c                                   |   8 +
> >  50 files changed, 2270 insertions(+), 349 deletions(-)
> >  create mode 100644 include/linux/pagechain.h
> > 
> > --
> > 2.28.0
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-03 14:23 ` Kirill A. Shutemov
@ 2020-09-03 16:30   ` Roman Gushchin
  2020-09-08 11:57     ` David Hildenbrand
  0 siblings, 1 reply; 82+ messages in thread
From: Roman Gushchin @ 2020-09-03 16:30 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Zi Yan, linux-mm, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

On Thu, Sep 03, 2020 at 05:23:00PM +0300, Kirill A. Shutemov wrote:
> On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
> > From: Zi Yan <ziy@nvidia.com>
> > 
> > Hi all,
> > 
> > This patchset adds support for 1GB THP on x86_64. It is on top of
> > v5.9-rc2-mmots-2020-08-25-21-13.
> > 
> > 1GB THP is more flexible for reducing translation overhead and increasing the
> > performance of applications with large memory footprint without application
> > changes compared to hugetlb.
> 
> This statement needs a lot of justification. I don't see 1GB THP as viable
> for any workload. Opportunistic 1GB allocation is very questionable
> strategy.

Hello, Kirill!

I share your skepticism about opportunistic 1 GB allocations; however, it might be useful
if backed by madvise() annotations from the userspace application. In this case,
1 GB THPs might be an alternative to 1 GB hugetlbfs pages, but with a more convenient
interface.

Thanks!

> 
> > Design
> > =======
> > 
> > 1GB THP implementation looks similar to exiting THP code except some new designs
> > for the additional page table level.
> > 
> > 1. Page table deposit and withdraw using a new pagechain data structure:
> >    instead of one PTE page table page, 1GB THP requires 513 page table pages
> >    (one PMD page table page and 512 PTE page table pages) to be deposited
> >    at the page allocaiton time, so that we can split the page later. Currently,
> >    the page table deposit is using ->lru, thus only one page can be deposited.
> 
> False. Current code can deposit arbitrary number of page tables.
> 
> What can be problem to you is that these page tables tied to struct page
> of PMD page table.
> 
> >    A new pagechain data structure is added to enable multi-page deposit.
> > 
> > 2. Triple mapped 1GB THP : 1GB THP can be mapped by a combination of PUD, PMD,
> >    and PTE entries. Mixing PUD an PTE mapping can be achieved with existing
> >    PageDoubleMap mechanism. To add PMD mapping, PMDPageInPUD and
> >    sub_compound_mapcount are introduced. PMDPageInPUD is the 512-aligned base
> >    page in a 1GB THP and sub_compound_mapcount counts the PMD mapping by using
> >    page[N*512 + 3].compound_mapcount.
> 
> I had hard time reasoning about DoubleMap vs. rmap. Good for you if you
> get it right.
> 
> > 3. Using CMA allocaiton for 1GB THP: instead of bump MAX_ORDER, it is more sane
> >    to use something less intrusive. So all 1GB THPs are allocated from reserved
> >    CMA areas shared with hugetlb. At page splitting time, the bitmap for the 1GB
> >    THP is cleared as the resulting pages can be freed via normal page free path.
> >    We can fall back to alloc_contig_pages for 1GB THP if necessary.
> > 
> 
> -- 
>  Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-02 20:29           ` Zi Yan
@ 2020-09-03 16:40             ` Jason Gunthorpe
  2020-09-03 16:55               ` Matthew Wilcox
  0 siblings, 1 reply; 82+ messages in thread
From: Jason Gunthorpe @ 2020-09-03 16:40 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Roman Gushchin, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

On Wed, Sep 02, 2020 at 04:29:46PM -0400, Zi Yan wrote:
> On 2 Sep 2020, at 15:57, Jason Gunthorpe wrote:
> 
> > On Wed, Sep 02, 2020 at 03:05:39PM -0400, Zi Yan wrote:
> >> On 2 Sep 2020, at 14:48, Jason Gunthorpe wrote:
> >>
> >>> On Wed, Sep 02, 2020 at 02:45:37PM -0400, Zi Yan wrote:
> >>>
> >>>>> Surprised this doesn't touch mm/pagewalk.c ?
> >>>>
> >>>> 1GB PUD page support is present for DAX purpose, so the code is there
> >>>> in mm/pagewalk.c already. I only needed to supply ops->pud_entry when using
> >>>> the functions in mm/pagewalk.c. :)
> >>>
> >>> Yes, but doesn't this change what is possible under the mmap_sem
> >>> without the page table locks?
> >>>
> >>> ie I would expect some thing like pmd_trans_unstable() to be required
> >>> as well for lockless walkers. (and I don't think the pmd code is 100%
> >>> right either)
> >>>
> >>
> >> Right. I missed that. Thanks for pointing it out.
> >> The code like this, right?
> >
> > Technically all those *pud's are racy too, the design here with the
> > _unstable function call always seemed weird. I strongly suspect it
> > should mirror how get_user_pages_fast works for lockless walking
> 
> You mean READ_ONCE on page table entry pointer first, then use the value
> for the rest of the loop? I am not quite familiar with this racy check
> part of the code and happy to hear more about it.

There are two main issues with the THPs and lockless walks
 
- The *pXX value can change at any time, as THPs can be split at any
  moment. However, once observed to be a sub page table pointer the
  value is fixed under the read side of the mmap lock (I think, I never
  did find the code path supporting this, but everything is busted if
  it isn't true...)
 
- Reading the *pXX without load tearing is difficult on 32 bit arches

So if you do READ_ONCE() it defeats the first problem.

However, if sizeof(*pXX) is 8 on a 32 bit platform then load
tearing is a problem. At least the various pXX_*() test functions
operate on a single 32 bit word so they don't tear, but to convert the
*pXX to a lower level page table pointer a coherent, untorn read is
required.

So, looking again, I remember now, I could never quite figure out why
gup_pmd_range() was safe to do:

                pmd_t pmd = READ_ONCE(*pmdp);
[..]
                } else if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
[..]
        ptem = ptep = pte_offset_map(&pmd, addr);

As I don't see what prevents load tearing a 64 bit pmd.. Eg no
pmd_trans_unstable() or equivalent here.

But we see gup_get_pte() using an anti-load tearing technique.. 
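
(For anyone following along, the technique referred to is roughly the
following, paraphrased from the 32-bit PAE path: read the low half, then the
high half, then confirm the low half did not change in between.)

static inline pte_t gup_get_pte(pte_t *ptep)
{
	pte_t pte;

	do {
		pte.pte_low = ptep->pte_low;
		smp_rmb();
		pte.pte_high = ptep->pte_high;
		smp_rmb();
	} while (unlikely(pte.pte_low != ptep->pte_low));

	return pte;
}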

Jason


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-03 16:25   ` Roman Gushchin
@ 2020-09-03 16:50     ` Jason Gunthorpe
  2020-09-03 17:01       ` Matthew Wilcox
  2020-09-03 20:57     ` Mike Kravetz
  2020-09-04  7:42     ` Michal Hocko
  2 siblings, 1 reply; 82+ messages in thread
From: Jason Gunthorpe @ 2020-09-03 16:50 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Hocko, Zi Yan, linux-mm, Rik van Riel,
	Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt, Yang Shi,
	David Nellans, linux-kernel

On Thu, Sep 03, 2020 at 09:25:27AM -0700, Roman Gushchin wrote:
> On Thu, Sep 03, 2020 at 09:32:54AM +0200, Michal Hocko wrote:
> > On Wed 02-09-20 14:06:12, Zi Yan wrote:
> > > From: Zi Yan <ziy@nvidia.com>
> > > 
> > > Hi all,
> > > 
> > > This patchset adds support for 1GB THP on x86_64. It is on top of
> > > v5.9-rc2-mmots-2020-08-25-21-13.
> > > 
> > > 1GB THP is more flexible for reducing translation overhead and increasing the
> > > performance of applications with large memory footprint without application
> > > changes compared to hugetlb.
> > 
> > Please be more specific about usecases. This better have some strong
> > ones because THP code is complex enough already to add on top solely
> > based on a generic TLB pressure easing.
> 
> Hello, Michal!
> 
> We at Facebook are using 1 GB hugetlbfs pages and are getting noticeable
> performance wins on some workloads.

At least from an RDMA NIC perspective, I've heard from a lot of users
that higher-order pages at the DMA level are giving big speed-ups too.

It is basically the same dynamic as the CPU TLB, except that missing a 'TLB'
cache entry in a PCI-E device is dramatically more expensive to refill. With
200G and soon 400G networking these misses are a growing problem.

With HPC nodes now pushing 1TB of actual physical RAM and single
applications basically using all of it, there is definitely some
meaningful return - if pages can be made reliably available.

At least for HPC where the node returns to an idle state after each
job and most of the 1TB memory becomes freed up again, it seems more
believable to me that a large cache of 1G pages could be available?

Even triggering some kind of cleaner between jobs to defragment could
be a reasonable approach..

Jason


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-03 16:40             ` Jason Gunthorpe
@ 2020-09-03 16:55               ` Matthew Wilcox
  2020-09-03 17:08                 ` Jason Gunthorpe
  0 siblings, 1 reply; 82+ messages in thread
From: Matthew Wilcox @ 2020-09-03 16:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Zi Yan, linux-mm, Roman Gushchin, Rik van Riel,
	Kirill A . Shutemov, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

On Thu, Sep 03, 2020 at 01:40:32PM -0300, Jason Gunthorpe wrote:
> However if the sizeof(*pXX) is 8 on a 32 bit platform then load
> tearing is a problem. At lest the various pXX_*() test functions
> operate on a single 32 bit word so don't tear, but to to convert the
> *pXX to a lower level page table pointer a coherent, untorn, read is
> required.
> 
> So, looking again, I remember now, I could never quite figure out why
> gup_pmd_range() was safe to do:
> 
>                 pmd_t pmd = READ_ONCE(*pmdp);
> [..]
>                 } else if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
> [..]
>         ptem = ptep = pte_offset_map(&pmd, addr);
> 
> As I don't see what prevents load tearing a 64 bit pmd.. Eg no
> pmd_trans_unstable() or equivalent here.

I don't think there are any 32-bit page tables which support a PUD-sized
page.  Pretty sure x86 doesn't until you get to 4- or 5- level page tables
(which need you to be running in 64-bit mode).  There's not much utility
in having 1GB of your 3GB process address space taken up by a single page.

I'm OK if there are some oddball architectures which support it, but
Linux doesn't.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-03 16:50     ` Jason Gunthorpe
@ 2020-09-03 17:01       ` Matthew Wilcox
  2020-09-03 17:18         ` Jason Gunthorpe
  0 siblings, 1 reply; 82+ messages in thread
From: Matthew Wilcox @ 2020-09-03 17:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Roman Gushchin, Michal Hocko, Zi Yan, linux-mm, Rik van Riel,
	Kirill A . Shutemov, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

On Thu, Sep 03, 2020 at 01:50:51PM -0300, Jason Gunthorpe wrote:
> At least from a RDMA NIC perspective I've heard from a lot of users
> that higher order pages at the DMA level is giving big speed ups too.
> 
> It is basically the same dynamic as CPU TLB, except missing a 'TLB'
> cache in a PCI-E device is dramatically more expensive to refill. With
> 200G and soon 400G networking these misses are a growing problem.
> 
> With HPC nodes now pushing 1TB of actual physical RAM and single
> applications basically using all of it, there is definitely some
> meaningful return - if pages can be reliably available.
> 
> At least for HPC where the node returns to an idle state after each
> job and most of the 1TB memory becomes freed up again, it seems more
> believable to me that a large cache of 1G pages could be available?

You may be interested in trying out my current THP patchset:

http://git.infradead.org/users/willy/pagecache.git

It doesn't allocate pages larger than PMD size, but it does allocate pages
*up to* PMD size for the page cache, which means that larger pages are
easier to create because memory isn't fragmented all over the system.

If someone wants to opportunistically allocate pages larger than PMD
size, I've put some preliminary support in for that, but I've never
tested any of it.  That's not my goal at the moment.

I'm not clear whether these HPC users primarily use page cache or
anonymous memory (with O_DIRECT).  Probably a mixture.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-03 16:55               ` Matthew Wilcox
@ 2020-09-03 17:08                 ` Jason Gunthorpe
  0 siblings, 0 replies; 82+ messages in thread
From: Jason Gunthorpe @ 2020-09-03 17:08 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Zi Yan, linux-mm, Roman Gushchin, Rik van Riel,
	Kirill A . Shutemov, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

On Thu, Sep 03, 2020 at 05:55:59PM +0100, Matthew Wilcox wrote:
> On Thu, Sep 03, 2020 at 01:40:32PM -0300, Jason Gunthorpe wrote:
> > However if the sizeof(*pXX) is 8 on a 32 bit platform then load
> > tearing is a problem. At least the various pXX_*() test functions
> > operate on a single 32 bit word so don't tear, but to convert the
> > *pXX to a lower level page table pointer a coherent, untorn, read is
> > required.
> > 
> > So, looking again, I remember now, I could never quite figure out why
> > gup_pmd_range() was safe to do:
> > 
> >                 pmd_t pmd = READ_ONCE(*pmdp);
> > [..]
> >                 } else if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
> > [..]
> >         ptem = ptep = pte_offset_map(&pmd, addr);
> > 
> > As I don't see what prevents load tearing a 64 bit pmd.. Eg no
> > pmd_trans_unstable() or equivalent here.
> 
> I don't think there are any 32-bit page tables which support a PUD-sized
> page.  Pretty sure x86 doesn't until you get to 4- or 5- level page tables
> (which need you to be running in 64-bit mode).  There's not much utility
> in having 1GB of your 3GB process address space taken up by a single page.

Makes sense for PUD, but why is the above GUP code OK for PMD?
pmd_trans_unstable() exists specifically to close read tearing races,
so it looks like a real problem?

> I'm OK if there are some oddball architectures which support it, but
> Linux doesn't.

So, based on that observation, I think something approximately like
this is needed for the page walker for PUD: (this has been on my
backlog to return to these patches..)

From 00a361ecb2d9e1226600d9e78e6e1803a886f2d6 Mon Sep 17 00:00:00 2001
From: Jason Gunthorpe <jgg@mellanox.com>
Date: Fri, 13 Mar 2020 13:15:36 -0300
Subject: [RFC] mm/pagewalk: use READ_ONCE when reading the PUD entry
 unlocked

The pagewalker runs while only holding the mmap_sem for read. The pud can
be set asynchronously, while also holding the mmap_sem for read

eg from:

 handle_mm_fault()
  __handle_mm_fault()
   create_huge_pmd()
    dev_dax_huge_fault()
     __dev_dax_pud_fault()
      vmf_insert_pfn_pud()
       insert_pfn_pud()
        pud_lock()
        set_pud_at()

At least x86 sets the PUD using WRITE_ONCE(), so an unlocked read of
unstable data should be paired to use READ_ONCE().

For the pagewalker to work locklessly the PUD must work similarly to the
PMD: once the PUD entry becomes a pointer to a PMD, it must be stable, and
safe to pass to pmd_offset()

Passing the value from READ_ONCE into the callbacks prevents the callers
from seeing inconsistencies after they re-read, such as seeing pud_none().

If a callback does obtain the pud_lock then it should trigger ACTION_AGAIN
if a data race caused the original value to change.

Use the same pattern as gup_pmd_range() and pass in the address of the
local READ_ONCE stack variable to pmd_offset() to avoid reading it again.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 include/linux/pagewalk.h   |  2 +-
 mm/hmm.c                   | 16 +++++++---------
 mm/mapping_dirty_helpers.c |  6 ++----
 mm/pagewalk.c              | 28 ++++++++++++++++------------
 mm/ptdump.c                |  3 +--
 5 files changed, 27 insertions(+), 28 deletions(-)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index b1cb6b753abb53..6caf28aadafbff 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -39,7 +39,7 @@ struct mm_walk_ops {
 			 unsigned long next, struct mm_walk *walk);
 	int (*p4d_entry)(p4d_t *p4d, unsigned long addr,
 			 unsigned long next, struct mm_walk *walk);
-	int (*pud_entry)(pud_t *pud, unsigned long addr,
+	int (*pud_entry)(pud_t pud, pud_t *pudp, unsigned long addr,
 			 unsigned long next, struct mm_walk *walk);
 	int (*pmd_entry)(pmd_t *pmd, unsigned long addr,
 			 unsigned long next, struct mm_walk *walk);
diff --git a/mm/hmm.c b/mm/hmm.c
index 6d9da4b0f0a9f8..98ced96421b913 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -459,28 +459,26 @@ static inline uint64_t pud_to_hmm_pfn_flags(struct hmm_range *range, pud_t pud)
 				range->flags[HMM_PFN_VALID];
 }
 
-static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
-		struct mm_walk *walk)
+static int hmm_vma_walk_pud(pud_t pud, pud_t *pudp, unsigned long start,
+			    unsigned long end, struct mm_walk *walk)
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
 	unsigned long addr = start;
-	pud_t pud;
 	int ret = 0;
 	spinlock_t *ptl = pud_trans_huge_lock(pudp, walk->vma);
 
 	if (!ptl)
 		return 0;
+	if (memcmp(pudp, &pud, sizeof(pud)) != 0) {
+		walk->action = ACTION_AGAIN;
+		spin_unlock(ptl);
+		return 0;
+	}
 
 	/* Normally we don't want to split the huge page */
 	walk->action = ACTION_CONTINUE;
 
-	pud = READ_ONCE(*pudp);
-	if (pud_none(pud)) {
-		spin_unlock(ptl);
-		return hmm_vma_walk_hole(start, end, -1, walk);
-	}
-
 	if (pud_huge(pud) && pud_devmap(pud)) {
 		unsigned long i, npages, pfn;
 		uint64_t *pfns, cpu_flags;
diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c
index 71070dda9643d4..8943c2509ec0f7 100644
--- a/mm/mapping_dirty_helpers.c
+++ b/mm/mapping_dirty_helpers.c
@@ -125,12 +125,10 @@ static int wp_clean_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long end,
 }
 
 /* wp_clean_pud_entry - The pagewalk pud callback. */
-static int wp_clean_pud_entry(pud_t *pud, unsigned long addr, unsigned long end,
-			      struct mm_walk *walk)
+static int wp_clean_pud_entry(pud_t pudval, pud_t *pudp, unsigned long addr,
+			      unsigned long end, struct mm_walk *walk)
 {
 	/* Dirty-tracking should be handled on the pte level */
-	pud_t pudval = READ_ONCE(*pud);
-
 	if (pud_trans_huge(pudval) || pud_devmap(pudval))
 		WARN_ON(pud_write(pudval) || pud_dirty(pudval));
 
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 928df1638c30d1..cf99536cec23be 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -58,7 +58,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	return err;
 }
 
-static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
+static int walk_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 			  struct mm_walk *walk)
 {
 	pmd_t *pmd;
@@ -67,7 +67,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	int err = 0;
 	int depth = real_depth(3);
 
-	pmd = pmd_offset(pud, addr);
+	pmd = pmd_offset(&pud, addr);
 	do {
 again:
 		next = pmd_addr_end(addr, end);
@@ -119,17 +119,19 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 			  struct mm_walk *walk)
 {
-	pud_t *pud;
+	pud_t *pudp;
+	pud_t pud;
 	unsigned long next;
 	const struct mm_walk_ops *ops = walk->ops;
 	int err = 0;
 	int depth = real_depth(2);
 
-	pud = pud_offset(p4d, addr);
+	pudp = pud_offset(p4d, addr);
 	do {
  again:
+		pud = READ_ONCE(*pudp);
 		next = pud_addr_end(addr, end);
-		if (pud_none(*pud) || (!walk->vma && !walk->no_vma)) {
+		if (pud_none(pud) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
 			if (err)
@@ -140,27 +142,29 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 		walk->action = ACTION_SUBTREE;
 
 		if (ops->pud_entry)
-			err = ops->pud_entry(pud, addr, next, walk);
+			err = ops->pud_entry(pud, pudp, addr, next, walk);
 		if (err)
 			break;
 
 		if (walk->action == ACTION_AGAIN)
 			goto again;
 
-		if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
+		if ((!walk->vma && (pud_leaf(pud) || !pud_present(pud))) ||
 		    walk->action == ACTION_CONTINUE ||
 		    !(ops->pmd_entry || ops->pte_entry))
 			continue;
 
-		if (walk->vma)
-			split_huge_pud(walk->vma, pud, addr);
-		if (pud_none(*pud))
-			goto again;
+		if (walk->vma) {
+			split_huge_pud(walk->vma, pudp, addr);
+			pud = READ_ONCE(*pudp);
+			if (pud_none(pud))
+				goto again;
+		}
 
 		err = walk_pmd_range(pud, addr, next, walk);
 		if (err)
 			break;
-	} while (pud++, addr = next, addr != end);
+	} while (pudp++, addr = next, addr != end);
 
 	return err;
 }
diff --git a/mm/ptdump.c b/mm/ptdump.c
index 26208d0d03b7a9..c5e1717671e36a 100644
--- a/mm/ptdump.c
+++ b/mm/ptdump.c
@@ -59,11 +59,10 @@ static int ptdump_p4d_entry(p4d_t *p4d, unsigned long addr,
 	return 0;
 }
 
-static int ptdump_pud_entry(pud_t *pud, unsigned long addr,
+static int ptdump_pud_entry(pud_t val, pud_t *pudp, unsigned long addr,
 			    unsigned long next, struct mm_walk *walk)
 {
 	struct ptdump_state *st = walk->private;
-	pud_t val = READ_ONCE(*pud);
 
 #if CONFIG_PGTABLE_LEVELS > 2 && defined(CONFIG_KASAN)
 	if (pud_page(val) == virt_to_page(lm_alias(kasan_early_shadow_pmd)))
-- 
2.28.0



^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-03 17:01       ` Matthew Wilcox
@ 2020-09-03 17:18         ` Jason Gunthorpe
  0 siblings, 0 replies; 82+ messages in thread
From: Jason Gunthorpe @ 2020-09-03 17:18 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Roman Gushchin, Michal Hocko, Zi Yan, linux-mm, Rik van Riel,
	Kirill A . Shutemov, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

On Thu, Sep 03, 2020 at 06:01:57PM +0100, Matthew Wilcox wrote:
> On Thu, Sep 03, 2020 at 01:50:51PM -0300, Jason Gunthorpe wrote:
> > At least from a RDMA NIC perspective I've heard from a lot of users
> > that higher order pages at the DMA level is giving big speed ups too.
> > 
> > It is basically the same dynamic as CPU TLB, except missing a 'TLB'
> > cache in a PCI-E device is dramatically more expensive to refill. With
> > 200G and soon 400G networking these misses are a growing problem.
> > 
> > With HPC nodes now pushing 1TB of actual physical RAM and single
> > applications basically using all of it, there is definitely some
> > meaningful return - if pages can be reliably available.
> > 
> > At least for HPC where the node returns to an idle state after each
> > job and most of the 1TB memory becomes freed up again, it seems more
> > believable to me that a large cache of 1G pages could be available?
> 
> You may be interested in trying out my current THP patchset:
> 
> http://git.infradead.org/users/willy/pagecache.git
> 
> It doesn't allocate pages larger than PMD size, but it does allocate pages
> *up to* PMD size for the page cache which means that larger pages are
> easier to create as larger pages aren't fragmented all over the system.

Yeah, I saw that, it looks like a great direction.

> If someone wants to opportunistically allocate pages larger than PMD
> size, I've put some preliminary support in for that, but I've never
> tested any of it.  That's not my goal at the moment.
> 
> I'm not clear whether these HPC users primarily use page cache or
> anonymous memory (with O_DIRECT).  Probably a mixture.

There are definitely HPC systems now that are filesystem-less - they
import data for computation from the network using things like blob
storage or some other kind of non-POSIX userspace based data storage
scheme.

Jason


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-03 16:25   ` Roman Gushchin
  2020-09-03 16:50     ` Jason Gunthorpe
@ 2020-09-03 20:57     ` Mike Kravetz
  2020-09-03 21:06       ` Roman Gushchin
  2020-09-04  7:42     ` Michal Hocko
  2 siblings, 1 reply; 82+ messages in thread
From: Mike Kravetz @ 2020-09-03 20:57 UTC (permalink / raw)
  To: Roman Gushchin, Michal Hocko
  Cc: Zi Yan, linux-mm, Rik van Riel, Kirill A.Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

On 9/3/20 9:25 AM, Roman Gushchin wrote:
> On Thu, Sep 03, 2020 at 09:32:54AM +0200, Michal Hocko wrote:
>> On Wed 02-09-20 14:06:12, Zi Yan wrote:
>>> From: Zi Yan <ziy@nvidia.com>
>>>
>>> Hi all,
>>>
>>> This patchset adds support for 1GB THP on x86_64. It is on top of
>>> v5.9-rc2-mmots-2020-08-25-21-13.
>>>
>>> 1GB THP is more flexible for reducing translation overhead and increasing the
>>> performance of applications with large memory footprint without application
>>> changes compared to hugetlb.
>>
>> Please be more specific about usecases. This better have some strong
>> ones because THP code is complex enough already to add on top solely
>> based on a generic TLB pressure easing.
> 
> Hello, Michal!
> 
> We at Facebook are using 1 GB hugetlbfs pages and are getting noticeable
> performance wins on some workloads.
> 
> Historically we allocated gigantic pages at the boot time, but recently moved
> to cma-based dynamic approach. Still, hugetlbfs interface requires more management
> than we would like to do. 1 GB THP seems to be a better alternative. So I definitely
> see it as a very useful feature.
> 
> Given the cost of an allocation, I'm slightly skeptical about an automatic
> heuristics-based approach, but if an application can explicitly mark target areas
> with madvise(), I don't see why it wouldn't work.
> 
> In our case we'd like to have a reliable way to get 1 GB THPs at some point
> (usually at the start of an application), and transparently destroy them on
> the application exit.

Hi Roman,

In your current use case at Facebook, are you adding 1G hugetlb pages to
the hugetlb pool and then using them within applications?  Or, are you
dynamically allocating them at fault time (hugetlb overcommit/surplus)?

The latency for use of such pages includes:
- Putting together 1G of contiguous memory
- Clearing 1G of memory

In the 'allocation at fault time' mode you incur both costs at fault time.
If using pages from the pool, your only cost at fault time is clearing the
page.
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-03 20:57     ` Mike Kravetz
@ 2020-09-03 21:06       ` Roman Gushchin
  0 siblings, 0 replies; 82+ messages in thread
From: Roman Gushchin @ 2020-09-03 21:06 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Michal Hocko, Zi Yan, linux-mm, Rik van Riel, Kirill A.Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

On Thu, Sep 03, 2020 at 01:57:54PM -0700, Mike Kravetz wrote:
> On 9/3/20 9:25 AM, Roman Gushchin wrote:
> > On Thu, Sep 03, 2020 at 09:32:54AM +0200, Michal Hocko wrote:
> >> On Wed 02-09-20 14:06:12, Zi Yan wrote:
> >>> From: Zi Yan <ziy@nvidia.com>
> >>>
> >>> Hi all,
> >>>
> >>> This patchset adds support for 1GB THP on x86_64. It is on top of
> >>> v5.9-rc2-mmots-2020-08-25-21-13.
> >>>
> >>> 1GB THP is more flexible for reducing translation overhead and increasing the
> >>> performance of applications with large memory footprint without application
> >>> changes compared to hugetlb.
> >>
> >> Please be more specific about usecases. This better have some strong
> >> ones because THP code is complex enough already to add on top solely
> >> based on a generic TLB pressure easing.
> > 
> > Hello, Michal!
> > 
> > We at Facebook are using 1 GB hugetlbfs pages and are getting noticeable
> > performance wins on some workloads.
> > 
> > Historically we allocated gigantic pages at the boot time, but recently moved
> > to cma-based dynamic approach. Still, hugetlbfs interface requires more management
> > than we would like to do. 1 GB THP seems to be a better alternative. So I definitely
> > see it as a very useful feature.
> > 
> > Given the cost of an allocation, I'm slightly skeptical about an automatic
> > heuristics-based approach, but if an application can explicitly mark target areas
> > with madvise(), I don't see why it wouldn't work.
> > 
> > In our case we'd like to have a reliable way to get 1 GB THPs at some point
> > (usually at the start of an application), and transparently destroy them on
> > the application exit.
> 
> Hi Roman,
> 
> In your current use case at Facebook, are you adding 1G hugetlb pages to
> the hugetlb pool and then using them within applications?  Or, are you
> dynamically allocating them at fault time (hugetlb overcommit/surplus)?
> 
> Latency time for use of such pages includes:
> - Putting together 1G contiguous
> - Clearing 1G memory
> 
> In the 'allocation at fault time' mode you incur both costs at fault time.
> If using pages from the pool, your only cost at fault time is clearing the
> page.

Hi Mike,

We're using a pool. By dynamic I mean that gigantic pages are not
allocated at boot time.

Thanks!


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-03 16:25   ` Roman Gushchin
  2020-09-03 16:50     ` Jason Gunthorpe
  2020-09-03 20:57     ` Mike Kravetz
@ 2020-09-04  7:42     ` Michal Hocko
  2020-09-04 21:10       ` Roman Gushchin
  2 siblings, 1 reply; 82+ messages in thread
From: Michal Hocko @ 2020-09-04  7:42 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Zi Yan, linux-mm, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

On Thu 03-09-20 09:25:27, Roman Gushchin wrote:
> On Thu, Sep 03, 2020 at 09:32:54AM +0200, Michal Hocko wrote:
> > On Wed 02-09-20 14:06:12, Zi Yan wrote:
> > > From: Zi Yan <ziy@nvidia.com>
> > > 
> > > Hi all,
> > > 
> > > This patchset adds support for 1GB THP on x86_64. It is on top of
> > > v5.9-rc2-mmots-2020-08-25-21-13.
> > > 
> > > 1GB THP is more flexible for reducing translation overhead and increasing the
> > > performance of applications with large memory footprint without application
> > > changes compared to hugetlb.
> > 
> > Please be more specific about usecases. This better have some strong
> > ones because THP code is complex enough already to add on top solely
> > based on a generic TLB pressure easing.
> 
> Hello, Michal!
> 
> We at Facebook are using 1 GB hugetlbfs pages and are getting noticeable
> performance wins on some workloads.

Let me clarify. I am not questioning 1GB (or large) pages in general. I
believe it is quite clear that there are usecases which hugely benefit
from them.  I am mostly asking for the transparent part of it which
traditionally means that userspace mostly doesn't have to care and get
them. 2MB THPs have established certain expectations, mostly a really
aggressive pro-active instantiation. This has bitten us many times and
created a "you need to disable THP to fix your problem whatever that is"
cargo cult. I hope we do not want to repeat that mistake here again.

> Historically we allocated gigantic pages at the boot time, but recently moved
> to cma-based dynamic approach. Still, hugetlbfs interface requires more management
> than we would like to do. 1 GB THP seems to be a better alternative. So I definitely
> see it as a very useful feature.
> 
> Given the cost of an allocation, I'm slightly skeptical about an automatic
> heuristics-based approach, but if an application can explicitly mark target areas
> with madvise(), I don't see why it wouldn't work.

An explicit opt-in sounds much more appropriate to me as well. If we go
with a specific API then I would not make it 1GB pages specific. Why
cannot we have an explicit interface to "defragment" address space
range into large pages and the kernel would use large pages where
appropriate? Or is the additional copying prohibitively expensive?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-04  7:42     ` Michal Hocko
@ 2020-09-04 21:10       ` Roman Gushchin
  2020-09-07  7:20         ` Michal Hocko
  0 siblings, 1 reply; 82+ messages in thread
From: Roman Gushchin @ 2020-09-04 21:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Zi Yan, linux-mm, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

On Fri, Sep 04, 2020 at 09:42:07AM +0200, Michal Hocko wrote:
> On Thu 03-09-20 09:25:27, Roman Gushchin wrote:
> > On Thu, Sep 03, 2020 at 09:32:54AM +0200, Michal Hocko wrote:
> > > On Wed 02-09-20 14:06:12, Zi Yan wrote:
> > > > From: Zi Yan <ziy@nvidia.com>
> > > > 
> > > > Hi all,
> > > > 
> > > > This patchset adds support for 1GB THP on x86_64. It is on top of
> > > > v5.9-rc2-mmots-2020-08-25-21-13.
> > > > 
> > > > 1GB THP is more flexible for reducing translation overhead and increasing the
> > > > performance of applications with large memory footprint without application
> > > > changes compared to hugetlb.
> > > 
> > > Please be more specific about usecases. This better have some strong
> > > ones because THP code is complex enough already to add on top solely
> > > based on a generic TLB pressure easing.
> > 
> > Hello, Michal!
> > 
> > We at Facebook are using 1 GB hugetlbfs pages and are getting noticeable
> > performance wins on some workloads.
> 
> Let me clarify. I am not questioning 1GB (or large) pages in general. I
> believe it is quite clear that there are usecases which hugely benefit
> from them.  I am mostly asking for the transparent part of it which
> traditionally means that userspace mostly doesn't have to care and get
> them. 2MB THPs have established certain expectations mostly a really    
> aggressive pro-active instanciation. This has bitten us many times and
> create a "you need to disable THP to fix your problem whatever that is"
> cargo cult. I hope we do not want to repeat that mistake here again.

Absolutely, I agree with all above. 1 GB THPs have even fewer chances
to be allocated automatically without hurting overall performance.

I believe that historically the THP allocation success rate and cost were not good
enough for a strict interface; that's why the "best effort" approach was used.
Maybe I'm wrong here. Also, in some cases (e.g. desktop) an opportunistic approach
looks like "it's some perf boost for free". However, in the case of large distributed
systems it's important to get predictable and uniform performance across nodes,
so "maybe some hosts will perform better" does not buy much.

> 
> > Historically we allocated gigantic pages at the boot time, but recently moved
> > to cma-based dynamic approach. Still, hugetlbfs interface requires more management
> > than we would like to do. 1 GB THP seems to be a better alternative. So I definitely
> > see it as a very useful feature.
> > 
> > Given the cost of an allocation, I'm slightly skeptical about an automatic
> > heuristics-based approach, but if an application can explicitly mark target areas
> > with madvise(), I don't see why it wouldn't work.
> 
> An explicit opt-in sounds much more appropriate to me as well. If we go
> with a specific API then I would not make it 1GB pages specific. Why
> cannot we have an explicit interface to "defragment" address space
> range into large pages and the kernel would use large pages where
> appropriate? Or is the additional copying prohibitively expensive?

Can you, please, elaborate a bit more here? It seems like madvise(MADV_HUGEPAGE)
provides something similar to what you're describing, but there are a lot
of details here, so I'm probably missing something.

Thank you!


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-04 21:10       ` Roman Gushchin
@ 2020-09-07  7:20         ` Michal Hocko
  2020-09-08 15:09           ` Zi Yan
  0 siblings, 1 reply; 82+ messages in thread
From: Michal Hocko @ 2020-09-07  7:20 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Zi Yan, linux-mm, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

On Fri 04-09-20 14:10:45, Roman Gushchin wrote:
> On Fri, Sep 04, 2020 at 09:42:07AM +0200, Michal Hocko wrote:
[...]
> > An explicit opt-in sounds much more appropriate to me as well. If we go
> > with a specific API then I would not make it 1GB pages specific. Why
> > cannot we have an explicit interface to "defragment" address space
> > range into large pages and the kernel would use large pages where
> > appropriate? Or is the additional copying prohibitively expensive?
> 
> Can you, please, elaborate a bit more here? It seems like madvise(MADV_HUGEPAGE)
> provides something similar to what you're describing, but there are lot
> of details here, so I'm probably missing something.

MADV_HUGEPAGE is controlling a preference for THP to be used for a
particular address range. So it looks similar but the historical
behavior is to control page faults as well and the behavior depends on
the global setup.

I've had in mind something much simpler. Effectively an API to invoke
khugepaged (like) functionality synchronously from the calling context
on the specific address range. It could be more aggressive than the
regular khugepaged and create even 1G pages (or as large THPs as page
tables can handle on the particular arch for that matter).

As this would be an explicit call we do not have to be worried about
the resulting latency because it would be an explicit call by the
userspace.  The default khugepaged has a harder position there because
has no understanding of the target address space and cannot make any
cost/benefit evaluation so it has to be more conservative.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 01/16] mm: add pagechain container for storing multiple pages.
  2020-09-02 18:06 ` [RFC PATCH 01/16] mm: add pagechain container for storing multiple pages Zi Yan
  2020-09-02 20:29   ` Randy Dunlap
  2020-09-03  3:15   ` Matthew Wilcox
@ 2020-09-07 12:22   ` Kirill A. Shutemov
  2020-09-07 15:11     ` Zi Yan
  2 siblings, 1 reply; 82+ messages in thread
From: Kirill A. Shutemov @ 2020-09-07 12:22 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Roman Gushchin, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

On Wed, Sep 02, 2020 at 02:06:13PM -0400, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> When depositing page table pages for 1GB THPs, we need 512 PTE pages +
> 1 PMD page. Instead of counting and depositing 513 pages, we can use the
> PMD page as a leader page and chain the rest 512 PTE pages with ->lru.
> This, however, prevents us depositing PMD pages with ->lru, which is
> currently used by depositing PTE pages for 2MB THPs. So add a new
> pagechain container for PMD pages.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>

Just deposit it to a linked list in the mm_struct as we do for PMD if
split ptl disabled.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 01/16] mm: add pagechain container for storing multiple pages.
  2020-09-07 12:22   ` Kirill A. Shutemov
@ 2020-09-07 15:11     ` Zi Yan
  2020-09-09 13:46       ` Kirill A. Shutemov
  0 siblings, 1 reply; 82+ messages in thread
From: Zi Yan @ 2020-09-07 15:11 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-mm, Roman Gushchin, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2325 bytes --]

On 7 Sep 2020, at 8:22, Kirill A. Shutemov wrote:

> On Wed, Sep 02, 2020 at 02:06:13PM -0400, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> When depositing page table pages for 1GB THPs, we need 512 PTE pages +
>> 1 PMD page. Instead of counting and depositing 513 pages, we can use the
>> PMD page as a leader page and chain the rest 512 PTE pages with ->lru.
>> This, however, prevents us depositing PMD pages with ->lru, which is
>> currently used by depositing PTE pages for 2MB THPs. So add a new
>> pagechain container for PMD pages.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>
> Just deposit it to a linked list in the mm_struct as we do for PMD if
> split ptl disabled.
>

Thank you for checking the patches. Since we don’t have PUD split lock
yet, I store the PMD page table pages in a newly added linked list head
in mm_struct like you suggested above.

I was too vague about my pagechain design for depositing page table pages
for PUD THPs. Sorry about the confusion. Let me clarify why
I am doing this pagechain here too. I am sure there would be
some other designs and I am happy to change my code.

In my design, I did not store all page table pages in a single list.
I first deposit 512 PTE pages in one PMD page table page’s pmd_huge_pte
using pgtable_trans_huge_deposit(), then deposit the PMD page to
a newly added linked list in mm_struct. Since pmd_huge_pte shares space
with half of lru in struct page, we cannot use lru to link all PMD
pages together. As a result, I added pagechain. Also in this way,
we can avoid these things:

1. when we withdraw the PMD page during PUD THP split, we don’t need
to withdraw 513 pages, set up one PMD page, and then deposit 512 PTE pages
in that PMD page.

2. we don’t mix PMD page table pages and PTE page table pages in a single
list, since they are initialized in different ways. Otherwise, we need
to maintain a subtle rule in the single page table page list that in every
513 pages, first one is PMD page table page and the rest are PTE page
table pages.

As I am typing, I also realize that my current design does not work
when PMD split lock is disabled, so I will fix it. I would store PMD pages
and PTE pages in two separate lists in mm_struct.


Any comments?


—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-03 16:30   ` Roman Gushchin
@ 2020-09-08 11:57     ` David Hildenbrand
  2020-09-08 14:05       ` Zi Yan
  0 siblings, 1 reply; 82+ messages in thread
From: David Hildenbrand @ 2020-09-08 11:57 UTC (permalink / raw)
  To: Roman Gushchin, Kirill A. Shutemov
  Cc: Zi Yan, linux-mm, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

On 03.09.20 18:30, Roman Gushchin wrote:
> On Thu, Sep 03, 2020 at 05:23:00PM +0300, Kirill A. Shutemov wrote:
>> On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
>>> From: Zi Yan <ziy@nvidia.com>
>>>
>>> Hi all,
>>>
>>> This patchset adds support for 1GB THP on x86_64. It is on top of
>>> v5.9-rc2-mmots-2020-08-25-21-13.
>>>
>>> 1GB THP is more flexible for reducing translation overhead and increasing the
>>> performance of applications with large memory footprint without application
>>> changes compared to hugetlb.
>>
>> This statement needs a lot of justification. I don't see 1GB THP as viable
>> for any workload. Opportunistic 1GB allocation is very questionable
>> strategy.
> 
> Hello, Kirill!
> 
> I share your skepticism about opportunistic 1 GB allocations, however it might be useful
> if backed by an madvise() annotations from userspace application. In this case,
> 1 GB THPs might be an alternative to 1 GB hugetlbfs pages, but with a more convenient
> interface.

I have concerns if we would silently use 1~GB THPs in most scenarios
where we would have used 2~MB THP. I'd appreciate a trigger to
explicitly enable that - MADV_HUGEPAGE is not sufficient because some
applications relying on that assume that the THP size will be 2~MB
(especially, if you want sparse, large VMAs).

E.g., read via man page

"This  feature  is  primarily  aimed at applications that use large
mappings of data and access large regions of that memory at a time
(e.g., virtualization systems such as QEMU).  It  can  very  easily
waste  memory (e.g., a 2 MB mapping that only ever accesses 1 byte will
result in 2 MB of wired memory instead of one 4 KB page)."



Having that said, I consider having 1~GB THP - similar to 512~MB THP on
arm64 - useless in most setups and I am not sure if it is worth the
trouble. Just use hugetlbfs for the handful of applications where it
makes sense.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-08 11:57     ` David Hildenbrand
@ 2020-09-08 14:05       ` Zi Yan
  2020-09-08 14:22         ` David Hildenbrand
                           ` (2 more replies)
  0 siblings, 3 replies; 82+ messages in thread
From: Zi Yan @ 2020-09-08 14:05 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Roman Gushchin, Kirill A. Shutemov, linux-mm, Rik van Riel,
	Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt, Yang Shi,
	David Nellans, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1955 bytes --]

On 8 Sep 2020, at 7:57, David Hildenbrand wrote:

> On 03.09.20 18:30, Roman Gushchin wrote:
>> On Thu, Sep 03, 2020 at 05:23:00PM +0300, Kirill A. Shutemov wrote:
>>> On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
>>>> From: Zi Yan <ziy@nvidia.com>
>>>>
>>>> Hi all,
>>>>
>>>> This patchset adds support for 1GB THP on x86_64. It is on top of
>>>> v5.9-rc2-mmots-2020-08-25-21-13.
>>>>
>>>> 1GB THP is more flexible for reducing translation overhead and increasing the
>>>> performance of applications with large memory footprint without application
>>>> changes compared to hugetlb.
>>>
>>> This statement needs a lot of justification. I don't see 1GB THP as viable
>>> for any workload. Opportunistic 1GB allocation is very questionable
>>> strategy.
>>
>> Hello, Kirill!
>>
>> I share your skepticism about opportunistic 1 GB allocations, however it might be useful
>> if backed by an madvise() annotations from userspace application. In this case,
>> 1 GB THPs might be an alternative to 1 GB hugetlbfs pages, but with a more convenient
>> interface.
>
> I have concerns if we would silently use 1~GB THPs in most scenarios
> where be would have used 2~MB THP. I'd appreciate a trigger to
> explicitly enable that - MADV_HUGEPAGE is not sufficient because some
> applications relying on that assume that the THP size will be 2~MB
> (especially, if you want sparse, large VMAs).

This patchset is not intended to silently use 1GB THP in place of 2MB THP.
First of all, there is a knob /sys/kernel/mm/transparent_hugepage/enable_1GB
to enable 1GB THP explicitly. Also, 1GB THP is allocated from a reserved CMA
region (although I had alloc_contig_pages as a fallback, which can be removed
in next version), so users need to add hugepage_cma=nG kernel parameter to
enable 1GB THP allocation. If a finer control is necessary, we can add
a new MADV_HUGEPAGE_1GB for 1GB THP.


—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-08 14:05       ` Zi Yan
@ 2020-09-08 14:22         ` David Hildenbrand
  2020-09-08 15:36           ` Zi Yan
  2020-09-08 14:27         ` Matthew Wilcox
  2020-09-08 14:35         ` Michal Hocko
  2 siblings, 1 reply; 82+ messages in thread
From: David Hildenbrand @ 2020-09-08 14:22 UTC (permalink / raw)
  To: Zi Yan
  Cc: Roman Gushchin, Kirill A. Shutemov, linux-mm, Rik van Riel,
	Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt, Yang Shi,
	David Nellans, linux-kernel

On 08.09.20 16:05, Zi Yan wrote:
> On 8 Sep 2020, at 7:57, David Hildenbrand wrote:
> 
>> On 03.09.20 18:30, Roman Gushchin wrote:
>>> On Thu, Sep 03, 2020 at 05:23:00PM +0300, Kirill A. Shutemov wrote:
>>>> On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
>>>>> From: Zi Yan <ziy@nvidia.com>
>>>>>
>>>>> Hi all,
>>>>>
>>>>> This patchset adds support for 1GB THP on x86_64. It is on top of
>>>>> v5.9-rc2-mmots-2020-08-25-21-13.
>>>>>
>>>>> 1GB THP is more flexible for reducing translation overhead and increasing the
>>>>> performance of applications with large memory footprint without application
>>>>> changes compared to hugetlb.
>>>>
>>>> This statement needs a lot of justification. I don't see 1GB THP as viable
>>>> for any workload. Opportunistic 1GB allocation is very questionable
>>>> strategy.
>>>
>>> Hello, Kirill!
>>>
>>> I share your skepticism about opportunistic 1 GB allocations, however it might be useful
>>> if backed by an madvise() annotations from userspace application. In this case,
>>> 1 GB THPs might be an alternative to 1 GB hugetlbfs pages, but with a more convenient
>>> interface.
>>
>> I have concerns if we would silently use 1~GB THPs in most scenarios
>> where be would have used 2~MB THP. I'd appreciate a trigger to
>> explicitly enable that - MADV_HUGEPAGE is not sufficient because some
>> applications relying on that assume that the THP size will be 2~MB
>> (especially, if you want sparse, large VMAs).
> 
> This patchset is not intended to silently use 1GB THP in place of 2MB THP.
> First of all, there is a knob /sys/kernel/mm/transparent_hugepage/enable_1GB
> to enable 1GB THP explicitly. Also, 1GB THP is allocated from a reserved CMA
> region (although I had alloc_contig_pages as a fallback, which can be removed
> in next version), so users need to add hugepage_cma=nG kernel parameter to
> enable 1GB THP allocation. If a finer control is necessary, we can add
> a new MADV_HUGEPAGE_1GB for 1GB THP.

Thanks for the information - I would have loved to see important
information like that (esp. how to use) in the cover letter.

So what you propose is (excluding alloc_contig_pages()) really just
automatically using (previously reserved) 1GB huge pages as 1GB THP
instead of explicitly using them in an application using hugetlbfs.
Still, not convinced how helpful that actually is - most certainly you
really want a mechanism to control this per application (+ maybe make
the application indicate actual ranges where it makes sense - but then
you can directly modify the application to use hugetlbfs).

I guess the interesting thing of this approach is that we can
mix-and-match THP of differing granularity within a single mapping -
whereby a hugetlbfs allocation would fail in case there aren't sufficient
1GB pages available. However, there are no guarantees for applications
anymore (thinking about RT KVM and similar, we really want gigantic
pages and cannot tolerate falling back to smaller granularity).

What are the intended use cases/applications that could benefit? I doubt
databases and virtualization are really a good fit - they know how to
handle hugetlbfs just fine.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-08 14:05       ` Zi Yan
  2020-09-08 14:22         ` David Hildenbrand
@ 2020-09-08 14:27         ` Matthew Wilcox
  2020-09-08 15:50           ` Zi Yan
  2020-09-09 12:11           ` Jason Gunthorpe
  2020-09-08 14:35         ` Michal Hocko
  2 siblings, 2 replies; 82+ messages in thread
From: Matthew Wilcox @ 2020-09-08 14:27 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, Roman Gushchin, Kirill A. Shutemov, linux-mm,
	Rik van Riel, Kirill A . Shutemov, Shakeel Butt, Yang Shi,
	David Nellans, linux-kernel

On Tue, Sep 08, 2020 at 10:05:11AM -0400, Zi Yan wrote:
> On 8 Sep 2020, at 7:57, David Hildenbrand wrote:
> > I have concerns if we would silently use 1~GB THPs in most scenarios
> > where be would have used 2~MB THP. I'd appreciate a trigger to
> > explicitly enable that - MADV_HUGEPAGE is not sufficient because some
> > applications relying on that assume that the THP size will be 2~MB
> > (especially, if you want sparse, large VMAs).
> 
> This patchset is not intended to silently use 1GB THP in place of 2MB THP.
> First of all, there is a knob /sys/kernel/mm/transparent_hugepage/enable_1GB
> to enable 1GB THP explicitly. Also, 1GB THP is allocated from a reserved CMA
> region (although I had alloc_contig_pages as a fallback, which can be removed
> in next version), so users need to add hugepage_cma=nG kernel parameter to
> enable 1GB THP allocation. If a finer control is necessary, we can add
> a new MADV_HUGEPAGE_1GB for 1GB THP.

I think we do need that flag.  Machines don't run a single workload
(arguably with VMs, we're getting closer to going back to the single
workload per machine, but that's a different matter).  So if there's
one app that wants 2MB pages and one that wants 1GB pages, we need to
be able to distinguish them.

I could also see there being an app which benefits from 1GB for
one mapping and prefers 2MB for a different mapping, so I think the
per-mapping madvise flag is best.

I'm a little wary of encoding the size of an x86 PUD in the Linux API
though.  Probably best to follow the example set in
include/uapi/asm-generic/hugetlb_encode.h, but I don't love it.  I
don't have a better suggestion though.
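
For reference, the pattern hugetlb_encode.h already establishes for
mmap() and memfd_create() is to put log2(page size) into the high bits
of the flags argument; a hypothetical MADV_HUGEPAGE_1GB-style advice
could presumably encode sizes the same way. The snippet below only
exercises today's real mmap() interface (the madvise counterpart being
discussed does not exist) and will fail unless 1GB hugetlb pages are
actually available.

#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)     /* log2(1GB) == 30 */
#endif

int main(void)
{
        size_t len = 1UL << 30;         /* one 1GB page */
        void *p;

        /* The page size is encoded as its log2 in bits 26+ of the flags,
         * the scheme from include/uapi/asm-generic/hugetlb_encode.h. */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                 -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap(MAP_HUGETLB | MAP_HUGE_1GB)");
                return 1;
        }
        printf("mapped %zu bytes from the 1GB hugetlb pool at %p\n", len, p);
        return munmap(p, len);
}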


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-08 14:05       ` Zi Yan
  2020-09-08 14:22         ` David Hildenbrand
  2020-09-08 14:27         ` Matthew Wilcox
@ 2020-09-08 14:35         ` Michal Hocko
  2020-09-08 14:41           ` Rik van Riel
  2 siblings, 1 reply; 82+ messages in thread
From: Michal Hocko @ 2020-09-08 14:35 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, Roman Gushchin, Kirill A. Shutemov, linux-mm,
	Rik van Riel, Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel

On Tue 08-09-20 10:05:11, Zi Yan wrote:
> On 8 Sep 2020, at 7:57, David Hildenbrand wrote:
> 
> > On 03.09.20 18:30, Roman Gushchin wrote:
> >> On Thu, Sep 03, 2020 at 05:23:00PM +0300, Kirill A. Shutemov wrote:
> >>> On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
> >>>> From: Zi Yan <ziy@nvidia.com>
> >>>>
> >>>> Hi all,
> >>>>
> >>>> This patchset adds support for 1GB THP on x86_64. It is on top of
> >>>> v5.9-rc2-mmots-2020-08-25-21-13.
> >>>>
> >>>> 1GB THP is more flexible for reducing translation overhead and increasing the
> >>>> performance of applications with large memory footprint without application
> >>>> changes compared to hugetlb.
> >>>
> >>> This statement needs a lot of justification. I don't see 1GB THP as viable
> >>> for any workload. Opportunistic 1GB allocation is very questionable
> >>> strategy.
> >>
> >> Hello, Kirill!
> >>
> >> I share your skepticism about opportunistic 1 GB allocations, however it might be useful
> >> if backed by an madvise() annotations from userspace application. In this case,
> >> 1 GB THPs might be an alternative to 1 GB hugetlbfs pages, but with a more convenient
> >> interface.
> >
> > I have concerns if we would silently use 1~GB THPs in most scenarios
> > where be would have used 2~MB THP. I'd appreciate a trigger to
> > explicitly enable that - MADV_HUGEPAGE is not sufficient because some
> > applications relying on that assume that the THP size will be 2~MB
> > (especially, if you want sparse, large VMAs).
> 
> This patchset is not intended to silently use 1GB THP in place of 2MB THP.
> First of all, there is a knob /sys/kernel/mm/transparent_hugepage/enable_1GB
> to enable 1GB THP explicitly. Also, 1GB THP is allocated from a reserved CMA
> region (although I had alloc_contig_pages as a fallback, which can be removed
> in next version), so users need to add hugepage_cma=nG kernel parameter to
> enable 1GB THP allocation. If a finer control is necessary, we can add
> a new MADV_HUGEPAGE_1GB for 1GB THP.

A global knob is insufficient. 1G pages will become a very precious
resource as they require pre-allocation (reservation). So it really has
to be an opt-in and the question is whether there is also some sort of
access control needed.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-08 14:35         ` Michal Hocko
@ 2020-09-08 14:41           ` Rik van Riel
  2020-09-08 15:02             ` David Hildenbrand
  2020-09-09  7:04             ` Michal Hocko
  0 siblings, 2 replies; 82+ messages in thread
From: Rik van Riel @ 2020-09-08 14:41 UTC (permalink / raw)
  To: Michal Hocko, Zi Yan
  Cc: David Hildenbrand, Roman Gushchin, Kirill A. Shutemov, linux-mm,
	Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt, Yang Shi,
	David Nellans, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 783 bytes --]

On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:

> A global knob is insufficient. 1G pages will become a very precious
> resource as it requires a pre-allocation (reservation). So it really
> has to be an opt-in and the question is whether there is also some
> sort of access control needed.

The 1GB pages do not require that much in the way of pre-allocation.
The memory can be obtained through CMA, which means it can be used for
movable 4kB and 2MB allocations when not being used for 1GB pages.

That makes it relatively easy to set aside some fraction of system
memory in every system for 1GB and movable allocations, and use it in
whatever way it is needed depending on what workload(s) end up running
on a system.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-08 14:41           ` Rik van Riel
@ 2020-09-08 15:02             ` David Hildenbrand
  2020-09-09  7:04             ` Michal Hocko
  1 sibling, 0 replies; 82+ messages in thread
From: David Hildenbrand @ 2020-09-08 15:02 UTC (permalink / raw)
  To: Rik van Riel, Michal Hocko, Zi Yan
  Cc: Roman Gushchin, Kirill A. Shutemov, linux-mm,
	Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt, Yang Shi,
	David Nellans, linux-kernel, Mike Rapoport

On 08.09.20 16:41, Rik van Riel wrote:
> On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:
> 
>> A global knob is insufficient. 1G pages will become a very precious
>> resource as it requires a pre-allocation (reservation). So it really
>> has to be an opt-in and the question is whether there is also some
>> sort of access control needed.
> 
> The 1GB pages do not require that much in the way of pre-allocation.
> The memory can be obtained through CMA, which means it can be used for
> movable 4kB and 2MB allocations when not being used for 1GB pages.
> 
> That makes it relatively easy to set aside some fraction of system
> memory in every system for 1GB and movable allocations, and use it for
> whatever way it is needed depending on what workload(s) end up running
> on a system.
> 

Linking secretmem discussion

https://lkml.kernel.org/r/fdda6ba7-9418-2b52-eee8-ce5e9bfdb6ad@redhat.com

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-07  7:20         ` Michal Hocko
@ 2020-09-08 15:09           ` Zi Yan
  2020-09-08 19:58             ` Roman Gushchin
  0 siblings, 1 reply; 82+ messages in thread
From: Zi Yan @ 2020-09-08 15:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, linux-mm, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2378 bytes --]

On 7 Sep 2020, at 3:20, Michal Hocko wrote:

> On Fri 04-09-20 14:10:45, Roman Gushchin wrote:
>> On Fri, Sep 04, 2020 at 09:42:07AM +0200, Michal Hocko wrote:
> [...]
>>> An explicit opt-in sounds much more appropriate to me as well. If we go
>>> with a specific API then I would not make it 1GB pages specific. Why
>>> cannot we have an explicit interface to "defragment" address space
>>> range into large pages and the kernel would use large pages where
>>> appropriate? Or is the additional copying prohibitively expensive?
>>
>> Can you, please, elaborate a bit more here? It seems like madvise(MADV_HUGEPAGE)
>> provides something similar to what you're describing, but there are lot
>> of details here, so I'm probably missing something.
>
> MADV_HUGEPAGE is controlling a preference for THP to be used for a
> particular address range. So it looks similar but the historical
> behavior is to control page faults as well and the behavior depends on
> the global setup.
>
> I've had in mind something much simpler. Effectively an API to invoke
> khugepaged (like) functionality synchronously from the calling context
> on the specific address range. It could be more aggressive than the
> regular khugepaged and create even 1G pages (or as large THPs as page
> tables can handle on the particular arch for that matter).
>
> As this would be an explicit call we do not have to be worried about
> the resulting latency because it would be an explicit call by the
> userspace.  The default khugepaged has a harder position there because
> has no understanding of the target address space and cannot make any
> cost/benefit evaluation so it has to be more conservative.

Something like MADV_HUGEPAGE_SYNC? It would be useful, since users have
better and clearer control of getting huge pages from the kernel and
know when they will pay the cost of getting the huge pages.

I would think the suggestion is more that the huge page control options
currently provided by the kernel do not have a predictable performance
outcome: MADV_HUGEPAGE is a best-effort option and does not tell
users whether the marked virtual address range is backed by huge pages
or not when the madvise returns. MADV_HUGEPAGE_SYNC would give users a
deterministic answer on whether the huge page(s) were formed or not.
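
As a purely hypothetical sketch of the calling side (MADV_HUGEPAGE_SYNC
is not a real advice value; the constant and its semantics are made up
here only to illustrate what a synchronous, deterministic interface
would give the application):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_HUGEPAGE_SYNC
#define MADV_HUGEPAGE_SYNC 100  /* made-up value, for illustration only */
#endif

int main(void)
{
        size_t len = 1UL << 30;         /* a 1GB anonymous region */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED)
                return 1;

        /* A synchronous call returns only after attempting the collapse,
         * so the application knows immediately whether the range is now
         * backed by huge pages instead of hoping khugepaged gets to it
         * eventually.  On a current kernel this simply fails with EINVAL. */
        if (madvise(buf, len, MADV_HUGEPAGE_SYNC) != 0)
                fprintf(stderr, "collapse not performed: %s\n", strerror(errno));
        else
                printf("range is now backed by huge page(s)\n");

        munmap(buf, len);
        return 0;
}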

—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-08 14:22         ` David Hildenbrand
@ 2020-09-08 15:36           ` Zi Yan
  0 siblings, 0 replies; 82+ messages in thread
From: Zi Yan @ 2020-09-08 15:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Roman Gushchin, Kirill A. Shutemov, linux-mm, Rik van Riel,
	Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt, Yang Shi,
	David Nellans, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 4314 bytes --]

On 8 Sep 2020, at 10:22, David Hildenbrand wrote:

> On 08.09.20 16:05, Zi Yan wrote:
>> On 8 Sep 2020, at 7:57, David Hildenbrand wrote:
>>
>>> On 03.09.20 18:30, Roman Gushchin wrote:
>>>> On Thu, Sep 03, 2020 at 05:23:00PM +0300, Kirill A. Shutemov wrote:
>>>>> On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
>>>>>> From: Zi Yan <ziy@nvidia.com>
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> This patchset adds support for 1GB THP on x86_64. It is on top of
>>>>>> v5.9-rc2-mmots-2020-08-25-21-13.
>>>>>>
>>>>>> 1GB THP is more flexible for reducing translation overhead and increasing the
>>>>>> performance of applications with large memory footprint without application
>>>>>> changes compared to hugetlb.
>>>>>
>>>>> This statement needs a lot of justification. I don't see 1GB THP as viable
>>>>> for any workload. Opportunistic 1GB allocation is very questionable
>>>>> strategy.
>>>>
>>>> Hello, Kirill!
>>>>
>>>> I share your skepticism about opportunistic 1 GB allocations, however it might be useful
>>>> if backed by an madvise() annotations from userspace application. In this case,
>>>> 1 GB THPs might be an alternative to 1 GB hugetlbfs pages, but with a more convenient
>>>> interface.
>>>
>>> I have concerns if we would silently use 1~GB THPs in most scenarios
>>> where be would have used 2~MB THP. I'd appreciate a trigger to
>>> explicitly enable that - MADV_HUGEPAGE is not sufficient because some
>>> applications relying on that assume that the THP size will be 2~MB
>>> (especially, if you want sparse, large VMAs).
>>
>> This patchset is not intended to silently use 1GB THP in place of 2MB THP.
>> First of all, there is a knob /sys/kernel/mm/transparent_hugepage/enable_1GB
>> to enable 1GB THP explicitly. Also, 1GB THP is allocated from a reserved CMA
>> region (although I had alloc_contig_pages as a fallback, which can be removed
>> in next version), so users need to add hugepage_cma=nG kernel parameter to
>> enable 1GB THP allocation. If a finer control is necessary, we can add
>> a new MADV_HUGEPAGE_1GB for 1GB THP.
>
> Thanks for the information - I would have loved to see important
> information like that (esp. how to use) in the cover letter.
>
> So what you propose is (excluding alloc_contig_pages()) really just
> automatically using (previously reserved) 1GB huge pages as 1GB THP
> instead of explicitly using them in an application using hugetlbfs.
> Still, not convinced how helpful that actually is - most certainly you
> really want a mechanism to control this per application (+ maybe make
> the application indicate actual ranges where it makes sense - but then
> you can directly modify the application to use hugetlbfs).
>
> I guess the interesting thing of this approach is that we can
> mix-and-match THP of differing granularity within a single mapping -
> whereby a hugetlbfs allocation would fail in case there isn't sufficient
> 1GB pages available. However, there are no guarantees for applications
> anymore (thinking about RT KVM and similar, we really want gigantic
> pages and cannot tolerate falling back to smaller granularity).

I agree that currently THP allocation does not provide a strong guarantee
like hugetlbfs, which can pre-allocate pages at boot time. For users like
RT KVM and such, pre-allocated hugetlb might be the only choice, since
allocating huge pages from CMA (either for hugetlb or 1GB THP) would fail
if some pages are pinned and scattered in the CMA region, which could
prevent huge page allocation.

In other cases, if users can tolerate fallbacks but do not like the
unpredictable huge page formation outcome, we could add a madvise()
option like Michal suggested [1], so users will know whether they got
huge pages or not and can act accordingly.


> What are intended use cases/applications that could benefit? I doubt
> databases and virtualization are really a good fit - they know how to
> handle hugetlbfs just fine.

Roman and Jason have provided some use cases [2,3].

[1]https://lore.kernel.org/linux-mm/20200907072014.GD30144@dhcp22.suse.cz/
[2]https://lore.kernel.org/linux-mm/20200903162527.GF60440@carbon.dhcp.thefacebook.com/
[3]https://lore.kernel.org/linux-mm/20200903165051.GN24045@ziepe.ca/

—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-08 14:27         ` Matthew Wilcox
@ 2020-09-08 15:50           ` Zi Yan
  2020-09-09 12:11           ` Jason Gunthorpe
  1 sibling, 0 replies; 82+ messages in thread
From: Zi Yan @ 2020-09-08 15:50 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Hildenbrand, Roman Gushchin, Kirill A. Shutemov, linux-mm,
	Rik van Riel, Kirill A . Shutemov, Shakeel Butt, Yang Shi,
	David Nellans, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1910 bytes --]

On 8 Sep 2020, at 10:27, Matthew Wilcox wrote:

> On Tue, Sep 08, 2020 at 10:05:11AM -0400, Zi Yan wrote:
>> On 8 Sep 2020, at 7:57, David Hildenbrand wrote:
>>> I have concerns if we would silently use 1~GB THPs in most scenarios
>>> where we would have used 2~MB THP. I'd appreciate a trigger to
>>> explicitly enable that - MADV_HUGEPAGE is not sufficient because some
>>> applications relying on that assume that the THP size will be 2~MB
>>> (especially, if you want sparse, large VMAs).
>>
>> This patchset is not intended to silently use 1GB THP in place of 2MB THP.
>> First of all, there is a knob /sys/kernel/mm/transparent_hugepage/enable_1GB
>> to enable 1GB THP explicitly. Also, 1GB THP is allocated from a reserved CMA
>> region (although I had alloc_contig_pages as a fallback, which can be removed
>> in next version), so users need to add hugepage_cma=nG kernel parameter to
>> enable 1GB THP allocation. If a finer control is necessary, we can add
>> a new MADV_HUGEPAGE_1GB for 1GB THP.
>
> I think we do need that flag.  Machines don't run a single workload
> (arguably with VMs, we're getting closer to going back to the single
> workload per machine, but that's a different matter).  So if there's
> one app that wants 2MB pages and one that wants 1GB pages, we need to
> be able to distinguish them.
>
> I could also see there being an app which benefits from 1GB for
> one mapping and prefers 2MB for a different mapping, so I think the
> per-mapping madvise flag is best.
>
> I'm a little wary of encoding the size of an x86 PUD in the Linux API
> though.  Probably best to follow the example set in
> include/uapi/asm-generic/hugetlb_encode.h, but I don't love it.  I
> don't have a better suggestion though.

Using hugetlb_encode.h makes sense to me. I will add it in the next version.
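
As a concrete sketch, the real encoding from
include/uapi/asm-generic/hugetlb_encode.h stores log2(size) in the high
bits, and a madvise-side variant could reuse the same pattern (the MADV_*
names below are hypothetical; only the HUGETLB_FLAG_ENCODE_* values are
the existing ones):

    #include <sys/mman.h>

    /* Existing encoding (asm-generic/hugetlb_encode.h): log2(size) << 26. */
    #define HUGETLB_FLAG_ENCODE_SHIFT  26
    #define HUGETLB_FLAG_ENCODE_2MB    (21 << HUGETLB_FLAG_ENCODE_SHIFT)
    #define HUGETLB_FLAG_ENCODE_1GB    (30 << HUGETLB_FLAG_ENCODE_SHIFT)

    /* Hypothetical madvise-side reuse of that encoding. */
    #define MADV_HUGEPAGE_2MB  (MADV_HUGEPAGE | HUGETLB_FLAG_ENCODE_2MB)
    #define MADV_HUGEPAGE_1GB  (MADV_HUGEPAGE | HUGETLB_FLAG_ENCODE_1GB)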

Thanks for the suggestion.


—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-08 15:09           ` Zi Yan
@ 2020-09-08 19:58             ` Roman Gushchin
  2020-09-09  4:01               ` John Hubbard
  2020-09-09  7:15               ` Michal Hocko
  0 siblings, 2 replies; 82+ messages in thread
From: Roman Gushchin @ 2020-09-08 19:58 UTC (permalink / raw)
  To: Zi Yan
  Cc: Michal Hocko, linux-mm, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

On Tue, Sep 08, 2020 at 11:09:25AM -0400, Zi Yan wrote:
> On 7 Sep 2020, at 3:20, Michal Hocko wrote:
> 
> > On Fri 04-09-20 14:10:45, Roman Gushchin wrote:
> >> On Fri, Sep 04, 2020 at 09:42:07AM +0200, Michal Hocko wrote:
> > [...]
> >>> An explicit opt-in sounds much more appropriate to me as well. If we go
> >>> with a specific API then I would not make it 1GB pages specific. Why
> >>> cannot we have an explicit interface to "defragment" address space
> >>> range into large pages and the kernel would use large pages where
> >>> appropriate? Or is the additional copying prohibitively expensive?
> >>
> >> Can you, please, elaborate a bit more here? It seems like madvise(MADV_HUGEPAGE)
> >> provides something similar to what you're describing, but there are lot
> >> of details here, so I'm probably missing something.
> >
> > MADV_HUGEPAGE is controlling a preference for THP to be used for a
> > particular address range. So it looks similar but the historical
> > behavior is to control page faults as well and the behavior depends on
> > the global setup.
> >
> > I've had in mind something much simpler. Effectively an API to invoke
> > khugepaged (like) functionality synchronously from the calling context
> > on the specific address range. It could be more aggressive than the
> > regular khugepaged and create even 1G pages (or as large THPs as page
> > tables can handle on the particular arch for that matter).
> >
> > As this would be an explicit call we do not have to be worried about
> > the resulting latency because it would be an explicit call by the
> > userspace.  The default khugepaged has a harder position there because it
> > has no understanding of the target address space and cannot make any
> > cost/benefit evaluation so it has to be more conservative.
> 
> Something like MADV_HUGEPAGE_SYNC? It would be useful, since users have
> better and clearer control of getting huge pages from the kernel and
> know when they will pay the cost of getting the huge pages.
> 
> I would think the suggestion is more that the huge page control options
> currently provided by the kernel do not have a predictable performance
> outcome, since MADV_HUGEPAGE is a best-effort option and does not tell
> users whether the marked virtual address range is backed by huge pages
> or not when the madvise returns. MADV_HUGEPAGE_SYNC would provide a
> deterministic result to users on whether the huge page(s) are formed
> or not.

Yeah, I agree with Michal here, we need a more straightforward interface.

The hard question here is how hard the kernel should try to allocate
a gigantic page and how fast it should give up and return an error?
I'd say to try really hard if there are some chances to succeed,
so that if an error is returned, there are no more reasons to retry.
Any objections/better ideas here?

Given that we need to pass a page size, we probably need either to introduce
a new syscall (madvise2?) with an additional argument, or add a bunch
of new madvise flags, like MADV_HUGEPAGE_SYNC + encoded 2MB, 1GB etc.

Idk what is better long-term, but new madvise flags are probably slightly
easier to deal with in the development process.
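
To make the two shapes concrete (both are hypothetical sketches, nothing
that exists today):

    /* Option 1: a new syscall that carries the page size explicitly. */
    int madvise2(void *addr, size_t length, int advice,
                 unsigned long huge_page_size);

    /* Option 2: keep madvise() and add size-encoded advice values,
     * e.g. MADV_HUGEPAGE_SYNC_2MB / MADV_HUGEPAGE_SYNC_1GB. */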

Thanks!


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-08 19:58             ` Roman Gushchin
@ 2020-09-09  4:01               ` John Hubbard
  2020-09-09  7:15               ` Michal Hocko
  1 sibling, 0 replies; 82+ messages in thread
From: John Hubbard @ 2020-09-09  4:01 UTC (permalink / raw)
  To: Roman Gushchin, Zi Yan
  Cc: Michal Hocko, linux-mm, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

On 9/8/20 12:58 PM, Roman Gushchin wrote:
> On Tue, Sep 08, 2020 at 11:09:25AM -0400, Zi Yan wrote:
>> On 7 Sep 2020, at 3:20, Michal Hocko wrote:
>>> On Fri 04-09-20 14:10:45, Roman Gushchin wrote:
>>>> On Fri, Sep 04, 2020 at 09:42:07AM +0200, Michal Hocko wrote:
>>> [...]
>> Something like MADV_HUGEPAGE_SYNC? It would be useful, since users have
>> better and clearer control of getting huge pages from the kernel and
>> know when they will pay the cost of getting the huge pages.
>>
>> I would think the suggestion is more that the huge page control options
>> currently provided by the kernel do not have a predictable performance
>> outcome, since MADV_HUGEPAGE is a best-effort option and does not tell
>> users whether the marked virtual address range is backed by huge pages
>> or not when the madvise returns. MADV_HUGEPAGE_SYNC would provide a
>> deterministic result to users on whether the huge page(s) are formed
>> or not.
> 
> Yeah, I agree with Michal here, we need a more straightforward interface.
> 
> The hard question here is how hard the kernel should try to allocate
> a gigantic page and how fast it should give up and return an error?
> I'd say to try really hard if there are some chances to succeed,
> so that if an error is returned, there are no more reasons to retry.
> Any objections/better ideas here?

I agree, especially because this is starting to look a lot more like an
allocation call. And I think it would be appropriate for the kernel to
try approximately as hard to provide these 1GB pages as it would to
allocate normal memory to a process.

In fact, for a moment I thought, why not go all the way and make this
actually be a true allocation? However, given that we still have
operations that require page splitting, with no good way to call back
user space to notify it that its "allocated" huge pages are being split,
that fails. But it's still pretty close.


> 
> Given that we need to pass a page size, we probably need either to introduce
> a new syscall (madvise2?) with an additional argument, or add a bunch
> of new madvise flags, like MADV_HUGEPAGE_SYNC + encoded 2MB, 1GB etc.
> 
> Idk what is better long-term, but new madvise flags are probably slightly
> easier to deal with in the development process.
> 

Probably either an MADV_* flag or a new syscall would work fine. But
given that this seems like a pretty distinct new capability, one with
options and man page documentation and possibly future flags itself, I'd
lean toward making it its own new syscall, maybe:

     compact_huge_pages(nbytes or npages, flags /* page size, etc */);

...thus leaving madvise() and its remaining flags still available, to
further refine things.
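
As a usage sketch (syscall name, arguments and flags are all hypothetical
at this point, following the prototype above):

    void *buf = mmap(NULL, 8UL << 30, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* Hypothetical call: ask the kernel to synchronously back the
     * mapping with 1GB pages where it can. */
    if (compact_huge_pages(8UL << 30 /* nbytes */, HUGE_PAGE_1GB) < 0) {
        /* Could not form 1GB pages; the mapping still works with
         * whatever page size the kernel chose. */
    }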


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-08 14:41           ` Rik van Riel
  2020-09-08 15:02             ` David Hildenbrand
@ 2020-09-09  7:04             ` Michal Hocko
  2020-09-09 13:19               ` Rik van Riel
  1 sibling, 1 reply; 82+ messages in thread
From: Michal Hocko @ 2020-09-09  7:04 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Zi Yan, David Hildenbrand, Roman Gushchin, Kirill A. Shutemov,
	linux-mm, Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel

On Tue 08-09-20 10:41:10, Rik van Riel wrote:
> On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:
> 
> > A global knob is insufficient. 1G pages will become a very precious
> > resource as it requires a pre-allocation (reservation). So it really
> > has
> > to be an opt-in and the question is whether there is also some sort
> > of
> > access control needed.
> 
> The 1GB pages do not require that much in the way of
> pre-allocation. The memory can be obtained through CMA,
> which means it can be used for movable 4kB and 2MB
> allocations when not
> being used for 1GB pages.

That CMA has to be pre-reserved, right? That requires a configuration.

> That makes it relatively easy to set aside
> some fraction
> of system memory in every system for 1GB and movable
> allocations, and use it for whatever way it is needed
> depending on what workload(s) end up running on a system.

I was not talking about how easy or hard it is. My main concern is that
this is effectively a pre-reserved pool and a global knob is a very
suboptimal way to control access to it. I (rather) strongly believe this
should be an explicit opt-in and ideally not 1GB specific but rather
something that allows large pages to be created where there is a fit. See
other subthread for more details.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-08 19:58             ` Roman Gushchin
  2020-09-09  4:01               ` John Hubbard
@ 2020-09-09  7:15               ` Michal Hocko
  1 sibling, 0 replies; 82+ messages in thread
From: Michal Hocko @ 2020-09-09  7:15 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Zi Yan, linux-mm, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

On Tue 08-09-20 12:58:59, Roman Gushchin wrote:
> On Tue, Sep 08, 2020 at 11:09:25AM -0400, Zi Yan wrote:
> > On 7 Sep 2020, at 3:20, Michal Hocko wrote:
> > 
> > > On Fri 04-09-20 14:10:45, Roman Gushchin wrote:
> > >> On Fri, Sep 04, 2020 at 09:42:07AM +0200, Michal Hocko wrote:
> > > [...]
> > >>> An explicit opt-in sounds much more appropriate to me as well. If we go
> > >>> with a specific API then I would not make it 1GB pages specific. Why
> > >>> cannot we have an explicit interface to "defragment" address space
> > >>> range into large pages and the kernel would use large pages where
> > >>> appropriate? Or is the additional copying prohibitively expensive?
> > >>
> > >> Can you, please, elaborate a bit more here? It seems like madvise(MADV_HUGEPAGE)
> > >> provides something similar to what you're describing, but there are lot
> > >> of details here, so I'm probably missing something.
> > >
> > > MADV_HUGEPAGE is controlling a preference for THP to be used for a
> > > particular address range. So it looks similar but the historical
> > > behavior is to control page faults as well and the behavior depends on
> > > the global setup.
> > >
> > > I've had in mind something much simpler. Effectively an API to invoke
> > > khugepaged (like) functionality synchronously from the calling context
> > > on the specific address range. It could be more aggressive than the
> > > regular khugepaged and create even 1G pages (or as large THPs as page
> > > tables can handle on the particular arch for that matter).
> > >
> > > As this would be an explicit call we do not have to be worried about
> > > the resulting latency because it would be an explicit call by the
> > > userspace.  The default khugepaged has a harder position there because it
> > > has no understanding of the target address space and cannot make any
> > > cost/benefit evaluation so it has to be more conservative.
> > 
> > Something like MADV_HUGEPAGE_SYNC? It would be useful, since users have
> > better and clearer control of getting huge pages from the kernel and
> > know when they will pay the cost of getting the huge pages.

The name is not really that important. The crucial design decisions are:
- THP allocation time - #PF and/or madvise context
- lazy/sync instantiation
- huge page sizes controllable by the userspace?
- aggressiveness - how hard to try
- internal fragmentation - allow creating THPs on sparsely populated or
  unpopulated ranges
- do we need some sort of access control or privilege check, as some THPs
  would be a really scarce resource (like those that require pre-reservation).

> > I would think the suggestion is more that the huge page control options
> > currently provided by the kernel do not have a predictable performance
> > outcome, since MADV_HUGEPAGE is a best-effort option and does not tell
> > users whether the marked virtual address range is backed by huge pages
> > or not when the madvise returns. MADV_HUGEPAGE_SYNC would provide a
> > deterministic result to users on whether the huge page(s) are formed
> > or not.
> 
> Yeah, I agree with Michal here, we need a more straightforward interface.
> 
> The hard question here is how hard the kernel should try to allocate
> a gigantic page and how fast it should give up and return an error?
> I'd say to try really hard if there are some chances to succeed,
> so that if an error is returned, there are no more reasons to retry.
> Any objections/better ideas here?

If this is going to be an explicit interface like madvise then I would
follow the same semantics as hugetlb page allocation - aka try as hard
as feasible (whatever that means).

> Given that we need to pass a page size, we probably need either to introduce
> a new syscall (madvise2?) with an additional argument, or add a bunch
> of new madvise flags, like MADV_HUGEPAGE_SYNC + encoded 2MB, 1GB etc.

Do we really need to bother userspace with making a decision about the
page size? I would expect that userspace only cares about getting a huge
page backed memory range. The larger the pages the better. It is up to
the kernel to handle the resource control here. After all, THPs can be
split/reclaimed under memory pressure, so we do not want to make any
promises about the pages backing any mapping.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-08 14:27         ` Matthew Wilcox
  2020-09-08 15:50           ` Zi Yan
@ 2020-09-09 12:11           ` Jason Gunthorpe
  2020-09-09 12:32             ` Matthew Wilcox
  1 sibling, 1 reply; 82+ messages in thread
From: Jason Gunthorpe @ 2020-09-09 12:11 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Zi Yan, David Hildenbrand, Roman Gushchin, Kirill A. Shutemov,
	linux-mm, Rik van Riel, Kirill A . Shutemov, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel

On Tue, Sep 08, 2020 at 03:27:58PM +0100, Matthew Wilcox wrote:
> On Tue, Sep 08, 2020 at 10:05:11AM -0400, Zi Yan wrote:
> > On 8 Sep 2020, at 7:57, David Hildenbrand wrote:
> > > I have concerns if we would silently use 1~GB THPs in most scenarios
> > > where we would have used 2~MB THP. I'd appreciate a trigger to
> > > explicitly enable that - MADV_HUGEPAGE is not sufficient because some
> > > applications relying on that assume that the THP size will be 2~MB
> > > (especially, if you want sparse, large VMAs).
> > 
> > This patchset is not intended to silently use 1GB THP in place of 2MB THP.
> > First of all, there is a knob /sys/kernel/mm/transparent_hugepage/enable_1GB
> > to enable 1GB THP explicitly. Also, 1GB THP is allocated from a reserved CMA
> > region (although I had alloc_contig_pages as a fallback, which can be removed
> > in next version), so users need to add hugepage_cma=nG kernel parameter to
> > enable 1GB THP allocation. If a finer control is necessary, we can add
> > a new MADV_HUGEPAGE_1GB for 1GB THP.
> 
> I think we do need that flag.  Machines don't run a single workload
> (arguably with VMs, we're getting closer to going back to the single
> workload per machine, but that's a different matter).  So if there's
> one app that wants 2MB pages and one that wants 1GB pages, we need to
> be able to distinguish them.
> 
> I could also see there being an app which benefits from 1GB for
> > one mapping and prefers 2MB for a different mapping, so I think the
> per-mapping madvise flag is best.

I wonder if apps really care about the specific page size?
Particularly from a portability view?

The general app desire seems to be the need for 'efficient' memory (e.g.
because it is highly accessed), and I suspect it comes with a desire to
populate the pages too.

Maybe doing something with MAP_POPULATE is an idea?

e.g. if I ask for 1GB with MAP_POPULATE, it seems fairly natural that the
thing that comes back should be a 1GB THP. If I ask for only 0.5GB then it
could be 2M pages, or whatever, depending on arch support.
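
i.e. roughly this (standard mmap flags; whether the kernel would then pick
a 1GB THP is the new, hypothetical behaviour, not what happens today):

    #include <sys/mman.h>

    static void *populate_1g(void)
    {
        /* 1GB anonymous mapping, populated up front.  Under the idea
         * above, the kernel would be free to back it with a single
         * 1GB THP when alignment and reserves allow. */
        return mmap(NULL, 1UL << 30, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    }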

Jason


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-09 12:11           ` Jason Gunthorpe
@ 2020-09-09 12:32             ` Matthew Wilcox
  2020-09-09 13:14               ` Jason Gunthorpe
  0 siblings, 1 reply; 82+ messages in thread
From: Matthew Wilcox @ 2020-09-09 12:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Zi Yan, David Hildenbrand, Roman Gushchin, Kirill A. Shutemov,
	linux-mm, Rik van Riel, Kirill A . Shutemov, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel

On Wed, Sep 09, 2020 at 09:11:17AM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 08, 2020 at 03:27:58PM +0100, Matthew Wilcox wrote:
> > I could also see there being an app which benefits from 1GB for
> > one mapping and prefers 2MB for a different mapping, so I think the
> > per-mapping madvise flag is best.
> 
> I wonder if apps really care about the specific page size?
> Particularly from a portability view?

No, they don't.  They just want to run as fast as possible ;-)

> The general app desire seems to be the need for 'efficient' memory (e.g.
> because it is highly accessed), and I suspect it comes with a desire to
> populate the pages too.

The problem with a MAP_GOES_FASTER flag is that everybody sets it.
Any flag name needs to convey its drawbacks as well as its advantages.
Maybe MAP_EXTREMELY_COARSE_WORKINGSET would do that -- the VM will work
in terms of 1GB pages for this mapping, so any swap-out is going to take
out an entire 1GB at once.

But here's the thing ... we already allow
	mmap(MAP_POPULATE | MAP_HUGETLB | MAP_HUGE_1GB)

So if we're not doing THP, what's the point of this thread?
My understanding of THP is "Application doesn't need to change, kernel
makes a decision about what page size is best based on entire system
state and process's behaviour".
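
For reference, that existing hugetlbfs path looks like this today; it needs
1GB pages reserved up front (e.g. hugepagesz=1G hugepages=N on the kernel
command line) and it fails instead of falling back:

    #include <sys/mman.h>

    #ifndef MAP_HUGE_SHIFT
    #define MAP_HUGE_SHIFT 26
    #endif
    #ifndef MAP_HUGE_1GB
    #define MAP_HUGE_1GB   (30 << MAP_HUGE_SHIFT)
    #endif

    static void *alloc_1g_hugetlb(void)
    {
        return mmap(NULL, 1UL << 30, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
                    MAP_HUGE_1GB | MAP_POPULATE, -1, 0);
    }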

An madvise flag is a different beast; that's just letting the kernel
know what the app thinks its behaviour will be.  The kernel can pay
as much (or as little) attention to that hint as it sees fit.  And of
course, it can change over time (either by kernel release as we change
the algorithms, or simply from one minute to the next as more or less
memory becomes available).


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-09 12:32             ` Matthew Wilcox
@ 2020-09-09 13:14               ` Jason Gunthorpe
  2020-09-09 13:27                 ` David Hildenbrand
  0 siblings, 1 reply; 82+ messages in thread
From: Jason Gunthorpe @ 2020-09-09 13:14 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Zi Yan, David Hildenbrand, Roman Gushchin, Kirill A. Shutemov,
	linux-mm, Rik van Riel, Kirill A . Shutemov, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel

On Wed, Sep 09, 2020 at 01:32:44PM +0100, Matthew Wilcox wrote:

> But here's the thing ... we already allow
> 	mmap(MAP_POPULATE | MAP_HUGETLB | MAP_HUGE_1GB)
> 
> So if we're not doing THP, what's the point of this thread?

I wondered that too..

> An madvise flag is a different beast; that's just letting the kernel
> know what the app thinks its behaviour will be.  The kernel can pay

But madvise is too late: the VMA already has an address, and if it is not
1G aligned it cannot become a 1G THP.

Jason


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-09  7:04             ` Michal Hocko
@ 2020-09-09 13:19               ` Rik van Riel
  2020-09-09 13:43                 ` David Hildenbrand
  2020-09-09 13:59                 ` Michal Hocko
  0 siblings, 2 replies; 82+ messages in thread
From: Rik van Riel @ 2020-09-09 13:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Zi Yan, David Hildenbrand, Roman Gushchin, Kirill A. Shutemov,
	linux-mm, Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1203 bytes --]

On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote:
> On Tue 08-09-20 10:41:10, Rik van Riel wrote:
> > On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:
> > 
> > > A global knob is insufficient. 1G pages will become a very
> > > precious
> > > resource as it requires a pre-allocation (reservation). So it
> > > really
> > > has
> > > to be an opt-in and the question is whether there is also some
> > > sort
> > > of
> > > access control needed.
> > 
> > The 1GB pages do not require that much in the way of
> > pre-allocation. The memory can be obtained through CMA,
> > which means it can be used for movable 4kB and 2MB
> > allocations when not
> > being used for 1GB pages.
> 
> That CMA has to be pre-reserved, right? That requires a
> configuration.

To some extent, yes.

However, because that pool can be used for movable 4kB and 2MB pages
as well as for 1GB pages, it would be easy to just set the size of
that pool to e.g. 1/3 or even 1/2 of memory for every system.

It isn't like the pool needs to be the exact right size. We
just need to avoid the "highmem problem" of having too little
memory for kernel allocations.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-09 13:14               ` Jason Gunthorpe
@ 2020-09-09 13:27                 ` David Hildenbrand
  2020-09-10 10:02                   ` William Kucharski
  0 siblings, 1 reply; 82+ messages in thread
From: David Hildenbrand @ 2020-09-09 13:27 UTC (permalink / raw)
  To: Jason Gunthorpe, Matthew Wilcox
  Cc: Zi Yan, Roman Gushchin, Kirill A. Shutemov, linux-mm,
	Rik van Riel, Kirill A . Shutemov, Shakeel Butt, Yang Shi,
	David Nellans, linux-kernel

On 09.09.20 15:14, Jason Gunthorpe wrote:
> On Wed, Sep 09, 2020 at 01:32:44PM +0100, Matthew Wilcox wrote:
> 
>> But here's the thing ... we already allow
>> 	mmap(MAP_POPULATE | MAP_HUGETLB | MAP_HUGE_1GB)
>>
>> So if we're not doing THP, what's the point of this thread?
> 
> I wondered that too..
> 
>> An madvise flag is a different beast; that's just letting the kernel
>> know what the app thinks its behaviour will be.  The kernel can pay
> 
> But madvise is too late: the VMA already has an address, and if it is not
> 1G aligned it cannot become a 1G THP.

That's why user space (like QEMU) is THP-aware and selects an address
that is aligned to the expected THP granularity (e.g., 2MB on x86_64).
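
A minimal sketch of that trick, under the assumption that we over-reserve
and then trim to an aligned start (real implementations such as QEMU's
differ in the details):

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    /* Reserve "size" bytes whose start address is "align"-aligned. */
    static void *reserve_aligned(size_t size, size_t align)
    {
        size_t len = size + align;
        uintptr_t base, start;
        void *p = mmap(NULL, len, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

        if (p == MAP_FAILED)
            return MAP_FAILED;

        base = (uintptr_t)p;
        start = (base + align - 1) & ~(align - 1);

        /* Trim the unaligned head and the unused tail. */
        if (start > base)
            munmap((void *)base, start - base);
        munmap((void *)(start + size), base + len - (start + size));

        return (void *)start;
    }

With a 1GB-aligned start (e.g. reserve_aligned(1UL << 30, 1UL << 30)),
mapping the range with a PUD entry at least becomes possible.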

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-09 13:19               ` Rik van Riel
@ 2020-09-09 13:43                 ` David Hildenbrand
  2020-09-09 13:49                   ` Rik van Riel
  2020-09-10  7:32                   ` Michal Hocko
  2020-09-09 13:59                 ` Michal Hocko
  1 sibling, 2 replies; 82+ messages in thread
From: David Hildenbrand @ 2020-09-09 13:43 UTC (permalink / raw)
  To: Rik van Riel, Michal Hocko
  Cc: Zi Yan, Roman Gushchin, Kirill A. Shutemov, linux-mm,
	Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt, Yang Shi,
	David Nellans, linux-kernel

On 09.09.20 15:19, Rik van Riel wrote:
> On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote:
>> On Tue 08-09-20 10:41:10, Rik van Riel wrote:
>>> On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:
>>>
>>>> A global knob is insufficient. 1G pages will become a very
>>>> precious
>>>> resource as it requires a pre-allocation (reservation). So it
>>>> really
>>>> has
>>>> to be an opt-in and the question is whether there is also some
>>>> sort
>>>> of
>>>> access control needed.
>>>
>>> The 1GB pages do not require that much in the way of
>>> pre-allocation. The memory can be obtained through CMA,
>>> which means it can be used for movable 4kB and 2MB
>>> allocations when not
>>> being used for 1GB pages.
>>
>> That CMA has to be pre-reserved, right? That requires a
>> configuration.
> 
> To some extent, yes.
> 
> However, because that pool can be used for movable
> 4kB and 2MB
> pages as well as for 1GB pages, it would be easy to just set
> the size of that pool to eg. 1/3 or even 1/2 of memory for every
> system.
> 
> It isn't like the pool needs to be the exact right size. We
> just need to avoid the "highmem problem" of having too little
> memory for kernel allocations.
> 

I am not sure I like the trend towards CMA that we are seeing, reserving
huge buffers for specific users (and eventually even doing it
automatically).

What we actually want is ZONE_MOVABLE with relaxed guarantees, such that
anybody who requires large, unmovable allocations can use it.

I once played with the idea of having ZONE_PREFER_MOVABLE, which
a) Is the primary choice for movable allocations
b) Is allowed to contain unmovable allocations (esp., gigantic pages)
c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead of
running out of memory

If someone messes up the zone ratio, issues known from zone imbalances
are avoided - large allocations simply become less likely to succeed. In
contrast to ZONE_MOVABLE, memory offlining is not guaranteed to work.
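
One possible reading of a)-c) as fallback lists (purely a conceptual
illustration, not actual zone or allocator code, and the exact ordering is
of course up for discussion):

    /* Conceptual illustration only. */
    enum zone_kind { KIND_NORMAL, KIND_PREFER_MOVABLE };

    /* a) movable allocations: ZONE_PREFER_MOVABLE first, then ZONE_NORMAL. */
    static const enum zone_kind movable_fallback[] = {
        KIND_PREFER_MOVABLE, KIND_NORMAL,
    };

    /* b) gigantic allocations may be served from ZONE_PREFER_MOVABLE.
     * c) other unmovable allocations prefer ZONE_NORMAL, but spill into
     *    ZONE_PREFER_MOVABLE instead of failing outright. */
    static const enum zone_kind unmovable_fallback[] = {
        KIND_NORMAL, KIND_PREFER_MOVABLE,
    };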

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 01/16] mm: add pagechain container for storing multiple pages.
  2020-09-07 15:11     ` Zi Yan
@ 2020-09-09 13:46       ` Kirill A. Shutemov
  2020-09-09 14:15         ` Zi Yan
  0 siblings, 1 reply; 82+ messages in thread
From: Kirill A. Shutemov @ 2020-09-09 13:46 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Roman Gushchin, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

On Mon, Sep 07, 2020 at 11:11:05AM -0400, Zi Yan wrote:
> On 7 Sep 2020, at 8:22, Kirill A. Shutemov wrote:
> 
> > On Wed, Sep 02, 2020 at 02:06:13PM -0400, Zi Yan wrote:
> >> From: Zi Yan <ziy@nvidia.com>
> >>
> >> When depositing page table pages for 1GB THPs, we need 512 PTE pages +
> >> 1 PMD page. Instead of counting and depositing 513 pages, we can use the
> >> PMD page as a leader page and chain the rest 512 PTE pages with ->lru.
> >> This, however, prevents us depositing PMD pages with ->lru, which is
> >> currently used by depositing PTE pages for 2MB THPs. So add a new
> >> pagechain container for PMD pages.
> >>
> >> Signed-off-by: Zi Yan <ziy@nvidia.com>
> >
> > Just deposit it to a linked list in the mm_struct as we do for PMD if
> > split ptl disabled.
> >
> 
> Thank you for checking the patches. Since we don’t have PUD split lock
> yet, I store the PMD page table pages in a newly added linked list head
> in mm_struct like you suggested above.
> 
> I was too vague about my pagechain design for depositing page table pages
> for PUD THPs. Sorry about the confusion. Let me clarify why
> I am doing this pagechain here too. I am sure there would be
> some other designs and I am happy to change my code.
> 
> In my design, I did not store all page table pages in a single list.
> I first deposit 512 PTE pages in one PMD page table page’s pmd_huge_pte
> using pgtable_trans_huge_depsit(), then deposit the PMD page to
> a newly added linked list in mm_struct. Since pmd_huge_pte shares space
> with half of lru in struct page, we cannot use lru to link all PMD
> pages together. As a result, I added pagechain. Also in this way,
> we can avoid these things:
> 
> 1. when we withdraw the PMD page during PUD THP split, we don’t need
> to withdraw 513 pages, set up one PMD page, and then deposit 512 PTE pages
> in that PMD page.
> 
> 2. we don’t mix PMD page table pages and PTE page table pages in a single
> list, since they are initialized in different ways. Otherwise, we need
> to maintain a subtle rule in the single page table page list that in every
> 513 pages, first one is PMD page table page and the rest are PTE page
> table pages.
> 
> As I am typing, I also realize that my current design does not work
> when PMD split lock is disabled, so I will fix it. I would store PMD pages
> and PTE pages in two separate lists in mm_struct.
> 
> 
> Any comments?

Okay, fair enough.

Although, I think you can get away without a new data structure. We don't
need a doubly linked list to deposit page tables. You can rework the PTE
table deposit code to use a singly linked list with one pointer of ->lru
(with a proper name) and make the PMD table deposit use the other one.
This way you can avoid the conflict over ->lru.

Does it make sense?

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 03/16] mm: proc: add 1GB THP kpageflag.
  2020-09-02 18:06 ` [RFC PATCH 03/16] mm: proc: add 1GB THP kpageflag Zi Yan
@ 2020-09-09 13:46   ` Kirill A. Shutemov
  0 siblings, 0 replies; 82+ messages in thread
From: Kirill A. Shutemov @ 2020-09-09 13:46 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Roman Gushchin, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

On Wed, Sep 02, 2020 at 02:06:15PM -0400, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Bit 27 is used to identify 1GB THP.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  fs/proc/page.c                         | 2 ++
>  include/uapi/linux/kernel-page-flags.h | 2 ++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/fs/proc/page.c b/fs/proc/page.c
> index f3b39a7d2bf3..e4e2ad3612c9 100644
> --- a/fs/proc/page.c
> +++ b/fs/proc/page.c
> @@ -161,6 +161,8 @@ u64 stable_page_flags(struct page *page)
>  			u |= BIT_ULL(KPF_ZERO_PAGE);
>  			u |= BIT_ULL(KPF_THP);
>  		}
> +		if (compound_order(head) == HPAGE_PUD_ORDER)
> +			u |= 1 << KPF_PUD_THP;
>  	} else if (is_zero_pfn(page_to_pfn(page)))
>  		u |= BIT_ULL(KPF_ZERO_PAGE);
>  
> diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
> index 6f2f2720f3ac..cdeb33ab655c 100644
> --- a/include/uapi/linux/kernel-page-flags.h
> +++ b/include/uapi/linux/kernel-page-flags.h
> @@ -36,5 +36,7 @@
>  #define KPF_ZERO_PAGE		24
>  #define KPF_IDLE		25
>  #define KPF_PGTABLE		26
> +#define KPF_PUD_THP		27
> +

Redundant newline.

>  #endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */
> -- 
> 2.28.0
> 
> 

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-09 13:43                 ` David Hildenbrand
@ 2020-09-09 13:49                   ` Rik van Riel
  2020-09-09 13:54                     ` David Hildenbrand
  2020-09-10  7:32                   ` Michal Hocko
  1 sibling, 1 reply; 82+ messages in thread
From: Rik van Riel @ 2020-09-09 13:49 UTC (permalink / raw)
  To: David Hildenbrand, Michal Hocko
  Cc: Zi Yan, Roman Gushchin, Kirill A. Shutemov, linux-mm,
	Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt, Yang Shi,
	David Nellans, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2221 bytes --]

On Wed, 2020-09-09 at 15:43 +0200, David Hildenbrand wrote:
> On 09.09.20 15:19, Rik van Riel wrote:
> > On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote:
> > 
> > > That CMA has to be pre-reserved, right? That requires a
> > > configuration.
> > 
> > To some extent, yes.
> > 
> > However, because that pool can be used for movable
> > 4kB and 2MB
> > pages as well as for 1GB pages, it would be easy to just set
> > the size of that pool to eg. 1/3 or even 1/2 of memory for every
> > system.
> > 
> > It isn't like the pool needs to be the exact right size. We
> > just need to avoid the "highmem problem" of having too little
> > memory for kernel allocations.
> > 
> 
> I am not sure I like the trend towards CMA that we are seeing,
> reserving
> huge buffers for specific users (and eventually even doing it
> automatically).
> 
> What we actually want is ZONE_MOVABLE with relaxed guarantees, such
> that
> anybody who requires large, unmovable allocations can use it.
> 
> I once played with the idea of having ZONE_PREFER_MOVABLE, which
> a) Is the primary choice for movable allocations
> b) Is allowed to contain unmovable allocations (esp., gigantic pages)
> c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead
> of
> running out of memory
> 
> If someone messes up the zone ratio, issues known from zone
> imbalances
> are avoided - large allocations simply become less likely to succeed.
> In
> contrast to ZONE_MOVABLE, memory offlining is not guaranteed to work.

I really like that idea. This will be easier to deal with than
a "just the right size" CMA area, and seems like it would be
pretty forgiving in both directions.

Keeping unmovable allocations contained to one part of memory should
also make compaction within the ZONE_PREFER_MOVABLE area a lot easier
than compaction for higher-order allocations is today.

I suspect your proposal solves a lot of issues at once.

For (c) from your proposal, we could even claim a whole
2MB or even 1GB area at once for unmovable allocations,
keeping those contained in a limited amount of physical
memory again, to make life easier on compaction.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-09 13:49                   ` Rik van Riel
@ 2020-09-09 13:54                     ` David Hildenbrand
  0 siblings, 0 replies; 82+ messages in thread
From: David Hildenbrand @ 2020-09-09 13:54 UTC (permalink / raw)
  To: Rik van Riel, Michal Hocko
  Cc: Zi Yan, Roman Gushchin, Kirill A. Shutemov, linux-mm,
	Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt, Yang Shi,
	David Nellans, linux-kernel

On 09.09.20 15:49, Rik van Riel wrote:
> On Wed, 2020-09-09 at 15:43 +0200, David Hildenbrand wrote:
>> On 09.09.20 15:19, Rik van Riel wrote:
>>> On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote:
>>>
>>>> That CMA has to be pre-reserved, right? That requires a
>>>> configuration.
>>>
>>> To some extent, yes.
>>>
>>> However, because that pool can be used for movable
>>> 4kB and 2MB
>>> pages as well as for 1GB pages, it would be easy to just set
>>> the size of that pool to eg. 1/3 or even 1/2 of memory for every
>>> system.
>>>
>>> It isn't like the pool needs to be the exact right size. We
>>> just need to avoid the "highmem problem" of having too little
>>> memory for kernel allocations.
>>>
>>
>> I am not sure I like the trend towards CMA that we are seeing,
>> reserving
>> huge buffers for specific users (and eventually even doing it
>> automatically).
>>
>> What we actually want is ZONE_MOVABLE with relaxed guarantees, such
>> that
>> anybody who requires large, unmovable allocations can use it.
>>
>> I once played with the idea of having ZONE_PREFER_MOVABLE, which
>> a) Is the primary choice for movable allocations
>> b) Is allowed to contain unmovable allocations (esp., gigantic pages)
>> c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead
>> of
>> running out of memory
>>
>> If someone messes up the zone ratio, issues known from zone
>> imbalances
>> are avoided - large allocations simply become less likely to succeed.
>> In
>> contrast to ZONE_MOVABLE, memory offlining is not guaranteed to work.
> 
> I really like that idea. This will be easier to deal with than
> a "just the right size" CMA area, and seems like it would be
> pretty forgiving in both directions.
> 

Yes, and it can be extended using memory hotplug.

> Keeping unmovable allocations
> contained to one part of memory
> should also make compaction within the ZONE_PREFER_MOVABLE area
> a lot easier than compaction for higher order allocations is
> today.
> 
> I suspect your proposal solves a lot of issues at once.
> 
> For (c) from your proposal, we could even claim a whole
> 2MB or even 1GB area at once for unmovable allocations,
> keeping those contained in a limited amount of physical
> memory again, to make life easier on compaction.
> 

Exactly, locally limiting unmovable allocations to a sane minimum.

(with some smart extra work, we could even convert ZONE_PREFER_MOVABLE
to ZONE_NORMAL, one memory section/block at a time where needed; that
direction always works. But that's very tricky.)

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-09 13:19               ` Rik van Riel
  2020-09-09 13:43                 ` David Hildenbrand
@ 2020-09-09 13:59                 ` Michal Hocko
  1 sibling, 0 replies; 82+ messages in thread
From: Michal Hocko @ 2020-09-09 13:59 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Zi Yan, David Hildenbrand, Roman Gushchin, Kirill A. Shutemov,
	linux-mm, Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel

On Wed 09-09-20 09:19:16, Rik van Riel wrote:
> On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote:
> > On Tue 08-09-20 10:41:10, Rik van Riel wrote:
> > > On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:
> > > 
> > > > A global knob is insufficient. 1G pages will become a very
> > > > precious
> > > > resource as it requires a pre-allocation (reservation). So it
> > > > really
> > > > has
> > > > to be an opt-in and the question is whether there is also some
> > > > sort
> > > > of
> > > > access control needed.
> > > 
> > > The 1GB pages do not require that much in the way of
> > > pre-allocation. The memory can be obtained through CMA,
> > > which means it can be used for movable 4kB and 2MB
> > > allocations when not
> > > being used for 1GB pages.
> > 
> > That CMA has to be pre-reserved, right? That requires a
> > configuration.
> 
> To some extent, yes.
> 
> However, because that pool can be used for movable
> 4kB and 2MB
> pages as well as for 1GB pages, it would be easy to just set
> the size of that pool to eg. 1/3 or even 1/2 of memory for every
> system.
> 
> It isn't like the pool needs to be the exact right size. We
> just need to avoid the "highmem problem" of having too little
> memory for kernel allocations.

Which is exactly the problem: this is not really suitable for uneducated
guesses. It is really hard to guess the right amount of lowmem. Think of
heavy fs metadata workloads and their memory demand. In my experience,
memory reclaim usually struggles when zones are imbalanced.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 05/16] mm: thp: handling 1GB THP reference bit.
  2020-09-02 18:06 ` [RFC PATCH 05/16] mm: thp: handling 1GB THP reference bit Zi Yan
@ 2020-09-09 14:09   ` Kirill A. Shutemov
  2020-09-09 14:36     ` Zi Yan
  0 siblings, 1 reply; 82+ messages in thread
From: Kirill A. Shutemov @ 2020-09-09 14:09 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Roman Gushchin, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

On Wed, Sep 02, 2020 at 02:06:17PM -0400, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Add PUD-level TLB flush ops and teach page_vma_mapped_talk about 1GB
> THPs.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  arch/x86/include/asm/pgtable.h |  3 +++
>  arch/x86/mm/pgtable.c          | 13 +++++++++++++
>  include/linux/mmu_notifier.h   | 13 +++++++++++++
>  include/linux/pgtable.h        | 14 ++++++++++++++
>  include/linux/rmap.h           |  1 +
>  mm/page_vma_mapped.c           | 33 +++++++++++++++++++++++++++++----
>  mm/rmap.c                      | 12 +++++++++---
>  7 files changed, 82 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 26255cac78c0..15334f5ba172 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1127,6 +1127,9 @@ extern int pudp_test_and_clear_young(struct vm_area_struct *vma,
>  extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
>  				  unsigned long address, pmd_t *pmdp);
>  
> +#define __HAVE_ARCH_PUDP_CLEAR_YOUNG_FLUSH
> +extern int pudp_clear_flush_young(struct vm_area_struct *vma,
> +				  unsigned long address, pud_t *pudp);
>  
>  #define pmd_write pmd_write
>  static inline int pmd_write(pmd_t pmd)
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index 7be73aee6183..e4a2dffcc418 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -633,6 +633,19 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
>  
>  	return young;
>  }
> +int pudp_clear_flush_young(struct vm_area_struct *vma,
> +			   unsigned long address, pud_t *pudp)
> +{
> +	int young;
> +
> +	VM_BUG_ON(address & ~HPAGE_PUD_MASK);
> +
> +	young = pudp_test_and_clear_young(vma, address, pudp);
> +	if (young)
> +		flush_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
> +
> +	return young;
> +}
>  #endif
>  
>  /**
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index b8200782dede..4ffa179e654f 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -557,6 +557,19 @@ static inline void mmu_notifier_range_init_migrate(
>  	__young;							\
>  })
>  
> +#define pudp_clear_flush_young_notify(__vma, __address, __pudp)		\
> +({									\
> +	int __young;							\
> +	struct vm_area_struct *___vma = __vma;				\
> +	unsigned long ___address = __address;				\
> +	__young = pudp_clear_flush_young(___vma, ___address, __pudp);	\
> +	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
> +						  ___address,		\
> +						  ___address +		\
> +							PUD_SIZE);	\
> +	__young;							\
> +})
> +
>  #define ptep_clear_young_notify(__vma, __address, __ptep)		\
>  ({									\
>  	int __young;							\
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 255275d5b73e..8ef358c386af 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -240,6 +240,20 @@ static inline int pmdp_clear_flush_young(struct vm_area_struct *vma,
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  #endif
>  
> +#ifndef __HAVE_ARCH_PUDP_CLEAR_YOUNG_FLUSH
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +extern int pudp_clear_flush_young(struct vm_area_struct *vma,
> +				  unsigned long address, pud_t *pudp);
> +#else
> +int pudp_clear_flush_young(struct vm_area_struct *vma,
> +				  unsigned long address, pud_t *pudp)
> +{
> +	BUILD_BUG();
> +	return 0;
> +}
> +#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD  */
> +#endif
> +
>  #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>  				       unsigned long address,
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 3a6adfa70fb0..0af61dd193d2 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -206,6 +206,7 @@ struct page_vma_mapped_walk {
>  	struct page *page;
>  	struct vm_area_struct *vma;
>  	unsigned long address;
> +	pud_t *pud;
>  	pmd_t *pmd;
>  	pte_t *pte;
>  	spinlock_t *ptl;
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index 5e77b269c330..d9d39ec06e21 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -145,9 +145,12 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>  	struct page *page = pvmw->page;
>  	pgd_t *pgd;
>  	p4d_t *p4d;
> -	pud_t *pud;
> +	pud_t pude;
>  	pmd_t pmde;
>  
> +	if (!pvmw->pte && !pvmw->pmd && pvmw->pud)
> +		return not_found(pvmw);
> +
>  	/* The only possible pmd mapping has been handled on last iteration */
>  	if (pvmw->pmd && !pvmw->pte)
>  		return not_found(pvmw);
> @@ -174,10 +177,31 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>  	p4d = p4d_offset(pgd, pvmw->address);
>  	if (!p4d_present(*p4d))
>  		return false;
> -	pud = pud_offset(p4d, pvmw->address);
> -	if (!pud_present(*pud))
> +	pvmw->pud = pud_offset(p4d, pvmw->address);
> +
> +	/*
> +	 * Make sure the pud value isn't cached in a register by the
> +	 * compiler and used as a stale value after we've observed a
> +	 * subsequent update.
> +	 */
> +	pude = READ_ONCE(*pvmw->pud);
> +	if (pud_trans_huge(pude)) {
> +		pvmw->ptl = pud_lock(mm, pvmw->pud);
> +		if (likely(pud_trans_huge(*pvmw->pud))) {
> +			if (pvmw->flags & PVMW_MIGRATION)
> +				return not_found(pvmw);
> +			if (pud_page(*pvmw->pud) != page)
> +				return not_found(pvmw);
> +			return true;
> +		} else {
> +			/* THP pud was split under us: handle on pmd level */
> +			spin_unlock(pvmw->ptl);
> +			pvmw->ptl = NULL;

Hm. What makes you sure the pmd table is established here?

I have not looked at PUD THP handling of MADV_DONTNEED yet, but a PMD
THP can become pmd_none() at any point (unless ptl is locked).

> +		}
> +	} else if (!pud_present(pude))
>  		return false;
> -	pvmw->pmd = pmd_offset(pud, pvmw->address);
> +
> +	pvmw->pmd = pmd_offset(pvmw->pud, pvmw->address);
>  	/*
>  	 * Make sure the pmd value isn't cached in a register by the
>  	 * compiler and used as a stale value after we've observed a
> @@ -213,6 +237,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>  	} else if (!pmd_present(pmde)) {
>  		return false;
>  	}
> +
>  	if (!map_pte(pvmw))
>  		goto next_pte;
>  	while (1) {
> diff --git a/mm/rmap.c b/mm/rmap.c

Why?

> index 10195a2421cf..77cec0658b76 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -803,9 +803,15 @@ static bool page_referenced_one(struct page *page, struct vm_area_struct *vma,
>  					referenced++;
>  			}
>  		} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> -			if (pmdp_clear_flush_young_notify(vma, address,
> -						pvmw.pmd))
> -				referenced++;
> +			if (pvmw.pmd) {
> +				if (pmdp_clear_flush_young_notify(vma, address,
> +							pvmw.pmd))
> +					referenced++;
> +			} else if (pvmw.pud) {
> +				if (pudp_clear_flush_young_notify(vma, address,
> +							pvmw.pud))
> +					referenced++;
> +			}
>  		} else {
>  			/* unexpected pmd-mapped page? */
>  			WARN_ON_ONCE(1);
> -- 
> 2.28.0
> 
> 

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 01/16] mm: add pagechain container for storing multiple pages.
  2020-09-09 13:46       ` Kirill A. Shutemov
@ 2020-09-09 14:15         ` Zi Yan
  0 siblings, 0 replies; 82+ messages in thread
From: Zi Yan @ 2020-09-09 14:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-mm, Roman Gushchin, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3070 bytes --]

On 9 Sep 2020, at 9:46, Kirill A. Shutemov wrote:

> On Mon, Sep 07, 2020 at 11:11:05AM -0400, Zi Yan wrote:
>> On 7 Sep 2020, at 8:22, Kirill A. Shutemov wrote:
>>
>>> On Wed, Sep 02, 2020 at 02:06:13PM -0400, Zi Yan wrote:
>>>> From: Zi Yan <ziy@nvidia.com>
>>>>
>>>> When depositing page table pages for 1GB THPs, we need 512 PTE pages +
>>>> 1 PMD page. Instead of counting and depositing 513 pages, we can use the
>>>> PMD page as a leader page and chain the rest 512 PTE pages with ->lru.
>>>> This, however, prevents us depositing PMD pages with ->lru, which is
>>>> currently used by depositing PTE pages for 2MB THPs. So add a new
>>>> pagechain container for PMD pages.
>>>>
>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>
>>> Just deposit it to a linked list in the mm_struct as we do for PMD if
>>> split ptl disabled.
>>>
>>
>> Thank you for checking the patches. Since we don’t have PUD split lock
>> yet, I store the PMD page table pages in a newly added linked list head
>> in mm_struct like you suggested above.
>>
>> I was too vague about my pagechain design for depositing page table pages
>> for PUD THPs. Sorry about the confusion. Let me clarify why
>> I am doing this pagechain here too. I am sure there would be
>> some other designs and I am happy to change my code.
>>
>> In my design, I did not store all page table pages in a single list.
>> I first deposit 512 PTE pages in one PMD page table page’s pmd_huge_pte
>> using pgtable_trans_huge_deposit(), then deposit the PMD page to
>> a newly added linked list in mm_struct. Since pmd_huge_pte shares space
>> with half of lru in struct page, we cannot use lru to link all PMD
>> pages together. As a result, I added pagechain. Also in this way,
>> we can avoid these things:
>>
>> 1. when we withdraw the PMD page during PUD THP split, we don’t need
>> to withdraw 513 pages, set up one PMD page, and then deposit 512 PTE pages
>> in that PMD page.
>>
>> 2. we don’t mix PMD page table pages and PTE page table pages in a single
>> list, since they are initialized in different ways. Otherwise, we need
>> to maintain a subtle rule in the single page table page list that in every
>> 513 pages, first one is PMD page table page and the rest are PTE page
>> table pages.
>>
>> As I am typing, I also realize that my current design does not work
>> when PMD split lock is disabled, so I will fix it. I would store PMD pages
>> and PTE pages in two separate lists in mm_struct.
>>
>>
>> Any comments?
>
> Okay, fair enough.
>
> Although, I think you can get away without a new data structure. We don't
> need a doubly linked list to deposit page tables. You can rework the PTE
> table deposit code to use a singly linked list with one pointer of ->lru
> (with a proper name) and make the PMD table deposit use the other one.
> This way you can avoid the conflict over ->lru.
>
> Does it make sense?

Yes. Thanks. Will do this in the next version. I think the singly linked list
from llist.h can be used.
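
A rough sketch of what that could look like (field and helper names below
are made up for illustration; "pud_pgtable_list" would be a new llist_head
in mm_struct, and locking is omitted):

    #include <linux/kernel.h>
    #include <linux/llist.h>
    #include <linux/mm_types.h>

    /* Deposit the PMD page table page of a PUD THP on a singly linked
     * list in mm_struct, spending only one pointer's worth of page->lru. */
    static void deposit_pud_pgtable(struct mm_struct *mm, struct page *pmd_page)
    {
        llist_add((struct llist_node *)&pmd_page->lru, &mm->pud_pgtable_list);
    }

    static struct page *withdraw_pud_pgtable(struct mm_struct *mm)
    {
        struct llist_node *node = llist_del_first(&mm->pud_pgtable_list);

        if (!node)
            return NULL;
        return container_of((struct list_head *)node, struct page, lru);
    }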

—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 06/16] mm: thp: add 1GB THP split_huge_pud_page() function.
  2020-09-02 18:06 ` [RFC PATCH 06/16] mm: thp: add 1GB THP split_huge_pud_page() function Zi Yan
@ 2020-09-09 14:18   ` Kirill A. Shutemov
  2020-09-09 14:19     ` Zi Yan
  0 siblings, 1 reply; 82+ messages in thread
From: Kirill A. Shutemov @ 2020-09-09 14:18 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Roman Gushchin, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

On Wed, Sep 02, 2020 at 02:06:18PM -0400, Zi Yan wrote:
>  25 files changed, 852 insertions(+), 98 deletions(-)

It's way too big to have meaningful review.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 06/16] mm: thp: add 1GB THP split_huge_pud_page() function.
  2020-09-09 14:18   ` Kirill A. Shutemov
@ 2020-09-09 14:19     ` Zi Yan
  0 siblings, 0 replies; 82+ messages in thread
From: Zi Yan @ 2020-09-09 14:19 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-mm, Roman Gushchin, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 308 bytes --]

On 9 Sep 2020, at 10:18, Kirill A. Shutemov wrote:

> On Wed, Sep 02, 2020 at 02:06:18PM -0400, Zi Yan wrote:
>>  25 files changed, 852 insertions(+), 98 deletions(-)
>
> It's way too big to have meaningful review.

Will split it into small patches in the next version.

—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH 05/16] mm: thp: handling 1GB THP reference bit.
  2020-09-09 14:09   ` Kirill A. Shutemov
@ 2020-09-09 14:36     ` Zi Yan
  0 siblings, 0 replies; 82+ messages in thread
From: Zi Yan @ 2020-09-09 14:36 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-mm, Roman Gushchin, Rik van Riel, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 7699 bytes --]

On 9 Sep 2020, at 10:09, Kirill A. Shutemov wrote:

> On Wed, Sep 02, 2020 at 02:06:17PM -0400, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> Add PUD-level TLB flush ops and teach page_vma_mapped_talk about 1GB
>> THPs.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>  arch/x86/include/asm/pgtable.h |  3 +++
>>  arch/x86/mm/pgtable.c          | 13 +++++++++++++
>>  include/linux/mmu_notifier.h   | 13 +++++++++++++
>>  include/linux/pgtable.h        | 14 ++++++++++++++
>>  include/linux/rmap.h           |  1 +
>>  mm/page_vma_mapped.c           | 33 +++++++++++++++++++++++++++++----
>>  mm/rmap.c                      | 12 +++++++++---
>>  7 files changed, 82 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
>> index 26255cac78c0..15334f5ba172 100644
>> --- a/arch/x86/include/asm/pgtable.h
>> +++ b/arch/x86/include/asm/pgtable.h
>> @@ -1127,6 +1127,9 @@ extern int pudp_test_and_clear_young(struct vm_area_struct *vma,
>>  extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
>>  				  unsigned long address, pmd_t *pmdp);
>>
>> +#define __HAVE_ARCH_PUDP_CLEAR_YOUNG_FLUSH
>> +extern int pudp_clear_flush_young(struct vm_area_struct *vma,
>> +				  unsigned long address, pud_t *pudp);
>>
>>  #define pmd_write pmd_write
>>  static inline int pmd_write(pmd_t pmd)
>> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
>> index 7be73aee6183..e4a2dffcc418 100644
>> --- a/arch/x86/mm/pgtable.c
>> +++ b/arch/x86/mm/pgtable.c
>> @@ -633,6 +633,19 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
>>
>>  	return young;
>>  }
>> +int pudp_clear_flush_young(struct vm_area_struct *vma,
>> +			   unsigned long address, pud_t *pudp)
>> +{
>> +	int young;
>> +
>> +	VM_BUG_ON(address & ~HPAGE_PUD_MASK);
>> +
>> +	young = pudp_test_and_clear_young(vma, address, pudp);
>> +	if (young)
>> +		flush_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
>> +
>> +	return young;
>> +}
>>  #endif
>>
>>  /**
>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>> index b8200782dede..4ffa179e654f 100644
>> --- a/include/linux/mmu_notifier.h
>> +++ b/include/linux/mmu_notifier.h
>> @@ -557,6 +557,19 @@ static inline void mmu_notifier_range_init_migrate(
>>  	__young;							\
>>  })
>>
>> +#define pudp_clear_flush_young_notify(__vma, __address, __pudp)		\
>> +({									\
>> +	int __young;							\
>> +	struct vm_area_struct *___vma = __vma;				\
>> +	unsigned long ___address = __address;				\
>> +	__young = pudp_clear_flush_young(___vma, ___address, __pudp);	\
>> +	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
>> +						  ___address,		\
>> +						  ___address +		\
>> +							PUD_SIZE);	\
>> +	__young;							\
>> +})
>> +
>>  #define ptep_clear_young_notify(__vma, __address, __ptep)		\
>>  ({									\
>>  	int __young;							\
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 255275d5b73e..8ef358c386af 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -240,6 +240,20 @@ static inline int pmdp_clear_flush_young(struct vm_area_struct *vma,
>>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>  #endif
>>
>> +#ifndef __HAVE_ARCH_PUDP_CLEAR_YOUNG_FLUSH
>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>> +extern int pudp_clear_flush_young(struct vm_area_struct *vma,
>> +				  unsigned long address, pud_t *pudp);
>> +#else
>> +int pudp_clear_flush_young(struct vm_area_struct *vma,
>> +				  unsigned long address, pud_t *pudp)
>> +{
>> +	BUILD_BUG();
>> +	return 0;
>> +}
>> +#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD  */
>> +#endif
>> +
>>  #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>  				       unsigned long address,
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index 3a6adfa70fb0..0af61dd193d2 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -206,6 +206,7 @@ struct page_vma_mapped_walk {
>>  	struct page *page;
>>  	struct vm_area_struct *vma;
>>  	unsigned long address;
>> +	pud_t *pud;
>>  	pmd_t *pmd;
>>  	pte_t *pte;
>>  	spinlock_t *ptl;
>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>> index 5e77b269c330..d9d39ec06e21 100644
>> --- a/mm/page_vma_mapped.c
>> +++ b/mm/page_vma_mapped.c
>> @@ -145,9 +145,12 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>  	struct page *page = pvmw->page;
>>  	pgd_t *pgd;
>>  	p4d_t *p4d;
>> -	pud_t *pud;
>> +	pud_t pude;
>>  	pmd_t pmde;
>>
>> +	if (!pvmw->pte && !pvmw->pmd && pvmw->pud)
>> +		return not_found(pvmw);
>> +
>>  	/* The only possible pmd mapping has been handled on last iteration */
>>  	if (pvmw->pmd && !pvmw->pte)
>>  		return not_found(pvmw);
>> @@ -174,10 +177,31 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>  	p4d = p4d_offset(pgd, pvmw->address);
>>  	if (!p4d_present(*p4d))
>>  		return false;
>> -	pud = pud_offset(p4d, pvmw->address);
>> -	if (!pud_present(*pud))
>> +	pvmw->pud = pud_offset(p4d, pvmw->address);
>> +
>> +	/*
>> +	 * Make sure the pud value isn't cached in a register by the
>> +	 * compiler and used as a stale value after we've observed a
>> +	 * subsequent update.
>> +	 */
>> +	pude = READ_ONCE(*pvmw->pud);
>> +	if (pud_trans_huge(pude)) {
>> +		pvmw->ptl = pud_lock(mm, pvmw->pud);
>> +		if (likely(pud_trans_huge(*pvmw->pud))) {
>> +			if (pvmw->flags & PVMW_MIGRATION)
>> +				return not_found(pvmw);
>> +			if (pud_page(*pvmw->pud) != page)
>> +				return not_found(pvmw);
>> +			return true;
>> +		} else {
>> +			/* THP pud was split under us: handle on pmd level */
>> +			spin_unlock(pvmw->ptl);
>> +			pvmw->ptl = NULL;
>
> Hm. What makes you sure the pmd table is established here?
>
> I have not looked at PUD THP handling of MADV_DONTNEED yet, but a PMD
> THP can become pmd_none() at any point (unless ptl is locked).

You are right. I need to check pud_present() here and only go to the pmd
level when the pud is present; otherwise just return not_found.

Thanks.
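
A minimal sketch of that re-check, based on the hunk quoted above (not the
actual follow-up patch; the exact placement may differ in the next version):

		} else {
			/* THP pud was split under us: handle on pmd level,
			 * but the whole pud may have been cleared meanwhile
			 * (e.g. by MADV_DONTNEED), so re-check it first.
			 */
			spin_unlock(pvmw->ptl);
			pvmw->ptl = NULL;
			if (!pud_present(READ_ONCE(*pvmw->pud)))
				return not_found(pvmw);
		}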


>
>> +		}
>> +	} else if (!pud_present(pude))
>>  		return false;
>> -	pvmw->pmd = pmd_offset(pud, pvmw->address);
>> +
>> +	pvmw->pmd = pmd_offset(pvmw->pud, pvmw->address);
>>  	/*
>>  	 * Make sure the pmd value isn't cached in a register by the
>>  	 * compiler and used as a stale value after we've observed a
>> @@ -213,6 +237,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>  	} else if (!pmd_present(pmde)) {
>>  		return false;
>>  	}
>> +
>>  	if (!map_pte(pvmw))
>>  		goto next_pte;
>>  	while (1) {
>> diff --git a/mm/rmap.c b/mm/rmap.c
>
> Why?
>

The extra newline? Will remove it.

>> index 10195a2421cf..77cec0658b76 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -803,9 +803,15 @@ static bool page_referenced_one(struct page *page, struct vm_area_struct *vma,
>>  					referenced++;
>>  			}
>>  		} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>> -			if (pmdp_clear_flush_young_notify(vma, address,
>> -						pvmw.pmd))
>> -				referenced++;
>> +			if (pvmw.pmd) {
>> +				if (pmdp_clear_flush_young_notify(vma, address,
>> +							pvmw.pmd))
>> +					referenced++;
>> +			} else if (pvmw.pud) {
>> +				if (pudp_clear_flush_young_notify(vma, address,
>> +							pvmw.pud))
>> +					referenced++;
>> +			}
>>  		} else {
>>  			/* unexpected pmd-mapped page? */
>>  			WARN_ON_ONCE(1);
>> -- 
>> 2.28.0
>>
>>
>
> -- 
>  Kirill A. Shutemov


—
Best Regards,
Yan Zi


* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-09 13:43                 ` David Hildenbrand
  2020-09-09 13:49                   ` Rik van Riel
@ 2020-09-10  7:32                   ` Michal Hocko
  2020-09-10  8:27                     ` David Hildenbrand
  2020-09-10 13:32                     ` Rik van Riel
  1 sibling, 2 replies; 82+ messages in thread
From: Michal Hocko @ 2020-09-10  7:32 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Rik van Riel, Zi Yan, Roman Gushchin, Kirill A. Shutemov,
	linux-mm, Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel, Vlastimil Babka,
	Mel Gorman

[Cc Vlastimil and Mel - the whole email thread starts
 http://lkml.kernel.org/r/20200902180628.4052244-1-zi.yan@sent.com
 but this particular subthread has diverged a bit and you might find it
 interesting]

On Wed 09-09-20 15:43:55, David Hildenbrand wrote:
> On 09.09.20 15:19, Rik van Riel wrote:
> > On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote:
> >> On Tue 08-09-20 10:41:10, Rik van Riel wrote:
> >>> On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:
> >>>
> >>>> A global knob is insufficient. 1G pages will become a very
> >>>> precious
> >>>> resource as it requires a pre-allocation (reservation). So it
> >>>> really
> >>>> has
> >>>> to be an opt-in and the question is whether there is also some
> >>>> sort
> >>>> of
> >>>> access control needed.
> >>>
> >>> The 1GB pages do not require that much in the way of
> >>> pre-allocation. The memory can be obtained through CMA,
> >>> which means it can be used for movable 4kB and 2MB
> >>> allocations when not
> >>> being used for 1GB pages.
> >>
> >> That CMA has to be pre-reserved, right? That requires a
> >> configuration.
> > 
> > To some extent, yes.
> > 
> > However, because that pool can be used for movable
> > 4kB and 2MB
> > pages as well as for 1GB pages, it would be easy to just set
> > the size of that pool to eg. 1/3 or even 1/2 of memory for every
> > system.
> > 
> > It isn't like the pool needs to be the exact right size. We
> > just need to avoid the "highmem problem" of having too little
> > memory for kernel allocations.
> > 
> 
> I am not sure I like the trend towards CMA that we are seeing, reserving
> huge buffers for specific users (and eventually even doing it
> automatically).
> 
> What we actually want is ZONE_MOVABLE with relaxed guarantees, such that
> anybody who requires large, unmovable allocations can use it.
> 
> I once played with the idea of having ZONE_PREFER_MOVABLE, which
> a) Is the primary choice for movable allocations
> b) Is allowed to contain unmovable allocations (esp., gigantic pages)
> c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead of
> running out of memory

I might be missing something, but how can this work long term? Or, put in
other words, why would this work any better than the existing fragmentation
avoidance techniques that the page allocator already implements - movability
grouping etc.? Please note that I am not deeply familiar with those, but my
high level understanding is that we already try hard not to mix movable and
unmovable objects in the same page blocks as much as we can.

My suspicion is that a separate zone would work in a similar fashion. As
long as there is a lot of free memory, the zone will be effectively MOVABLE.
Similar applies to the Normal zone while unmovable allocations are in the
minority. Once the Normal zone gets full of unmovable objects, they start
overflowing into ZONE_PREFER_MOVABLE, and it will resemble page block
stealing when unmovable objects start spreading over movable page blocks.

Again, my level of expertise in the page allocator is quite low, so all of
the above might be simply wrong...

> If someone messes up the zone ratio, issues known from zone imbalances
> are avoided - large allocations simply become less likely to succeed. In
> contrast to ZONE_MOVABLE, memory offlining is not guaranteed to work.
-- 
Michal Hocko
SUSE Labs



* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-10  7:32                   ` Michal Hocko
@ 2020-09-10  8:27                     ` David Hildenbrand
  2020-09-10 14:21                       ` Zi Yan
  2020-09-10 13:32                     ` Rik van Riel
  1 sibling, 1 reply; 82+ messages in thread
From: David Hildenbrand @ 2020-09-10  8:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rik van Riel, Zi Yan, Roman Gushchin, Kirill A. Shutemov,
	linux-mm, Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel, Vlastimil Babka,
	Mel Gorman

On 10.09.20 09:32, Michal Hocko wrote:
> [Cc Vlastimil and Mel - the whole email thread starts
>  http://lkml.kernel.org/r/20200902180628.4052244-1-zi.yan@sent.com
>  but this particular subthread has diverged a bit and you might find it
>  interesting]
> 
> On Wed 09-09-20 15:43:55, David Hildenbrand wrote:
>> On 09.09.20 15:19, Rik van Riel wrote:
>>> On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote:
>>>> On Tue 08-09-20 10:41:10, Rik van Riel wrote:
>>>>> On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:
>>>>>
>>>>>> A global knob is insufficient. 1G pages will become a very
>>>>>> precious
>>>>>> resource as it requires a pre-allocation (reservation). So it
>>>>>> really
>>>>>> has
>>>>>> to be an opt-in and the question is whether there is also some
>>>>>> sort
>>>>>> of
>>>>>> access control needed.
>>>>>
>>>>> The 1GB pages do not require that much in the way of
>>>>> pre-allocation. The memory can be obtained through CMA,
>>>>> which means it can be used for movable 4kB and 2MB
>>>>> allocations when not
>>>>> being used for 1GB pages.
>>>>
>>>> That CMA has to be pre-reserved, right? That requires a
>>>> configuration.
>>>
>>> To some extent, yes.
>>>
>>> However, because that pool can be used for movable
>>> 4kB and 2MB
>>> pages as well as for 1GB pages, it would be easy to just set
>>> the size of that pool to eg. 1/3 or even 1/2 of memory for every
>>> system.
>>>
>>> It isn't like the pool needs to be the exact right size. We
>>> just need to avoid the "highmem problem" of having too little
>>> memory for kernel allocations.
>>>
>>
>> I am not sure I like the trend towards CMA that we are seeing, reserving
>> huge buffers for specific users (and eventually even doing it
>> automatically).
>>
>> What we actually want is ZONE_MOVABLE with relaxed guarantees, such that
>> anybody who requires large, unmovable allocations can use it.
>>
>> I once played with the idea of having ZONE_PREFER_MOVABLE, which
>> a) Is the primary choice for movable allocations
>> b) Is allowed to contain unmovable allocations (esp., gigantic pages)
>> c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead of
>> running out of memory
> 
> I might be missing something but how can this work longterm? Or put in
> another words why would this work any better than existing fragmentation
> avoidance techniques that page allocator implements already - movability
> grouping etc. Please note that I am not deeply familiar with those but
> my high level understanding is that we already try hard to not mix
> movable and unmovable objects in same page blocks as much as we can.

Note that we group at pageblock granularity, which avoids fragmentation
at the pageblock level, not at anything bigger than that - especially not
for MAX_ORDER - 1 pages (e.g., on x86-64) or gigantic pages.

So once you run for some time on a system (especially thinking about
page shuffling *within* a zone), trying to allocate a gigantic page will
simply always fail - even if you always had plenty of free memory in
your single zone.
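
For a sense of scale (rough numbers, assuming 4 KiB base pages on x86-64):

    pageblock_order = 9    ->  pageblock size = 2^9  * 4 KiB = 2 MiB
    MAX_ORDER - 1   = 10   ->  largest buddy  = 2^10 * 4 KiB = 4 MiB
    1 GiB gigantic page    =  1 GiB / 2 MiB   = 512 contiguous pageblocks,
                              all of which must end up free/movable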

> 
> My suspicion is that a separate zone would work in a similar fashion. As
> long as there is a lot of free memory then zone will be effectively
> MOVABLE. Similar applies to normal zone when unmovable allocations are

Note the difference to MOVABLE: if you really want, you *can* put
unmovable allocations into that zone. So you can happily allocate gigantic
pages from it. Or anything else you like. As the name suggests, it merely
"prefers movable allocations".

> in minority. As long as the Normal zone gets full of unmovable objects
> they start overflowing to ZONE_PREFER_MOVABLE and it will resemble page
> block stealing when unmovable objects start spreading over movable page
> blocks.

Right, the long-term goal would be
1. To limit the chance of that happening. (e.g., size it in a way that's
safe for 99.9% of all setups, resize dynamically on demand)
2. To limit the physical area where that is happening (e.g., find lowest
possible pageblock etc.). That's more tricky but I consider this a pure
optimization on top.

As long as we stay in safe zone boundaries you get a benefit in most
scenarios. As soon as we would have a (temporary) workload that would
require more unmovable allocations we would fallback to polluting some
pageblocks only.
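
Put differently, the intended fallback order would be roughly (informal
sketch, not actual zonelist code):

    /*
     * movable allocation:    ZONE_PREFER_MOVABLE -> ZONE_NORMAL -> ...
     * unmovable allocation:  ZONE_NORMAL -> ZONE_PREFER_MOVABLE (instead of OOM)
     * gigantic pages:        taken from ZONE_PREFER_MOVABLE
     */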

> 
> Again, my level of expertise to page allocator is quite low so all the
> above might be simply wrong...

Same over here. I had this idea in my mind for quite a while but
obviously didn't get to figure out the details/implement yet - that's
why I decided to share the basic idea just now.

-- 
Thanks,

David / dhildenb




* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-09 13:27                 ` David Hildenbrand
@ 2020-09-10 10:02                   ` William Kucharski
  0 siblings, 0 replies; 82+ messages in thread
From: William Kucharski @ 2020-09-10 10:02 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jason Gunthorpe, Matthew Wilcox, Zi Yan, Roman Gushchin,
	Kirill A. Shutemov, linux-mm, Rik van Riel, Kirill A . Shutemov,
	Shakeel Butt, Yang Shi, David Nellans, linux-kernel



> On Sep 9, 2020, at 7:27 AM, David Hildenbrand <david@redhat.com> wrote:
> 
> On 09.09.20 15:14, Jason Gunthorpe wrote:
>> On Wed, Sep 09, 2020 at 01:32:44PM +0100, Matthew Wilcox wrote:
>> 
>>> But here's the thing ... we already allow
>>> 	mmap(MAP_POPULATE | MAP_HUGETLB | MAP_HUGE_1GB)
>>> 
>>> So if we're not doing THP, what's the point of this thread?
>> 
>> I wondered that too..
>> 
>>> An madvise flag is a different beast; that's just letting the kernel
>>> know what the app thinks its behaviour will be.  The kernel can pay
>> 
>> But madvise is too late, the VMA already has an address, if it is not
>> 1G aligned it cannot be 1G THP already.
> 
> That's why user space (like QEMU) is THP-aware and selects an address
> that is aligned to the expected THP granularity (e.g., 2MB on x86_64).


To me it's always seemed like there are two major divisions among THP use
cases:

1) Applications that KNOW they would benefit from use of THPs, so they
call madvise() with an appropriate parameter and explicitly inform the
kernel of such

2) Applications that know nothing about THP but there may be an
advantage that comes from "automatic" THP mapping when possible.

This is an approach that I am more familiar with that comes down to:

    1) Is a VMA properly aligned for a (whatever size) THP?

    2) Is the mapping request for a length >= (whatever size) THP?

    3) Let's try allocating memory to map the space using (whatever size)
       THP, and:

        -- If we succeed, great, awesome, let's do it.
        -- If not, no big deal, map using as large a page as we CAN get.
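
A rough sketch of that best-effort policy (the helpers try_alloc_huge() and
map_at_order() are made up purely for illustration, they are not existing
kernel APIs):

    struct page *page;

    /* Map with the largest page size the VMA allows, then fall back. */
    if (IS_ALIGNED(addr, PUD_SIZE) && len >= PUD_SIZE &&
        (page = try_alloc_huge(vma, HPAGE_PUD_ORDER)))
            return map_at_order(vma, addr, page, HPAGE_PUD_ORDER); /* 1 GB */
    if (IS_ALIGNED(addr, PMD_SIZE) && len >= PMD_SIZE &&
        (page = try_alloc_huge(vma, HPAGE_PMD_ORDER)))
            return map_at_order(vma, addr, page, HPAGE_PMD_ORDER); /* 2 MB */
    page = try_alloc_huge(vma, 0);
    return map_at_order(vma, addr, page, 0);                       /* base page */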

There of course are myriad performance implications to this. Processes
that start early after boot have a better chance of getting a THP,
but that also means frequently mapped large memory spaces have a better
chance of being mapped in a shared manner via a THP, e.g. libc, X servers
or Firefox/Chrome. It also means that processes that would be mapped
using THPs early in boot may not be if they should crash and need to be
restarted.

There are all sorts of tunables that would likely need to be in place to make
the second approach more viable, but I think it's certainly worth investigating.

The address selection you suggest is the basis of one of the patches I wrote
for a previous iteration of THP support (and that is in Matthew's THP tree)
that will try to round VM addresses to the proper alignment if possible so a
THP can then be used to map the area.






* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-10  7:32                   ` Michal Hocko
  2020-09-10  8:27                     ` David Hildenbrand
@ 2020-09-10 13:32                     ` Rik van Riel
  2020-09-10 14:30                       ` Zi Yan
  1 sibling, 1 reply; 82+ messages in thread
From: Rik van Riel @ 2020-09-10 13:32 UTC (permalink / raw)
  To: Michal Hocko, David Hildenbrand
  Cc: Zi Yan, Roman Gushchin, Kirill A. Shutemov, linux-mm,
	Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt, Yang Shi,
	David Nellans, linux-kernel, Vlastimil Babka, Mel Gorman

On Thu, 2020-09-10 at 09:32 +0200, Michal Hocko wrote:
> [Cc Vlastimil and Mel - the whole email thread starts
>  http://lkml.kernel.org/r/20200902180628.4052244-1-zi.yan@sent.com
>  but this particular subthread has diverged a bit and you might find
> it
>  interesting]
> 
> On Wed 09-09-20 15:43:55, David Hildenbrand wrote:
> > 
> > I am not sure I like the trend towards CMA that we are seeing,
> > reserving
> > huge buffers for specific users (and eventually even doing it
> > automatically).
> > 
> > What we actually want is ZONE_MOVABLE with relaxed guarantees, such
> > that
> > anybody who requires large, unmovable allocations can use it.
> > 
> > I once played with the idea of having ZONE_PREFER_MOVABLE, which
> > a) Is the primary choice for movable allocations
> > b) Is allowed to contain unmovable allocations (esp., gigantic
> > pages)
> > c) Is the fallback for ZONE_NORMAL for unmovable allocations,
> > instead of
> > running out of memory
> 
> I might be missing something but how can this work longterm? Or put
> in
> another words why would this work any better than existing
> fragmentation
> avoidance techniques that page allocator implements already - 

One big difference is reclaim. If ZONE_NORMAL runs low on
free memory, page reclaim would kick in and evict some
movable/reclaimable things, to free up more space for
unmovable allocations.

The current fragmentation avoidance techniques don't do
things like reclaim, or proactively migrating movable
pages out of unmovable page blocks to prevent unmovable
allocations in currently movable page blocks.

> My suspicion is that a separate zone would work in a similar fashion.
> As
> long as there is a lot of free memory then zone will be effectively
> MOVABLE. Similar applies to normal zone when unmovable allocations
> are
> in minority. As long as the Normal zone gets full of unmovable
> objects
> they start overflowing to ZONE_PREFER_MOVABLE and it will resemble
> page
> block stealing when unmovable objects start spreading over movable
> page
> blocks.

You are right, with the difference being reclaim and/or
migration, which could make a real difference in limiting
the number of pageblocks that have unmovable allocations.

-- 
All Rights Reversed.


* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-10  8:27                     ` David Hildenbrand
@ 2020-09-10 14:21                       ` Zi Yan
  2020-09-10 14:34                         ` David Hildenbrand
  0 siblings, 1 reply; 82+ messages in thread
From: Zi Yan @ 2020-09-10 14:21 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, Rik van Riel, Roman Gushchin, Kirill A. Shutemov,
	linux-mm, Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel, Vlastimil Babka,
	Mel Gorman

On 10 Sep 2020, at 4:27, David Hildenbrand wrote:

> On 10.09.20 09:32, Michal Hocko wrote:
>> [Cc Vlastimil and Mel - the whole email thread starts
>> http://lkml.kernel.org/r/20200902180628.4052244-1-zi.yan@sent.com
>>  but this particular subthread has diverged a bit and you might find it
>>  interesting]
>>
>> On Wed 09-09-20 15:43:55, David Hildenbrand wrote:
>>> On 09.09.20 15:19, Rik van Riel wrote:
>>>> On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote:
>>>>> On Tue 08-09-20 10:41:10, Rik van Riel wrote:
>>>>>> On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:
>>>>>>
>>>>>>> A global knob is insufficient. 1G pages will become a very
>>>>>>> precious
>>>>>>> resource as it requires a pre-allocation (reservation). So it
>>>>>>> really
>>>>>>> has
>>>>>>> to be an opt-in and the question is whether there is also some
>>>>>>> sort
>>>>>>> of
>>>>>>> access control needed.
>>>>>>
>>>>>> The 1GB pages do not require that much in the way of
>>>>>> pre-allocation. The memory can be obtained through CMA,
>>>>>> which means it can be used for movable 4kB and 2MB
>>>>>> allocations when not
>>>>>> being used for 1GB pages.
>>>>>
>>>>> That CMA has to be pre-reserved, right? That requires a
>>>>> configuration.
>>>>
>>>> To some extent, yes.
>>>>
>>>> However, because that pool can be used for movable
>>>> 4kB and 2MB
>>>> pages as well as for 1GB pages, it would be easy to just set
>>>> the size of that pool to eg. 1/3 or even 1/2 of memory for every
>>>> system.
>>>>
>>>> It isn't like the pool needs to be the exact right size. We
>>>> just need to avoid the "highmem problem" of having too little
>>>> memory for kernel allocations.
>>>>
>>>
>>> I am not sure I like the trend towards CMA that we are seeing, reserving
>>> huge buffers for specific users (and eventually even doing it
>>> automatically).
>>>
>>> What we actually want is ZONE_MOVABLE with relaxed guarantees, such that
>>> anybody who requires large, unmovable allocations can use it.
>>>
>>> I once played with the idea of having ZONE_PREFER_MOVABLE, which
>>> a) Is the primary choice for movable allocations
>>> b) Is allowed to contain unmovable allocations (esp., gigantic pages)
>>> c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead of
>>> running out of memory
>>
>> I might be missing something but how can this work longterm? Or put in
>> another words why would this work any better than existing fragmentation
>> avoidance techniques that page allocator implements already - movability
>> grouping etc. Please note that I am not deeply familiar with those but
>> my high level understanding is that we already try hard to not mix
>> movable and unmovable objects in same page blocks as much as we can.
>
> Note that we group in pageblock granularity, which avoids fragmentation
> on a pageblock level, not on anything bigger than that. Especially
> MAX_ORDER - 1 pages (e.g., on x86-64) and gigantic pages.
>
> So once you run for some time on a system (especially thinking about
> page shuffling *within* a zone), trying to allocate a gigantic page will
> simply always fail - even if you always had plenty of free memory in
> your single zone.
>
>>
>> My suspicion is that a separate zone would work in a similar fashion. As
>> long as there is a lot of free memory then zone will be effectively
>> MOVABLE. Similar applies to normal zone when unmovable allocations are
>
> Note the difference to MOVABLE: if you really want, you *can* put
> movable allocations into that zone. So you can happily allocate gigantic
> pages from it. Or anything else you like. As the name suggests "prefer
> movable allocations".
>
>> in minority. As long as the Normal zone gets full of unmovable objects
>> they start overflowing to ZONE_PREFER_MOVABLE and it will resemble page
>> block stealing when unmovable objects start spreading over movable page
>> blocks.
>
> Right, the long-term goal would be
> 1. To limit the chance of that happening. (e.g., size it in a way that's
> safe for 99.9% of all setups, resize dynamically on demand)
> 2. To limit the physical area where that is happening (e.g., find lowest
> possible pageblock etc.). That's more tricky but I consider this a pure
> optimization on top.
>
> As long as we stay in safe zone boundaries you get a benefit in most
> scenarios. As soon as we would have a (temporary) workload that would
> require more unmovable allocations we would fallback to polluting some
> pageblocks only.

The idea would work well until unmovable pages begin to overflow into
ZONE_PREFER_MOVABLE, or until we move the boundary of ZONE_PREFER_MOVABLE
to avoid unmovable page overflow. The issue comes from the lifetime of the
unmovable pages. Since some long-lived ones can sit around the boundary,
there is no guarantee that ZONE_PREFER_MOVABLE can grow back even after
other unmovable pages are deallocated. Ultimately, ZONE_PREFER_MOVABLE
would shrink to a small size and the situation would be back to what we
have now.

OK. I have a stupid question here. Why not just grow the pageblock to a
larger size, like 1GB? Then the fragmentation caused by unmovable pages
happens at a larger granularity, and it becomes less likely that unmovable
pages are allocated from a movable pageblock, since the kernel gets a whole
1GB pageblock for them after a single pageblock steal. If other kinds of
pageblocks run out, movable and reclaimable pages can still fall back to
unmovable pageblocks. What am I missing here?

Thanks.


—
Best Regards,
Yan Zi


* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-10 13:32                     ` Rik van Riel
@ 2020-09-10 14:30                       ` Zi Yan
  0 siblings, 0 replies; 82+ messages in thread
From: Zi Yan @ 2020-09-10 14:30 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Michal Hocko, David Hildenbrand, Roman Gushchin,
	Kirill A. Shutemov, linux-mm, Kirill A . Shutemov,
	Matthew Wilcox, Shakeel Butt, Yang Shi, David Nellans,
	linux-kernel, Vlastimil Babka, Mel Gorman

On 10 Sep 2020, at 9:32, Rik van Riel wrote:

> On Thu, 2020-09-10 at 09:32 +0200, Michal Hocko wrote:
>> [Cc Vlastimil and Mel - the whole email thread starts
>> http://lkml.kernel.org/r/20200902180628.4052244-1-zi.yan@sent.com
>>  but this particular subthread has diverged a bit and you might find
>> it
>>  interesting]
>>
>> On Wed 09-09-20 15:43:55, David Hildenbrand wrote:
>>>
>>> I am not sure I like the trend towards CMA that we are seeing,
>>> reserving
>>> huge buffers for specific users (and eventually even doing it
>>> automatically).
>>>
>>> What we actually want is ZONE_MOVABLE with relaxed guarantees, such
>>> that
>>> anybody who requires large, unmovable allocations can use it.
>>>
>>> I once played with the idea of having ZONE_PREFER_MOVABLE, which
>>> a) Is the primary choice for movable allocations
>>> b) Is allowed to contain unmovable allocations (esp., gigantic
>>> pages)
>>> c) Is the fallback for ZONE_NORMAL for unmovable allocations,
>>> instead of
>>> running out of memory
>>
>> I might be missing something but how can this work longterm? Or put
>> in
>> another words why would this work any better than existing
>> fragmentation
>> avoidance techniques that page allocator implements already -
>
> One big difference is reclaim. If ZONE_NORMAL runs low on
> free memory, page reclaim would kick in and evict some
> movable/reclaimable things, to free up more space for
> unmovable allocations.
>
> The current fragmentation avoidance techniques don't do
> things like reclaim, or proactively migrating movable
> pages out of unmovable page blocks to prevent unmovable
> allocations in currently movable page blocks.

Isn’t Mel Gorman’s watermark boost patch[1] (merged about a year ago)
doing what you are describing?


[1]https://lore.kernel.org/linux-mm/20181123114528.28802-1-mgorman@techsingularity.net/


—
Best Regards,
Yan Zi


* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-10 14:21                       ` Zi Yan
@ 2020-09-10 14:34                         ` David Hildenbrand
  2020-09-10 14:41                           ` Zi Yan
  0 siblings, 1 reply; 82+ messages in thread
From: David Hildenbrand @ 2020-09-10 14:34 UTC (permalink / raw)
  To: Zi Yan
  Cc: Michal Hocko, Rik van Riel, Roman Gushchin, Kirill A. Shutemov,
	linux-mm, Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel, Vlastimil Babka,
	Mel Gorman

>> As long as we stay in safe zone boundaries you get a benefit in most
>> scenarios. As soon as we would have a (temporary) workload that would
>> require more unmovable allocations we would fallback to polluting some
>> pageblocks only.
> 
> The idea would work well until unmoveable pages begin to overflow into
> ZONE_PREFER_MOVABLE or we move the boundary of ZONE_PREFER_MOVABLE to
> avoid unmoveable page overflow. The issue comes from the lifetime of
> the unmoveable pages. Since some long-live ones can be around the boundary,
> there is no guarantee that ZONE_PREFER_MOVABLE cannot grow back
> even if other unmoveable pages are deallocated. Ultimately,
> ZONE_PREFER_MOVABLE would be shrink to a small size and the situation is
> back to what we have now.

As discussed, this would not happen in the usual case if we size it
reasonably. Of course, if you push it to the extreme (which was never
suggested!), you would create a mess. There is always a way to create a
mess if you abuse such a mechanism. Also see Rik's reply regarding reclaim.

> 
> OK. I have a stupid question here. Why not just grow pageblock to a larger
> size, like 1GB? So the fragmentation of unmoveable pages will be at larger
> granularity. But it is less likely unmoveable pages will be allocated at
> a movable pageblock, since the kernel has 1GB pageblock for them after
> a pageblock stealing. If other kinds of pageblocks run out, moveable and
> reclaimable pages can fall back to unmoveable pageblocks.
> What am I missing here?

Oh no. For example pageblocks have to completely fit into a single
section (that's where metadata is maintained). Please refrain from
suggesting to increase the section size ;)

There is plenty of code relying on pageblocks/MAX_ORDER - 1 to be
reasonable in size. Examples in VMs are free page reporting or virtio-mem.
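
Roughly, the existing constraints look like this (simplified, assuming
SPARSEMEM):

    /*
     *   pageblock_order              <= MAX_ORDER - 1
     *   (MAX_ORDER - 1) + PAGE_SHIFT <= SECTION_SIZE_BITS
     *
     * i.e. a pageblock can never exceed the largest buddy order, and the
     * largest buddy block must fit into one sparsemem section.  A 1 GiB
     * pageblock (order 18 with 4 KiB pages) would therefore force sections
     * of at least 1 GiB.
     */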

-- 
Thanks,

David / dhildenb




* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-10 14:34                         ` David Hildenbrand
@ 2020-09-10 14:41                           ` Zi Yan
  2020-09-10 15:15                             ` David Hildenbrand
  0 siblings, 1 reply; 82+ messages in thread
From: Zi Yan @ 2020-09-10 14:41 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, Rik van Riel, Roman Gushchin, Kirill A. Shutemov,
	linux-mm, Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel, Vlastimil Babka,
	Mel Gorman

On 10 Sep 2020, at 10:34, David Hildenbrand wrote:

>>> As long as we stay in safe zone boundaries you get a benefit in most
>>> scenarios. As soon as we would have a (temporary) workload that would
>>> require more unmovable allocations we would fallback to polluting some
>>> pageblocks only.
>>
>> The idea would work well until unmoveable pages begin to overflow into
>> ZONE_PREFER_MOVABLE or we move the boundary of ZONE_PREFER_MOVABLE to
>> avoid unmoveable page overflow. The issue comes from the lifetime of
>> the unmoveable pages. Since some long-live ones can be around the boundary,
>> there is no guarantee that ZONE_PREFER_MOVABLE cannot grow back
>> even if other unmoveable pages are deallocated. Ultimately,
>> ZONE_PREFER_MOVABLE would be shrink to a small size and the situation is
>> back to what we have now.
>
> As discussed this would not happen in the usual case in case we size it
> reasonable. Of course, if you push it to the extreme (which was never
> suggested!), you would create mess. There is always a way to create a
> mess if you abuse such mechanism. Also see Rik's reply regarding reclaim.
>
>>
>> OK. I have a stupid question here. Why not just grow pageblock to a larger
>> size, like 1GB? So the fragmentation of unmoveable pages will be at larger
>> granularity. But it is less likely unmoveable pages will be allocated at
>> a movable pageblock, since the kernel has 1GB pageblock for them after
>> a pageblock stealing. If other kinds of pageblocks run out, moveable and
>> reclaimable pages can fall back to unmoveable pageblocks.
>> What am I missing here?
>
> Oh no. For example pageblocks have to completely fit into a single
> section (that's where metadata is maintained). Please refrain from
> suggesting to increase the section size ;)

Thank you for the explanation. I have no idea about the restrictions on
pageblock and section. Out of curiosity, what prevents the growth of
the section size?

—
Best Regards,
Yan Zi


* Re: [RFC PATCH 00/16] 1GB THP support on x86_64
  2020-09-10 14:41                           ` Zi Yan
@ 2020-09-10 15:15                             ` David Hildenbrand
  0 siblings, 0 replies; 82+ messages in thread
From: David Hildenbrand @ 2020-09-10 15:15 UTC (permalink / raw)
  To: Zi Yan
  Cc: Michal Hocko, Rik van Riel, Roman Gushchin, Kirill A. Shutemov,
	linux-mm, Kirill A . Shutemov, Matthew Wilcox, Shakeel Butt,
	Yang Shi, David Nellans, linux-kernel, Vlastimil Babka,
	Mel Gorman

On 10.09.20 16:41, Zi Yan wrote:
> On 10 Sep 2020, at 10:34, David Hildenbrand wrote:
> 
>>>> As long as we stay in safe zone boundaries you get a benefit in most
>>>> scenarios. As soon as we would have a (temporary) workload that would
>>>> require more unmovable allocations we would fallback to polluting some
>>>> pageblocks only.
>>>
>>> The idea would work well until unmoveable pages begin to overflow into
>>> ZONE_PREFER_MOVABLE or we move the boundary of ZONE_PREFER_MOVABLE to
>>> avoid unmoveable page overflow. The issue comes from the lifetime of
>>> the unmoveable pages. Since some long-live ones can be around the boundary,
>>> there is no guarantee that ZONE_PREFER_MOVABLE cannot grow back
>>> even if other unmoveable pages are deallocated. Ultimately,
>>> ZONE_PREFER_MOVABLE would be shrink to a small size and the situation is
>>> back to what we have now.
>>
>> As discussed this would not happen in the usual case in case we size it
>> reasonable. Of course, if you push it to the extreme (which was never
>> suggested!), you would create mess. There is always a way to create a
>> mess if you abuse such mechanism. Also see Rik's reply regarding reclaim.
>>
>>>
>>> OK. I have a stupid question here. Why not just grow pageblock to a larger
>>> size, like 1GB? So the fragmentation of unmoveable pages will be at larger
>>> granularity. But it is less likely unmoveable pages will be allocated at
>>> a movable pageblock, since the kernel has 1GB pageblock for them after
>>> a pageblock stealing. If other kinds of pageblocks run out, moveable and
>>> reclaimable pages can fall back to unmoveable pageblocks.
>>> What am I missing here?
>>
>> Oh no. For example pageblocks have to completely fit into a single
>> section (that's where metadata is maintained). Please refrain from
>> suggesting to increase the section size ;)
> 
> Thank you for the explanation. I have no idea about the restrictions on
> pageblock and section. Out of curiosity, what prevents the growth of
> the section size?

The section size (and based on that the Linux memory block size) defines
- the minimum size in which we can add_memory()
- the alignment requirement in which we can add_memory()

This is applicable
- in physical environments, where the bios will decide where to place
  DIMMs/NVDIMMs. The coarser the granularity, the less memory we might
  be able to make use of in corner cases.
- in virtualized environments, where we want to add memory in fairly
  small granularity. The coarser the granularity, the less flexibility
  we have.

arm64 has a section size of 1GB (and a THP/MAX_ORDER - 1 size of 512MB
with 64k base pages :/ ). That already turned out to be a problem - see
[1] regarding thoughts on how to shrink the section size. I once read
about thoughts of switching to 2MB THP on arm64 with any base page size,
not sure if that will become real at one point (and we might be able to
reduce the pageblock size there as well ... )

[1]
https://lkml.kernel.org/r/AM6PR08MB40690714A2E77A7128B2B2ADF7700@AM6PR08MB4069.eurprd08.prod.outlook.com
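
For reference, those section sizes follow directly from SECTION_SIZE_BITS
(values as of this kernel generation; they may change):

    section size = 1 << SECTION_SIZE_BITS
      x86-64:  SECTION_SIZE_BITS = 27  ->  128 MiB sections
      arm64:   SECTION_SIZE_BITS = 30  ->    1 GiB sections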

> 
> —
> Best Regards,
> Yan Zi
> 


-- 
Thanks,

David / dhildenb




end of thread, other threads:[~2020-09-10 15:15 UTC | newest]

Thread overview: 82+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-02 18:06 [RFC PATCH 00/16] 1GB THP support on x86_64 Zi Yan
2020-09-02 18:06 ` [RFC PATCH 01/16] mm: add pagechain container for storing multiple pages Zi Yan
2020-09-02 20:29   ` Randy Dunlap
2020-09-02 20:48     ` Zi Yan
2020-09-03  3:15   ` Matthew Wilcox
2020-09-07 12:22   ` Kirill A. Shutemov
2020-09-07 15:11     ` Zi Yan
2020-09-09 13:46       ` Kirill A. Shutemov
2020-09-09 14:15         ` Zi Yan
2020-09-02 18:06 ` [RFC PATCH 02/16] mm: thp: 1GB anonymous page implementation Zi Yan
2020-09-02 18:06 ` [RFC PATCH 03/16] mm: proc: add 1GB THP kpageflag Zi Yan
2020-09-09 13:46   ` Kirill A. Shutemov
2020-09-02 18:06 ` [RFC PATCH 04/16] mm: thp: 1GB THP copy on write implementation Zi Yan
2020-09-02 18:06 ` [RFC PATCH 05/16] mm: thp: handling 1GB THP reference bit Zi Yan
2020-09-09 14:09   ` Kirill A. Shutemov
2020-09-09 14:36     ` Zi Yan
2020-09-02 18:06 ` [RFC PATCH 06/16] mm: thp: add 1GB THP split_huge_pud_page() function Zi Yan
2020-09-09 14:18   ` Kirill A. Shutemov
2020-09-09 14:19     ` Zi Yan
2020-09-02 18:06 ` [RFC PATCH 07/16] mm: stats: make smap stats understand PUD THPs Zi Yan
2020-09-02 18:06 ` [RFC PATCH 08/16] mm: page_vma_walk: teach it about PMD-mapped PUD THP Zi Yan
2020-09-02 18:06 ` [RFC PATCH 09/16] mm: thp: 1GB THP support in try_to_unmap() Zi Yan
2020-09-02 18:06 ` [RFC PATCH 10/16] mm: thp: split 1GB THPs at page reclaim Zi Yan
2020-09-02 18:06 ` [RFC PATCH 11/16] mm: thp: 1GB THP follow_p*d_page() support Zi Yan
2020-09-02 18:06 ` [RFC PATCH 12/16] mm: support 1GB THP pagemap support Zi Yan
2020-09-02 18:06 ` [RFC PATCH 13/16] mm: thp: add a knob to enable/disable 1GB THPs Zi Yan
2020-09-02 18:06 ` [RFC PATCH 14/16] mm: page_alloc: >=MAX_ORDER pages allocation an deallocation Zi Yan
2020-09-02 18:06 ` [RFC PATCH 15/16] hugetlb: cma: move cma reserve function to cma.c Zi Yan
2020-09-02 18:06 ` [RFC PATCH 16/16] mm: thp: use cma reservation for pud thp allocation Zi Yan
2020-09-02 18:40 ` [RFC PATCH 00/16] 1GB THP support on x86_64 Jason Gunthorpe
2020-09-02 18:45   ` Zi Yan
2020-09-02 18:48     ` Jason Gunthorpe
2020-09-02 19:05       ` Zi Yan
2020-09-02 19:57         ` Jason Gunthorpe
2020-09-02 20:29           ` Zi Yan
2020-09-03 16:40             ` Jason Gunthorpe
2020-09-03 16:55               ` Matthew Wilcox
2020-09-03 17:08                 ` Jason Gunthorpe
2020-09-03  7:32 ` Michal Hocko
2020-09-03 16:25   ` Roman Gushchin
2020-09-03 16:50     ` Jason Gunthorpe
2020-09-03 17:01       ` Matthew Wilcox
2020-09-03 17:18         ` Jason Gunthorpe
2020-09-03 20:57     ` Mike Kravetz
2020-09-03 21:06       ` Roman Gushchin
2020-09-04  7:42     ` Michal Hocko
2020-09-04 21:10       ` Roman Gushchin
2020-09-07  7:20         ` Michal Hocko
2020-09-08 15:09           ` Zi Yan
2020-09-08 19:58             ` Roman Gushchin
2020-09-09  4:01               ` John Hubbard
2020-09-09  7:15               ` Michal Hocko
2020-09-03 14:23 ` Kirill A. Shutemov
2020-09-03 16:30   ` Roman Gushchin
2020-09-08 11:57     ` David Hildenbrand
2020-09-08 14:05       ` Zi Yan
2020-09-08 14:22         ` David Hildenbrand
2020-09-08 15:36           ` Zi Yan
2020-09-08 14:27         ` Matthew Wilcox
2020-09-08 15:50           ` Zi Yan
2020-09-09 12:11           ` Jason Gunthorpe
2020-09-09 12:32             ` Matthew Wilcox
2020-09-09 13:14               ` Jason Gunthorpe
2020-09-09 13:27                 ` David Hildenbrand
2020-09-10 10:02                   ` William Kucharski
2020-09-08 14:35         ` Michal Hocko
2020-09-08 14:41           ` Rik van Riel
2020-09-08 15:02             ` David Hildenbrand
2020-09-09  7:04             ` Michal Hocko
2020-09-09 13:19               ` Rik van Riel
2020-09-09 13:43                 ` David Hildenbrand
2020-09-09 13:49                   ` Rik van Riel
2020-09-09 13:54                     ` David Hildenbrand
2020-09-10  7:32                   ` Michal Hocko
2020-09-10  8:27                     ` David Hildenbrand
2020-09-10 14:21                       ` Zi Yan
2020-09-10 14:34                         ` David Hildenbrand
2020-09-10 14:41                           ` Zi Yan
2020-09-10 15:15                             ` David Hildenbrand
2020-09-10 13:32                     ` Rik van Riel
2020-09-10 14:30                       ` Zi Yan
2020-09-09 13:59                 ` Michal Hocko
